[SOLVED] Compare multidimensional array; vote on bytes to use

GamezR2EZ · 03-08-2012, 02:47 PM

I am comparing sections of data that is assumed to be corrupt and I need to get the most statistically probably data out of it.

I have a multidimensional array filed with a set amount of data in the 2nd dimension. I am comparing the data 1 byte at a time. I need a way to keep track of "votes" for each byte.

While I am working with C, any language would be helpful, as I do not see a direction to start working in my head currently.

Here is what I have so far:

Code:

  char vote[128][256] = { 0 };

  //char dataArray gets filled in with info

  for(i=0; i<3; i++){
    for(j=0; j<128; j++){
      vote[j][dataArray[i][j]]++;
    }
  }

Now this works at the basic level. It will vote on each byte in the array.

The issue is that I then have to go back and compare the vote array to find which byte as the most votes and recreate the 128 byte section based on those votes. I can't figure that code out.

This is not memory efficient at all. I could use some direction.

Possibly relevant, operating system is Linux based.

---Solution

Code:

  //assuming 4096 bytes of data
  for(i=0; i<4096; i++){
    for(j=0; j<((argc/2)-1); j++){
      //Vote increments the array by 1
      //vote[byte number in array][actual byte from data]
      vote[i][dataArray[j][i]]++;
    }
  }
  for(i=0; i<4096; i++){
    votes = 0;
    char byte;
    for(j=0; j<256; j++){
      //If data byte has higher number of votes that current highest votes..
      if(vote[i][j]>votes){
        //set votes to new high number
	votes = vote[i][j];
        //set byte to current most probable byte
	byte = j;
      }
    }

    //get required number of matches from arguments
    num = strtol(argv[1], NULL, 10);
    
    if(votes>=num){
      //write statistically likely byte to file
      fwrite(&byte, sizeof(char), sizeof(char), final);
    } else {
      //Break the loop if the required number of matches isn't found on one byte
      printf("Could not find %d matches for byte %d\n", num, (i+1));
      break;
    }
  }

Nothing like posting on a forum to make you figure out a solution quicker.
That and stepping away from the problem for a few hours.

johnsfine · 03-09-2012, 08:13 AM

You stated the requirement unclearly, so I will try to restate it.

You have X versions of a chunk of data Y bytes long and at each position across Y you want to select the value based on maximum popularity across X.

The data structures are obviously simpler if you have an outer loop over Y and an inner loop over X. But for large enough X*Y that would cause cache performance issues or even run into virtual memory limits. You seem to want an outer loop over X (inner over Y) maybe for those good reasons or maybe because you didn't realize the opposite would be simpler.

If X is quite large, it may be efficient to accumulate across all possible values of the data type (256 possible values of byte) as you did in your sample code. But for small X, that is very inefficient. You gave 4 as an example of X.

Consider:

Code:

Loop over Y
  Start a temp vector T
  Loop over X
    Append data[x][y] to T
  Sort T (call some standard sort function)
  Loop over T finding the longest run of matching values.

You also implied the data is not so badly corrupted that it cannot be reconstructed statistically. That implies a fairly high match rate. That implies a wide range of X where X is large enough that an associative container is better than the instance vector I suggested above, but X is not so large that the count vector you suggested is better than an associative container. Using an associative container is obviously easier in C++ than in C, but is possible in almost any language.

For very large Y, the performance vs. complexity tradeoff may make the choice of an associative container important.

If your 4x128 with 256 possible values was real rather than example, don't worry about any time, size, cache, or other performance issues. 128x256 is too small to care that it may be wildly inefficient.

Quote:

Originally Posted by GamezR2EZ

The issue is that I then have to go back and compare the vote array to find which byte as the most votes and recreate the 128 byte section based on those votes. I can't figure that code out.

Why is that hard? Outer loop over the 128 dimension, and inner loop over the 256 dimension to find the position of the largest number of those 256 numbers. I'm sure you understand the position where you found the largest count is the byte value to "reconstruct". I can't imagine it is hard to find the largest of 256 numbers.

GamezR2EZ · 03-09-2012, 10:19 AM

Thank you for the response. I woke up this morning and had a much better idea of how to do what I needed. I am currently trying to implement the method you showed and I am making progress. I do have a working method in place, but you are correct, there is a high match rate.

The number of X chunks is variable, but will most likely be 512 or 4096. In either case, this number will be dynamic passed as an argument.

What I currently have working is below:

Code:

  for(i=0; i<4096; i++){
    for(j=0; j<((argc/2)-1); j++){
      vote[i][dataArray[j][i]]++;
    }
  }
  //assuming 4096 until code works
  for(i=0; i<4096; i++){
    for(j=0; j<256; j++){
      if(vote[i][j]>votes){
	votes = vote[i][j];
	byte = j;
      }
    }

    //get required number of matches from arguments
    num = strtol(argv[1], NULL, 10);
    
    if(votes>=num){
      //write statistically likely byte to file
      fwrite(&byte, sizeof(char), sizeof(char), final);
    } else {
      printf("Could not find %d matches for byte %d\n", num, (i+1));
      break;
    }
    votes = 0;
    byte = 0;
  }

And you are right, "find which byte has the most votes" was simple. I had just been staring at the same code for to long.

I am working on an associative container as you suggested now.

Thanks

johnsfine · 03-09-2012, 12:42 PM

You left out the line that sets votes to zero within the loop.

Code:

 
...
  for(i=0; i<4096; i++){
    votes = 0;
    for(j=0; j<256; j++){
      if(vote[i][j]>votes){
...

In C99 or C++ it is also better practice to declare variables where initialized, rather than in an outer scope. Narrower scopes leads to more maintainable code. One of the flaws with ANSI C is that it doesn't allow all the narrower scopes one would want.

Code:

 
...
  for(int i=0; i<4096; i++){
    int votes = 0;
    char byte;
    for(int j=0; j<256; j++){
      if(vote[i][j]>votes){
...

GamezR2EZ · 03-09-2012, 01:38 PM

votes was set to 0 at the end of the loop. I am sure that is not best practice, but it is the logical flow in my head. I expect the variable to be initialized, or set to a value in some cases.

I tend to abuse variables by reusing them in weird ways (or so I have been told, it is normal ways for me). The code I write tends works, but it isn't always the "best practice" way of doing things simply because I do not know best practice.

I just do this as a hobby when I need a simple program like this. I will try to implement that in the future.

I am going to go ahead and mark this solved and update the top post. I will update again when I have the second solution we discussed ready.