LinuxQuestions.org
Review your favorite Linux distribution.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 03-08-2012, 02:47 PM   #1
GamezR2EZ
Member
 
Registered: Jan 2009
Posts: 31

Rep: Reputation: 4
Compare multidimensional array; vote on bytes to use


I am comparing sections of data that is assumed to be corrupt and I need to get the most statistically probably data out of it.

I have a multidimensional array filed with a set amount of data in the 2nd dimension. I am comparing the data 1 byte at a time. I need a way to keep track of "votes" for each byte.

While I am working with C, any language would be helpful, as I do not see a direction to start working in my head currently.

Here is what I have so far:
Code:
  char vote[128][256] = { 0 };

  //char dataArray gets filled in with info

  for(i=0; i<3; i++){
    for(j=0; j<128; j++){
      vote[j][dataArray[i][j]]++;
    }
  }
Now this works at the basic level. It will vote on each byte in the array.

The issue is that I then have to go back and compare the vote array to find which byte as the most votes and recreate the 128 byte section based on those votes. I can't figure that code out.

This is not memory efficient at all. I could use some direction.

Possibly relevant, operating system is Linux based.


---Solution

Code:
  //assuming 4096 bytes of data
  for(i=0; i<4096; i++){
    for(j=0; j<((argc/2)-1); j++){
      //Vote increments the array by 1
      //vote[byte number in array][actual byte from data]
      vote[i][dataArray[j][i]]++;
    }
  }
  for(i=0; i<4096; i++){
    votes = 0;
    char byte;
    for(j=0; j<256; j++){
      //If data byte has higher number of votes that current highest votes..
      if(vote[i][j]>votes){
        //set votes to new high number
	votes = vote[i][j];
        //set byte to current most probable byte
	byte = j;
      }
    }

    //get required number of matches from arguments
    num = strtol(argv[1], NULL, 10);
    
    if(votes>=num){
      //write statistically likely byte to file
      fwrite(&byte, sizeof(char), sizeof(char), final);
    } else {
      //Break the loop if the required number of matches isn't found on one byte
      printf("Could not find %d matches for byte %d\n", num, (i+1));
      break;
    }
  }
Nothing like posting on a forum to make you figure out a solution quicker.
That and stepping away from the problem for a few hours.

Last edited by GamezR2EZ; 03-09-2012 at 01:49 PM.
 
Old 03-09-2012, 08:13 AM   #2
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
You stated the requirement unclearly, so I will try to restate it.

You have X versions of a chunk of data Y bytes long and at each position across Y you want to select the value based on maximum popularity across X.

The data structures are obviously simpler if you have an outer loop over Y and an inner loop over X. But for large enough X*Y that would cause cache performance issues or even run into virtual memory limits. You seem to want an outer loop over X (inner over Y) maybe for those good reasons or maybe because you didn't realize the opposite would be simpler.

If X is quite large, it may be efficient to accumulate across all possible values of the data type (256 possible values of byte) as you did in your sample code. But for small X, that is very inefficient. You gave 4 as an example of X.

Consider:
Code:
Loop over Y
  Start a temp vector T
  Loop over X
    Append data[x][y] to T
  Sort T (call some standard sort function)
  Loop over T finding the longest run of matching values.
You also implied the data is not so badly corrupted that it cannot be reconstructed statistically. That implies a fairly high match rate. That implies a wide range of X where X is large enough that an associative container is better than the instance vector I suggested above, but X is not so large that the count vector you suggested is better than an associative container. Using an associative container is obviously easier in C++ than in C, but is possible in almost any language.

For very large Y, the performance vs. complexity tradeoff may make the choice of an associative container important.

If your 4x128 with 256 possible values was real rather than example, don't worry about any time, size, cache, or other performance issues. 128x256 is too small to care that it may be wildly inefficient.

Quote:
Originally Posted by GamezR2EZ View Post
The issue is that I then have to go back and compare the vote array to find which byte as the most votes and recreate the 128 byte section based on those votes. I can't figure that code out.
Why is that hard? Outer loop over the 128 dimension, and inner loop over the 256 dimension to find the position of the largest number of those 256 numbers. I'm sure you understand the position where you found the largest count is the byte value to "reconstruct". I can't imagine it is hard to find the largest of 256 numbers.

Last edited by johnsfine; 03-09-2012 at 08:32 AM.
 
1 members found this post helpful.
Old 03-09-2012, 10:19 AM   #3
GamezR2EZ
Member
 
Registered: Jan 2009
Posts: 31

Original Poster
Rep: Reputation: 4
Thank you for the response. I woke up this morning and had a much better idea of how to do what I needed. I am currently trying to implement the method you showed and I am making progress. I do have a working method in place, but you are correct, there is a high match rate.

The number of X chunks is variable, but will most likely be 512 or 4096. In either case, this number will be dynamic passed as an argument.

What I currently have working is below:

Code:
  for(i=0; i<4096; i++){
    for(j=0; j<((argc/2)-1); j++){
      vote[i][dataArray[j][i]]++;
    }
  }
  //assuming 4096 until code works
  for(i=0; i<4096; i++){
    for(j=0; j<256; j++){
      if(vote[i][j]>votes){
	votes = vote[i][j];
	byte = j;
      }
    }

    //get required number of matches from arguments
    num = strtol(argv[1], NULL, 10);
    
    if(votes>=num){
      //write statistically likely byte to file
      fwrite(&byte, sizeof(char), sizeof(char), final);
    } else {
      printf("Could not find %d matches for byte %d\n", num, (i+1));
      break;
    }
    votes = 0;
    byte = 0;
  }
And you are right, "find which byte has the most votes" was simple. I had just been staring at the same code for to long.

I am working on an associative container as you suggested now.

Thanks
 
Old 03-09-2012, 12:42 PM   #4
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
You left out the line that sets votes to zero within the loop.

Code:
 
...
  for(i=0; i<4096; i++){
    votes = 0;
    for(j=0; j<256; j++){
      if(vote[i][j]>votes){
...
In C99 or C++ it is also better practice to declare variables where initialized, rather than in an outer scope. Narrower scopes leads to more maintainable code. One of the flaws with ANSI C is that it doesn't allow all the narrower scopes one would want.

Code:
 
...
  for(int i=0; i<4096; i++){
    int votes = 0;
    char byte;
    for(int j=0; j<256; j++){
      if(vote[i][j]>votes){
...

Last edited by johnsfine; 03-09-2012 at 12:48 PM.
 
Old 03-09-2012, 01:38 PM   #5
GamezR2EZ
Member
 
Registered: Jan 2009
Posts: 31

Original Poster
Rep: Reputation: 4
votes was set to 0 at the end of the loop. I am sure that is not best practice, but it is the logical flow in my head. I expect the variable to be initialized, or set to a value in some cases.

I tend to abuse variables by reusing them in weird ways (or so I have been told, it is normal ways for me). The code I write tends works, but it isn't always the "best practice" way of doing things simply because I do not know best practice.

I just do this as a hobby when I need a simple program like this. I will try to implement that in the future.

I am going to go ahead and mark this solved and update the top post. I will update again when I have the second solution we discussed ready.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
perl - multidimensional array toredo Programming 3 06-11-2010 08:29 AM
Perl multidimensional array zhjim Programming 9 08-21-2009 09:59 AM
How to allocate memory for a multidimensional array ? indian Programming 12 11-19-2004 03:37 AM
multidimensional array (php) niehls Programming 2 01-31-2003 05:34 AM
PHP: Multidimensional array problem lhoff Programming 1 06-10-2002 02:49 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 03:06 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration