help extracting data from csv file
I have several csv files that contain anywhere from 70,000 to 150,000 rows of data (only one column) in integer form. I need to extract every two integers (pairing) that differs by between 57.9 and 58.1 from one another and place them in another file. All integers that do not have another integer that differs by said amount is unimportant. I would like to have a script to find these pairings within a csv file and another script to find these pairings between two csv files. Your help is greatly appreciated. Thank you for your time.
Will |
An example data set here would help a lot.
Are we talking Code:
A Code:
file1 file2 And you don't care if the difference is greater or less than those values but only in that range... right? |
Correct. Has to be in that range. And the files are just like that with integers to 4 decimal places. Thanks.
|
Note that the two file example is actually fairly simple.
Code:
paste $file1 $file2 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3 $file1 = the first file with data in it. $file2 = the second file with data in it. $file3 = the new file listing the pairs in csv form two columns. If they're in the one file form... there is a way to do it fairly simply (the awk part is the same) but it is slipping my mind at the moment. Edit: Ah, it's a classic sed one-liner. Code:
sed '$!N;s/\n/ /' $file1 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3 Edit2: Or you can use paste again Code:
paste -s -d '\t\n' $file1 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3 |
I need to clarify the file formats:
There are 70,000 cells in one column. I need to check each cell against every other cell for the above mentioned difference. They are not already paired up. So, cell A1 may be 58 apart from cell A125 or something like that. Thanks for the help and prompt response. I apologize for my lack of clarity. Will |
You'll have to provide me with an example because I have no idea what you're talking about. For one thing, how do you know that A1 and A125 are related? Are we talking more than one column or not? What is the relation between these "fields" and how do you determine what is what? This is why I needed to clarify what the files looked like, although I may have not been clear when I gave my prototype examples.
This may be slightly more complicated than the above but probably not to the point where you need to go beyond the existant tools. |
Now, when you say that all other values are unimportant, do you mean we toss all the values between them or do we need to check those values for pairings as we recurse through it all? The problem I'm having is that I'm unclear on the specs.
|
So if there is an integer in these 70,000 integers that doesn't differ from another integer by 57.9 to 58.1, then I don't need it. So, every integer that does partner with any other integer should be paired with that other integer in A, B format if possible. I know I'm confusing, but english really is my first language :)
|
And yes, check all integers for partners.
|
Does the order of the values matter? Can a value be used more than once? In the two file example, does appending the file change the results you'd expect from interleaving it?
For example would: Code:
file1 file2 Code:
file1 |
Note: The following code assumes that once you've used a value you don't want it to be used again. Remove the line with the =99999.999999 in it, if you want values to be reused. Save the following as compare.awk
Code:
# This is a function to compare each new value with all the others You call it for one file with: Code:
awk -f compare.awk $file1 > $file3 Code:
paste $file1 $file2 | awk -f compare.awk > $file3 Code:
awk -f compare.awk $file1 $file2 > $file3 |
All times are GMT -5. The time now is 03:38 PM. |