bash script: merge files comparing columns...
hi,
i have 2 very long files which are quite similar: file1.csv: Code:
alimentatore1,batteria1,serie1,modello1 Code:
alimentatore1,batteria1,serie1,modello1 i need this output: Code:
alimentatore1,batteria1,serie1,modello1 if the fields 3 (serie) and 4 (modello) are identical, the lines of the 2 files should be "added": Code:
,batteria2,serie2,modello4 + alimentatore2,,serie2,modello4 = alimentatore2,batteria2,serie2,modello4 Code:
alimentatore3,batteria3,serie3,modello5 thanks |
For large files, bash is slow and awk would be a better choice -- both for performance and because the language is naturally suited to sequential reading.
|
what's wrong with the command "join"?
|
Assuming the CSV files are sorted by the third and fourth fields ascending, then this awk script should be pretty fast:
Code:
awk -v "file=file1.csv" ' For me, this seems to reproduce your desired output. |
Quote:
The one from GNU Coreutils 7.4 won't join on more than one field. The nearest I got was with Code:
join -a 1 -a 2 -j 3 -t , -o '1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4' file1.csv file2.csv | \ |
Here's an alternative awk. I tried Nominal's but found that when columns 3 or 4 are blank you get some weirdness.
Code:
#!/usr/bin/awk -f Code:
./script.awk -vfile="file1" file2 |
what can i say?
just amazing! :) thank you guys, your code perfectly works... another hint, please: what if i want report lines in wich fields 3 and 4 are same and other fields are not? i.e.: file1.csv alimentatore1;batteria1;serie1;modello1 alimentatore1;batteria1;serie1;modello2 ;batteria2;serie2;modello3 alimentatore3;batteria3;serie3;modello5 file1.csv alimentatore2;batteria1;serie1;modello1 alimentatore2;;serie2;modello3 ;batteria2;serie2;modello4 ;;serie4;modello6 output: alimentatore1;batteria1;serie1;modello1 alimentatore2;batteria1;serie1;modello1 many, many thanks... |
Well try and digest what is happening in each script. The general idea should be fairly clear though, ie one files data is being looked at via the fields ($3, $4 in examples) and the
other file is being split into an array (using my example that would be f[3] and f[4]). Using this philosophy you can easily see which fields and array items you should compare for your most recent question. See how you go and let us know if you get stuck? |
sorry, i'm already stuck...
let say i have 2 files: scratch1: Code:
product01;code01; Code:
product01;code01;brand1 Code:
product01;code01;brand1 Code:
awk -v "file=scratch1" ' this is the output i get: [CODE] Code:
product01;code01;brand1 Code:
} else can anybody suggest why? many thanks, like always... |
If you follow your code through line by line, the scenario is:
Once on the second loop through scratch2 you will hit the following code: Code:
if ($1 == field[1] && $2 == field[2]) { So now it goes into the END section: Code:
END { and therefore printed. Let me know how you get on? |
sorry, i really can't understand...
i mean, the "problematic" line is the last of scratch1: product01;code02; Code:
{ what's wrong in my thought? |
Quote:
|
Quote:
that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file. in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)... what am i missing? btw, i'm about for dropping and doing it with old good bash: Code:
while read LINE |
Quote:
Quote:
So let us step through the code: 1. The BEGIN - apart from setting a few variables, the only thing this gives us is the first line of scratch1 split into the fields array 2. Testing for empty fields $1 and $2 - again this is never used in our particular example 3. So the following section of code is our focus: Code:
{ Code:
product01;code01;brand1 - split into field array and test if either field is empty - they are not so if is false and break is executed (ie we leave the while loop) 3c. As per point 2 above, test is false. Test if items from both files are equal (remembering that field array was set on previous pass) - again they are both equal and $3 in scratch2 is not empty so print line below: Code:
product01;code01;brand2 Here is where you seem to be getting lost 3e. There are now no more lines in scratch2 to be read so we now jump to END (notice how this differs from 3c, ie we never reach the test you were thinking of). So END says: Code:
END { the while loop is immediately exited and therefore so is the script. I know I have not given the answer exactly, but I am happy to help you learn (seriously not trying to treat you like at school :) ). I personally find it much more gratifying to come across the solution myself. I can also tell you that there are plenty of changes that can be made to your bash, awk, sed, grep and cat script. The main one being that the use of so many different command are almost never required. The first would be that nearly every call to cat is not required, but that is a story for another time :) Let me know if you do not follow the logic as supplied above? |
Quote:
many, many thanks for you support :) |
Well I think it is a shame for you to give up being so close :( But I hope you get to the solution you are looking for even if it is with bash :)
Still happy to answer anymore questions should you have them (although at nearly 3am I am off to bed :) ) |
ok, i gave up with that one... too complicate...
but i just got my first awk script!! :) your efforts to teach me something were not wasted... :) Code:
#!/bin/bash Code:
field[1] = subfield[1] "_" subfield[2] "_" subfield[3] "_" subfield[4] |
Well I am not sure if this is to be part of something bigger, but if not then you do not need the 'file' variable anymore as there is only one
being used so just pass it in. Also, all the splitting is not really needed, again assuming above, but also that split allows you to enter an alternate value to split on apart from the FS value: Code:
split(string, array [, fieldsep]) Now it will depend on whether or not the letter you search for, ie. 'g' in your example, is elsewhere on the line, but the following simplified code works: Code:
awk -F";" '{split($0,f,"_");if(f[2] == "g" && $3 < 50)sub(f[2],"r")}1' prrrr.txt |
just wonderful... :)
and the most amazing thing is i can understand your code! :) many thanks, once again... |
All times are GMT -5. The time now is 03:19 AM. |