[SOLVED] bash script: merge files comparing columns...

masavini · 04-13-2011, 05:50 PM

hi,
i have 2 very long files which are quite similar:

file1.csv:

Code:

alimentatore1,batteria1,serie1,modello1
alimentatore1,batteria1,serie1,modello2
,batteria2,serie2,modello3
alimentatore3,batteria3,serie3,modello5

file2.csv:

Code:

alimentatore1,batteria1,serie1,modello1
alimentatore2,,serie2,modello3
,batteria2,serie2,modello4
,,serie4,modello6

i need this output:

Code:

alimentatore1,batteria1,serie1,modello1
alimentatore1,batteria1,serie1,modello2
alimentatore2,batteria2,serie2,modello3
,batteria2,serie2,modello4
alimentatore3,batteria3,serie3,modello5
,,serie4,modello6

explanation:

if the fields 3 (serie) and 4 (modello) are identical, the lines of the 2 files should be "added":

Code:

,batteria2,serie2,modello4 + alimentatore2,,serie2,modello4 = alimentatore2,batteria2,serie2,modello4

if the fields 3 (serie) and 4 (modello) are different, just print the line of both files:

Code:

alimentatore3,batteria3,serie3,modello5
,,serie4,modello6

is there a way to do this without reading the (huge) files with a "while read line" loop?
thanks

catkin · 04-14-2011, 01:36 AM

For large files, bash is slow and awk would be a better choice -- both for performance and because the language is naturally suited to sequential reading.

Ramurd · 04-14-2011, 02:15 AM

what's wrong with the command "join"?

Nominal Animal · 04-14-2011, 03:10 AM

Assuming the CSV files are sorted by the third and fourth fields ascending, then this awk script should be pretty fast:

Code:

awk -v "file=file1.csv" '
    BEGIN {
        RS="[\r\n]+"
        FS=","
        OFS=","

        have = split("", field)
        while ((getline line < file) > 0) {
            have = split(line, field)
            if (field[3] == "" || field[4] == "") {
                print line
                continue
            }

            break
        }
    }

    ($3 == "" || $4 == "" || have < 1) {
        print $0
        next
    }

    {
        if (field[3] "," field[4] < $3 "," $4)
            while (1) {
                print line

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[3] == "" || field[4] == "")
                    continue

                if (field[3] "," field[4] >= $3 "," $4)
                    break
            }

        if ($3 == field[3] && $4 == field[4]) {
            if ($1 == "") $1 = field[1]
            if ($2 == "") $2 = field[2]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[3] == "" || field[4] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }

    END {
        while ((getline line < file) > 0)
            print line
    }
    
' "file2.csv"

The script reads the two CSV files in parallel. All input records (lines) (in either file) with an empty third or fourth field will be output as is. For each primary input file record, it will first read records from the secondary file until the two seem to be in sync. (This is why the files need to be sorted by the third and fourth fields. You can use sort -t , -k 3 fileN.csv > sortedN.csv to sort the CSV files.) If the two files share the record, the merged record is output, and another record must then be read from the secondary file. Otherwise the primary input record is printed. Finally, if the primary file runs out first, the END section prints out the remaining secondary file records.

For me, this seems to reproduce your desired output.

Nominal Animal · 04-14-2011, 03:28 AM

Quote:

Originally Posted by Ramurd

what's wrong with the command "join"?

Can you reproduce the desired output with 'join'? I'd like to see your suggestion, since I couldn't.

The one from GNU Coreutils 7.4 won't join on more than one field. The nearest I got was with

Code:

join -a 1 -a 2 -j 3 -t , -o '1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4' file1.csv file2.csv | \
    awk 'BEGIN { FS="," ; OFS="," }
               { for (i = 1; i <= NF/2; i++)
                     if ($(2*i-1) == "")
                         $i = $(2*i)
                     else
                         $i=$(2*i-1)
                 NF=i-1
                 print $0 }'

grail · 04-14-2011, 11:40 AM

Here's an alternative awk. I tried Nominal's but found that when columns 3 or 4 are blank you get some weirdness.

Code:

#!/usr/bin/awk -f

BEGIN { OFS=FS="," }

{
    if( !n && getline line < file )
        split(line,f)

    n=0

    if( ! ( $3 == f[3] && $4 == f[4] ) ){
        while( $3 FS $4 > f[3] FS f[4] ){
            print line
            if(! getline line < file ) break
            split(line,f)
        }
    }
    
    if($3 == f[3] && $4 == f[4]){
        $1 = $1?$1:f[1]
        $2 = $2?$2:f[2]
    }
    else
        n = 1

}
1

You call it in the same fashion:

Code:

./script.awk -vfile="file1" file2

masavini · 04-15-2011, 06:11 AM

what can i say?
just amazing!

thank you guys, your code perfectly works...

another hint, please: what if i want report lines in wich fields 3 and 4 are same and other fields are not?

i.e.:

file1.csv
alimentatore1;batteria1;serie1;modello1
alimentatore1;batteria1;serie1;modello2
;batteria2;serie2;modello3
alimentatore3;batteria3;serie3;modello5

file1.csv
alimentatore2;batteria1;serie1;modello1
alimentatore2;;serie2;modello3
;batteria2;serie2;modello4
;;serie4;modello6

output:
alimentatore1;batteria1;serie1;modello1
alimentatore2;batteria1;serie1;modello1

many, many thanks...

grail · 04-15-2011, 07:38 AM

Well try and digest what is happening in each script. The general idea should be fairly clear though, ie one files data is being looked at via the fields ($3, $4 in examples) and the
other file is being split into an array (using my example that would be f[3] and f[4]).

Using this philosophy you can easily see which fields and array items you should compare for your most recent question.

See how you go and let us know if you get stuck?

masavini · 04-18-2011, 05:48 PM

sorry, i'm already stuck...

let say i have 2 files:

scratch1:

Code:

product01;code01;
product01;code01;brand1
product01;code02;

scratch2:

Code:

product01;code01;brand1
product01;code01;brand2

desired output:

Code:

product01;code01;brand1
product01;code01;brand2
product01;code02;

this is my code (yes, i know, not much fantasy...):

Code:

awk -v "file=scratch1" '

    BEGIN {
        RS="[\r\n]+"
        FS=";"
        OFS=";"

        have = split("", field)
        while ((getline line < file) > 0) {
            have = split(line, field)
            if (field[1] == "" || field[2] == "") {
                print line
                continue
            }

            break
        }
    }
    
    ($1 == "" || $2 == "" || have < 1) {
        print $0
        next
    }

    {

        if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }

    END {
        while ((getline line < file) > 0)
            print line
    }
    
' "scratch2"

this is the output i get:
[CODE]

Code:

product01;code01;brand1
product01;code01;brand2

dont' know why, but the instruction

Code:

        } else
            print $0
    }

in the end of the code doesn't seem to work properly...

can anybody suggest why?

many thanks, like always...

grail · 04-18-2011, 10:38 PM

If you follow your code through line by line, the scenario is:

Once on the second loop through scratch2 you will hit the following code:

Code:

if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0 # this will print the second line from scratch2

            while (1) {

                if ((getline line < file) < 1) { # This will get the last line from scratch1
                    have = split("", field)
                    break
                }

                have = split(line, field) # split entries into field
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break # break out of loop
            }

        }

From the last entry above it will now look for the next entry in scratch2, which there is none.

So now it goes into the END section:

Code:

END {
        while ((getline line < file) > 0) # this will be 0 as we already have the last line previously
            print line
    }

As you can see, the entry already retrieved from scratch1 has been discarded due to the getline in END, prior to it being tested
and therefore printed.

Let me know how you get on?

masavini · 04-19-2011, 04:13 AM

sorry, i really can't understand...

i mean, the "problematic" line is the last of scratch1:
product01;code02;

Code:

    {

        if ($1 == field[1] && $2 == field[2]) {    <- this condition is not satisfied, so it should skip down
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0       <-- HERE and print the line as is
    }

what's wrong in my thought?

grail · 04-19-2011, 08:48 AM

Quote:

if ($1 == field[1] && $2 == field[2]) { <- this condition is not satisfied, so it should skip down

What makes you think this?

masavini · 04-19-2011, 09:47 AM

Quote:

Originally Posted by grail

What makes you think this?

i feel like back school again... and like if i didn't study enough...

that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file. in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)...
what am i missing?

btw, i'm about for dropping and doing it with old good bash:

Code:

while read LINE
do
   CHECK=$(echo "$LINE" | awk -F ';' '{print $1";"$2}')
   
   rm scratch
   cat file1 | grep "^$CHECK;" > scratch
   if [ -e scratch ] && [ "$(cat scratch | awk -F ';' '{print $3}' | head -1)" != "" ]; then
      cat scratch >> new-file
      echo "$LINE" >> new-file
      sed -i /"^$CHECK;"/d file1
   else
      echo "$LINE" >> new-file
   fi
done < file2

cat new-file | sed -n '/[^;]*;[^;]*;[^;]/p' > new-file2

cat file1 new-file2 | sort | uniq > output

this perfectly works, is just VERY slow...

grail · 04-19-2011, 10:37 AM

Quote:

that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file

This part is correct

Quote:

in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)

But this is not. You are not following at what point you are up to in each file.

So let us step through the code:

1. The BEGIN - apart from setting a few variables, the only thing this gives us is the first line of scratch1 split into the fields array

2. Testing for empty fields $1 and $2 - again this is never used in our particular example

3. So the following section of code is our focus:

Code:

    {

        if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }

3a. Are first and second fields from both files equal - for the first line in our file this is true and $3 from scratch2 is not empty so the line below is printed:

Code:

product01;code01;brand1

3b. Enter the while loop - get the next line from scratch1 which does exist so if is false.
- split into field array and test if either field is empty - they are not so if is false and break is executed (ie we leave the while loop)
3c. As per point 2 above, test is false. Test if items from both files are equal (remembering that field array was set on previous pass) - again they are both equal and $3 in scratch2 is not empty
so print line below:

Code:

product01;code01;brand2

3d. This is a copy of all steps performed in 3b above

Here is where you seem to be getting lost

3e. There are now no more lines in scratch2 to be read so we now jump to END (notice how this differs from 3c, ie we never reach the test you were thinking of). So END says:

Code:

END {
    while ((getline line < file) > 0)
        print line
}

So it requests that we get the next line from scratch1, but we already retrieved the last line at 3d above. So now the getline test will fail as it is not greater than zero, hence
the while loop is immediately exited and therefore so is the script.

I know I have not given the answer exactly, but I am happy to help you learn (seriously not trying to treat you like at school

).
I personally find it much more gratifying to come across the solution myself.

I can also tell you that there are plenty of changes that can be made to your bash, awk, sed, grep and cat script. The main one being that the use of so many different
command are almost never required. The first would be that nearly every call to cat is not required, but that is a story for another time

Let me know if you do not follow the logic as supplied above?

masavini · 04-19-2011, 12:08 PM

Quote:

I know I have not given the answer exactly, but I am happy to help you learn (seriously not trying to treat you like at school ).
I personally find it much more gratifying to come across the solution myself.

i find learning gratifying as well, but if gratifications are balanced or exceeded by frustrations, it's not a deal: the question i posted is just one of the dozens issues i'm finding writing this awk script, so i guess i'd better give up...

many, many thanks for you support