LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   bash script: merge files comparing columns... (https://www.linuxquestions.org/questions/programming-9/bash-script-merge-files-comparing-columns-874871/)

masavini 04-13-2011 05:50 PM

bash script: merge files comparing columns...
 
hi,
i have 2 very long files which are quite similar:

file1.csv:
Code:

alimentatore1,batteria1,serie1,modello1
alimentatore1,batteria1,serie1,modello2
,batteria2,serie2,modello3
alimentatore3,batteria3,serie3,modello5

file2.csv:
Code:

alimentatore1,batteria1,serie1,modello1
alimentatore2,,serie2,modello3
,batteria2,serie2,modello4
,,serie4,modello6


i need this output:

Code:

alimentatore1,batteria1,serie1,modello1
alimentatore1,batteria1,serie1,modello2
alimentatore2,batteria2,serie2,modello3
,batteria2,serie2,modello4
alimentatore3,batteria3,serie3,modello5
,,serie4,modello6

explanation:

if the fields 3 (serie) and 4 (modello) are identical, the lines of the 2 files should be "added":
Code:

,batteria2,serie2,modello4 + alimentatore2,,serie2,modello4 = alimentatore2,batteria2,serie2,modello4
if the fields 3 (serie) and 4 (modello) are different, just print the line of both files:
Code:

alimentatore3,batteria3,serie3,modello5
,,serie4,modello6

is there a way to do this without reading the (huge) files with a "while read line" loop?
thanks

catkin 04-14-2011 01:36 AM

For large files, bash is slow and awk would be a better choice -- both for performance and because the language is naturally suited to sequential reading.

Ramurd 04-14-2011 02:15 AM

what's wrong with the command "join"?

Nominal Animal 04-14-2011 03:10 AM

Assuming the CSV files are sorted by the third and fourth fields ascending, then this awk script should be pretty fast:
Code:

awk -v "file=file1.csv" '
    BEGIN {
        RS="[\r\n]+"
        FS=","
        OFS=","

        have = split("", field)
        while ((getline line < file) > 0) {
            have = split(line, field)
            if (field[3] == "" || field[4] == "") {
                print line
                continue
            }

            break
        }
    }

    ($3 == "" || $4 == "" || have < 1) {
        print $0
        next
    }

    {
        if (field[3] "," field[4] < $3 "," $4)
            while (1) {
                print line

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[3] == "" || field[4] == "")
                    continue

                if (field[3] "," field[4] >= $3 "," $4)
                    break
            }

        if ($3 == field[3] && $4 == field[4]) {
            if ($1 == "") $1 = field[1]
            if ($2 == "") $2 = field[2]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[3] == "" || field[4] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }

    END {
        while ((getline line < file) > 0)
            print line
    }
   
' "file2.csv"

The script reads the two CSV files in parallel. All input records (lines) (in either file) with an empty third or fourth field will be output as is. For each primary input file record, it will first read records from the secondary file until the two seem to be in sync. (This is why the files need to be sorted by the third and fourth fields. You can use sort -t , -k 3 fileN.csv > sortedN.csv to sort the CSV files.) If the two files share the record, the merged record is output, and another record must then be read from the secondary file. Otherwise the primary input record is printed. Finally, if the primary file runs out first, the END section prints out the remaining secondary file records.

For me, this seems to reproduce your desired output.

Nominal Animal 04-14-2011 03:28 AM

Quote:

Originally Posted by Ramurd (Post 4324206)
what's wrong with the command "join"?

Can you reproduce the desired output with 'join'? I'd like to see your suggestion, since I couldn't.

The one from GNU Coreutils 7.4 won't join on more than one field. The nearest I got was with
Code:

join -a 1 -a 2 -j 3 -t , -o '1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4' file1.csv file2.csv | \
    awk 'BEGIN { FS="," ; OFS="," }
              { for (i = 1; i <= NF/2; i++)
                    if ($(2*i-1) == "")
                        $i = $(2*i)
                    else
                        $i=$(2*i-1)
                NF=i-1
                print $0 }'


grail 04-14-2011 11:40 AM

Here's an alternative awk. I tried Nominal's but found that when columns 3 or 4 are blank you get some weirdness.
Code:

#!/usr/bin/awk -f

BEGIN { OFS=FS="," }

{
    if( !n && getline line < file )
        split(line,f)

    n=0

    if( ! ( $3 == f[3] && $4 == f[4] ) ){
        while( $3 FS $4 > f[3] FS f[4] ){
            print line
            if(! getline line < file ) break
            split(line,f)
        }
    }
   
    if($3 == f[3] && $4 == f[4]){
        $1 = $1?$1:f[1]
        $2 = $2?$2:f[2]
    }
    else
        n = 1

}
1

You call it in the same fashion:
Code:

./script.awk -vfile="file1" file2

masavini 04-15-2011 06:11 AM

what can i say?
just amazing! :)

thank you guys, your code perfectly works...

another hint, please: what if i want report lines in wich fields 3 and 4 are same and other fields are not?

i.e.:

file1.csv
alimentatore1;batteria1;serie1;modello1
alimentatore1;batteria1;serie1;modello2
;batteria2;serie2;modello3
alimentatore3;batteria3;serie3;modello5

file1.csv
alimentatore2;batteria1;serie1;modello1
alimentatore2;;serie2;modello3
;batteria2;serie2;modello4
;;serie4;modello6

output:
alimentatore1;batteria1;serie1;modello1
alimentatore2;batteria1;serie1;modello1

many, many thanks...

grail 04-15-2011 07:38 AM

Well try and digest what is happening in each script. The general idea should be fairly clear though, ie one files data is being looked at via the fields ($3, $4 in examples) and the
other file is being split into an array (using my example that would be f[3] and f[4]).

Using this philosophy you can easily see which fields and array items you should compare for your most recent question.

See how you go and let us know if you get stuck?

masavini 04-18-2011 05:48 PM

sorry, i'm already stuck...

let say i have 2 files:

scratch1:
Code:

product01;code01;
product01;code01;brand1
product01;code02;

scratch2:
Code:

product01;code01;brand1
product01;code01;brand2

desired output:
Code:

product01;code01;brand1
product01;code01;brand2
product01;code02;

this is my code (yes, i know, not much fantasy...):
Code:

awk -v "file=scratch1" '

    BEGIN {
        RS="[\r\n]+"
        FS=";"
        OFS=";"

        have = split("", field)
        while ((getline line < file) > 0) {
            have = split(line, field)
            if (field[1] == "" || field[2] == "") {
                print line
                continue
            }

            break
        }
    }
   
    ($1 == "" || $2 == "" || have < 1) {
        print $0
        next
    }

    {

        if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }

    END {
        while ((getline line < file) > 0)
            print line
    }
   
' "scratch2"


this is the output i get:
[CODE]
Code:

product01;code01;brand1
product01;code01;brand2

dont' know why, but the instruction
Code:

        } else
            print $0
    }

in the end of the code doesn't seem to work properly...

can anybody suggest why?

many thanks, like always...

grail 04-18-2011 10:38 PM

If you follow your code through line by line, the scenario is:

Once on the second loop through scratch2 you will hit the following code:
Code:

if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0 # this will print the second line from scratch2

            while (1) {

                if ((getline line < file) < 1) { # This will get the last line from scratch1
                    have = split("", field)
                    break
                }

                have = split(line, field) # split entries into field
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break # break out of loop
            }

        }

From the last entry above it will now look for the next entry in scratch2, which there is none.

So now it goes into the END section:
Code:

END {
        while ((getline line < file) > 0) # this will be 0 as we already have the last line previously
            print line
    }

As you can see, the entry already retrieved from scratch1 has been discarded due to the getline in END, prior to it being tested
and therefore printed.

Let me know how you get on?

masavini 04-19-2011 04:13 AM

sorry, i really can't understand...

i mean, the "problematic" line is the last of scratch1:
product01;code02;


Code:

    {

        if ($1 == field[1] && $2 == field[2]) {    <- this condition is not satisfied, so it should skip down
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0      <-- HERE and print the line as is
    }


what's wrong in my thought?

grail 04-19-2011 08:48 AM

Quote:

if ($1 == field[1] && $2 == field[2]) { <- this condition is not satisfied, so it should skip down
What makes you think this?

masavini 04-19-2011 09:47 AM

Quote:

Originally Posted by grail (Post 4329304)
What makes you think this?

i feel like back school again... and like if i didn't study enough... :)
that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file. in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)...
what am i missing?

btw, i'm about for dropping and doing it with old good bash:

Code:

while read LINE
do
  CHECK=$(echo "$LINE" | awk -F ';' '{print $1";"$2}')
 
  rm scratch
  cat file1 | grep "^$CHECK;" > scratch
  if [ -e scratch ] && [ "$(cat scratch | awk -F ';' '{print $3}' | head -1)" != "" ]; then
      cat scratch >> new-file
      echo "$LINE" >> new-file
      sed -i /"^$CHECK;"/d file1
  else
      echo "$LINE" >> new-file
  fi
done < file2

cat new-file | sed -n '/[^;]*;[^;]*;[^;]/p' > new-file2

cat file1 new-file2 | sort | uniq > output

this perfectly works, is just VERY slow...

grail 04-19-2011 10:37 AM

Quote:

that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file
This part is correct
Quote:

in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)
But this is not. You are not following at what point you are up to in each file.

So let us step through the code:

1. The BEGIN - apart from setting a few variables, the only thing this gives us is the first line of scratch1 split into the fields array

2. Testing for empty fields $1 and $2 - again this is never used in our particular example

3. So the following section of code is our focus:
Code:

    {

        if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }

3a. Are first and second fields from both files equal - for the first line in our file this is true and $3 from scratch2 is not empty so the line below is printed:
Code:

product01;code01;brand1
3b. Enter the while loop - get the next line from scratch1 which does exist so if is false.
- split into field array and test if either field is empty - they are not so if is false and break is executed (ie we leave the while loop)
3c. As per point 2 above, test is false. Test if items from both files are equal (remembering that field array was set on previous pass) - again they are both equal and $3 in scratch2 is not empty
so print line below:
Code:

product01;code01;brand2
3d. This is a copy of all steps performed in 3b above

Here is where you seem to be getting lost

3e. There are now no more lines in scratch2 to be read so we now jump to END (notice how this differs from 3c, ie we never reach the test you were thinking of). So END says:
Code:

END {
    while ((getline line < file) > 0)
        print line
}

So it requests that we get the next line from scratch1, but we already retrieved the last line at 3d above. So now the getline test will fail as it is not greater than zero, hence
the while loop is immediately exited and therefore so is the script.

I know I have not given the answer exactly, but I am happy to help you learn (seriously not trying to treat you like at school :) ).
I personally find it much more gratifying to come across the solution myself.

I can also tell you that there are plenty of changes that can be made to your bash, awk, sed, grep and cat script. The main one being that the use of so many different
command are almost never required. The first would be that nearly every call to cat is not required, but that is a story for another time :)

Let me know if you do not follow the logic as supplied above?

masavini 04-19-2011 12:08 PM

Quote:

I know I have not given the answer exactly, but I am happy to help you learn (seriously not trying to treat you like at school ).
I personally find it much more gratifying to come across the solution myself.
i find learning gratifying as well, but if gratifications are balanced or exceeded by frustrations, it's not a deal: the question i posted is just one of the dozens issues i'm finding writing this awk script, so i guess i'd better give up...

many, many thanks for you support :)

grail 04-19-2011 12:29 PM

Well I think it is a shame for you to give up being so close :( But I hope you get to the solution you are looking for even if it is with bash :)

Still happy to answer anymore questions should you have them (although at nearly 3am I am off to bed :) )

masavini 04-21-2011 12:33 PM

ok, i gave up with that one... too complicate...
but i just got my first awk script!! :)
your efforts to teach me something were not wasted... :)

Code:

#!/bin/bash

cd /tmp/

echo -e "dv2000_g_52_b;5;50.01;\ndv2000_g_52_b;5;49.99;" > prrrr.txt

awk -v "file=prrrr.txt" '

BEGIN        {
                RS="[\r\n]+"
                FS=";"
                OFS=";"
               
                have = split("", field)
                while ((getline line < file) > 0) {
                        FS=";"
                        have = split(line, field)
                       
                        FS="_"
                        have = split(field[1], subfield)
                       
                        if (subfield[2] == "g" && field[3] < 50) {
                                subfield[2] = "r"
                                field[1] = subfield[1] "_" subfield[2] "_" subfield[3] "_" subfield[4]
                                print field[1] ";" field[2] ";" field[3] ";"
                        }
                                else
                                print line
                }
        }'

a small question: is there a way to let this part a bit "lighter"?

Code:

                                field[1] = subfield[1] "_" subfield[2] "_" subfield[3] "_" subfield[4]
                                print field[1] ";" field[2] ";" field[3] ";"


grail 04-22-2011 12:39 AM

Well I am not sure if this is to be part of something bigger, but if not then you do not need the 'file' variable anymore as there is only one
being used so just pass it in. Also, all the splitting is not really needed, again assuming above, but also that split allows you to enter
an alternate value to split on apart from the FS value:
Code:

split(string, array [, fieldsep])
So you can see an optional fieldsep is able to be used.

Now it will depend on whether or not the letter you search for, ie. 'g' in your example, is elsewhere on the line, but the following simplified code works:
Code:

awk -F";" '{split($0,f,"_");if(f[2] == "g" && $3 < 50)sub(f[2],"r")}1' prrrr.txt

masavini 04-23-2011 05:01 AM

just wonderful... :)
and the most amazing thing is i can understand your code! :)
many thanks, once again...


All times are GMT -5. The time now is 03:19 AM.