LinuxQuestions.org
Visit Jeremy's Blog.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 04-13-2011, 06:50 PM   #1
masavini
Member
 
Registered: Jun 2008
Posts: 269

Rep: Reputation: 6
bash script: merge files comparing columns...


hi,
i have 2 very long files which are quite similar:

file1.csv:
Code:
alimentatore1,batteria1,serie1,modello1
alimentatore1,batteria1,serie1,modello2
,batteria2,serie2,modello3
alimentatore3,batteria3,serie3,modello5
file2.csv:
Code:
alimentatore1,batteria1,serie1,modello1
alimentatore2,,serie2,modello3
,batteria2,serie2,modello4
,,serie4,modello6

i need this output:

Code:
alimentatore1,batteria1,serie1,modello1
alimentatore1,batteria1,serie1,modello2
alimentatore2,batteria2,serie2,modello3
,batteria2,serie2,modello4
alimentatore3,batteria3,serie3,modello5
,,serie4,modello6
explanation:

if the fields 3 (serie) and 4 (modello) are identical, the lines of the 2 files should be "added":
Code:
,batteria2,serie2,modello4 + alimentatore2,,serie2,modello4 = alimentatore2,batteria2,serie2,modello4
if the fields 3 (serie) and 4 (modello) are different, just print the line of both files:
Code:
alimentatore3,batteria3,serie3,modello5
,,serie4,modello6
is there a way to do this without reading the (huge) files with a "while read line" loop?
thanks

Last edited by masavini; 04-13-2011 at 06:53 PM.
 
Old 04-14-2011, 02:36 AM   #2
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1198Reputation: 1198Reputation: 1198Reputation: 1198Reputation: 1198Reputation: 1198Reputation: 1198Reputation: 1198Reputation: 1198
For large files, bash is slow and awk would be a better choice -- both for performance and because the language is naturally suited to sequential reading.
 
Old 04-14-2011, 03:15 AM   #3
Ramurd
Member
 
Registered: Mar 2009
Location: Rotterdam, the Netherlands
Distribution: Slackwarelinux
Posts: 698

Rep: Reputation: 110Reputation: 110
what's wrong with the command "join"?
 
Old 04-14-2011, 04:10 AM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948
Assuming the CSV files are sorted by the third and fourth fields ascending, then this awk script should be pretty fast:
Code:
awk -v "file=file1.csv" '
    BEGIN {
        RS="[\r\n]+"
        FS=","
        OFS=","

        have = split("", field)
        while ((getline line < file) > 0) {
            have = split(line, field)
            if (field[3] == "" || field[4] == "") {
                print line
                continue
            }

            break
        }
    }

    ($3 == "" || $4 == "" || have < 1) {
        print $0
        next
    }

    {
        if (field[3] "," field[4] < $3 "," $4)
            while (1) {
                print line

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[3] == "" || field[4] == "")
                    continue

                if (field[3] "," field[4] >= $3 "," $4)
                    break
            }

        if ($3 == field[3] && $4 == field[4]) {
            if ($1 == "") $1 = field[1]
            if ($2 == "") $2 = field[2]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[3] == "" || field[4] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }

    END {
        while ((getline line < file) > 0)
            print line
    }
    
' "file2.csv"
The script reads the two CSV files in parallel. All input records (lines) (in either file) with an empty third or fourth field will be output as is. For each primary input file record, it will first read records from the secondary file until the two seem to be in sync. (This is why the files need to be sorted by the third and fourth fields. You can use sort -t , -k 3 fileN.csv > sortedN.csv to sort the CSV files.) If the two files share the record, the merged record is output, and another record must then be read from the secondary file. Otherwise the primary input record is printed. Finally, if the primary file runs out first, the END section prints out the remaining secondary file records.

For me, this seems to reproduce your desired output.
 
Old 04-14-2011, 04:28 AM   #5
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948
Quote:
Originally Posted by Ramurd View Post
what's wrong with the command "join"?
Can you reproduce the desired output with 'join'? I'd like to see your suggestion, since I couldn't.

The one from GNU Coreutils 7.4 won't join on more than one field. The nearest I got was with
Code:
join -a 1 -a 2 -j 3 -t , -o '1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4' file1.csv file2.csv | \
    awk 'BEGIN { FS="," ; OFS="," }
               { for (i = 1; i <= NF/2; i++)
                     if ($(2*i-1) == "")
                         $i = $(2*i)
                     else
                         $i=$(2*i-1)
                 NF=i-1
                 print $0 }'
 
Old 04-14-2011, 12:40 PM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,804

Rep: Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069
Here's an alternative awk. I tried Nominal's but found that when columns 3 or 4 are blank you get some weirdness.
Code:
#!/usr/bin/awk -f

BEGIN { OFS=FS="," }

{
    if( !n && getline line < file )
        split(line,f)

    n=0

    if( ! ( $3 == f[3] && $4 == f[4] ) ){
        while( $3 FS $4 > f[3] FS f[4] ){
            print line
            if(! getline line < file ) break
            split(line,f)
        }
    }
    
    if($3 == f[3] && $4 == f[4]){
        $1 = $1?$1:f[1]
        $2 = $2?$2:f[2]
    }
    else
        n = 1

}
1
You call it in the same fashion:
Code:
./script.awk -vfile="file1" file2
 
Old 04-15-2011, 07:11 AM   #7
masavini
Member
 
Registered: Jun 2008
Posts: 269

Original Poster
Rep: Reputation: 6
what can i say?
just amazing!

thank you guys, your code perfectly works...

another hint, please: what if i want report lines in wich fields 3 and 4 are same and other fields are not?

i.e.:

file1.csv
alimentatore1;batteria1;serie1;modello1
alimentatore1;batteria1;serie1;modello2
;batteria2;serie2;modello3
alimentatore3;batteria3;serie3;modello5

file1.csv
alimentatore2;batteria1;serie1;modello1
alimentatore2;;serie2;modello3
;batteria2;serie2;modello4
;;serie4;modello6

output:
alimentatore1;batteria1;serie1;modello1
alimentatore2;batteria1;serie1;modello1

many, many thanks...
 
Old 04-15-2011, 08:38 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,804

Rep: Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069
Well try and digest what is happening in each script. The general idea should be fairly clear though, ie one files data is being looked at via the fields ($3, $4 in examples) and the
other file is being split into an array (using my example that would be f[3] and f[4]).

Using this philosophy you can easily see which fields and array items you should compare for your most recent question.

See how you go and let us know if you get stuck?
 
Old 04-18-2011, 06:48 PM   #9
masavini
Member
 
Registered: Jun 2008
Posts: 269

Original Poster
Rep: Reputation: 6
sorry, i'm already stuck...

let say i have 2 files:

scratch1:
Code:
product01;code01;
product01;code01;brand1
product01;code02;
scratch2:
Code:
product01;code01;brand1
product01;code01;brand2
desired output:
Code:
product01;code01;brand1
product01;code01;brand2
product01;code02;
this is my code (yes, i know, not much fantasy...):
Code:
awk -v "file=scratch1" '

    BEGIN {
        RS="[\r\n]+"
        FS=";"
        OFS=";"

        have = split("", field)
        while ((getline line < file) > 0) {
            have = split(line, field)
            if (field[1] == "" || field[2] == "") {
                print line
                continue
            }

            break
        }
    }
    
    ($1 == "" || $2 == "" || have < 1) {
        print $0
        next
    }

    {

        if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }

    END {
        while ((getline line < file) > 0)
            print line
    }
    
' "scratch2"

this is the output i get:
[CODE]
Code:
product01;code01;brand1
product01;code01;brand2
dont' know why, but the instruction
Code:
        } else
            print $0
    }
in the end of the code doesn't seem to work properly...

can anybody suggest why?

many thanks, like always...
 
Old 04-18-2011, 11:38 PM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,804

Rep: Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069
If you follow your code through line by line, the scenario is:

Once on the second loop through scratch2 you will hit the following code:
Code:
if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0 # this will print the second line from scratch2

            while (1) {

                if ((getline line < file) < 1) { # This will get the last line from scratch1
                    have = split("", field)
                    break
                }

                have = split(line, field) # split entries into field
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break # break out of loop
            }

        }
From the last entry above it will now look for the next entry in scratch2, which there is none.

So now it goes into the END section:
Code:
END {
        while ((getline line < file) > 0) # this will be 0 as we already have the last line previously
            print line
    }
As you can see, the entry already retrieved from scratch1 has been discarded due to the getline in END, prior to it being tested
and therefore printed.

Let me know how you get on?
 
Old 04-19-2011, 05:13 AM   #11
masavini
Member
 
Registered: Jun 2008
Posts: 269

Original Poster
Rep: Reputation: 6
sorry, i really can't understand...

i mean, the "problematic" line is the last of scratch1:
product01;code02;


Code:
    {

        if ($1 == field[1] && $2 == field[2]) {    <- this condition is not satisfied, so it should skip down
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0       <-- HERE and print the line as is
    }

what's wrong in my thought?
 
Old 04-19-2011, 09:48 AM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,804

Rep: Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069
Quote:
if ($1 == field[1] && $2 == field[2]) { <- this condition is not satisfied, so it should skip down
What makes you think this?
 
Old 04-19-2011, 10:47 AM   #13
masavini
Member
 
Registered: Jun 2008
Posts: 269

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by grail View Post
What makes you think this?
i feel like back school again... and like if i didn't study enough...
that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file. in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)...
what am i missing?

btw, i'm about for dropping and doing it with old good bash:

Code:
while read LINE
do
   CHECK=$(echo "$LINE" | awk -F ';' '{print $1";"$2}')
   
   rm scratch
   cat file1 | grep "^$CHECK;" > scratch
   if [ -e scratch ] && [ "$(cat scratch | awk -F ';' '{print $3}' | head -1)" != "" ]; then
      cat scratch >> new-file
      echo "$LINE" >> new-file
      sed -i /"^$CHECK;"/d file1
   else
      echo "$LINE" >> new-file
   fi
done < file2

cat new-file | sed -n '/[^;]*;[^;]*;[^;]/p' > new-file2

cat file1 new-file2 | sort | uniq > output
this perfectly works, is just VERY slow...
 
Old 04-19-2011, 11:37 AM   #14
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,804

Rep: Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069Reputation: 3069
Quote:
that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file
This part is correct
Quote:
in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)
But this is not. You are not following at what point you are up to in each file.

So let us step through the code:

1. The BEGIN - apart from setting a few variables, the only thing this gives us is the first line of scratch1 split into the fields array

2. Testing for empty fields $1 and $2 - again this is never used in our particular example

3. So the following section of code is our focus:
Code:
    {

        if ($1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0

            while (1) {

                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }

                have = split(line, field)
                if (field[1] == "" || field[2] == "") {
                    print line
                    continue
                }

                break
            }

        } else
            print $0
    }
3a. Are first and second fields from both files equal - for the first line in our file this is true and $3 from scratch2 is not empty so the line below is printed:
Code:
product01;code01;brand1
3b. Enter the while loop - get the next line from scratch1 which does exist so if is false.
- split into field array and test if either field is empty - they are not so if is false and break is executed (ie we leave the while loop)
3c. As per point 2 above, test is false. Test if items from both files are equal (remembering that field array was set on previous pass) - again they are both equal and $3 in scratch2 is not empty
so print line below:
Code:
product01;code01;brand2
3d. This is a copy of all steps performed in 3b above

Here is where you seem to be getting lost

3e. There are now no more lines in scratch2 to be read so we now jump to END (notice how this differs from 3c, ie we never reach the test you were thinking of). So END says:
Code:
END {
    while ((getline line < file) > 0)
        print line
}
So it requests that we get the next line from scratch1, but we already retrieved the last line at 3d above. So now the getline test will fail as it is not greater than zero, hence
the while loop is immediately exited and therefore so is the script.

I know I have not given the answer exactly, but I am happy to help you learn (seriously not trying to treat you like at school ).
I personally find it much more gratifying to come across the solution myself.

I can also tell you that there are plenty of changes that can be made to your bash, awk, sed, grep and cat script. The main one being that the use of so many different
command are almost never required. The first would be that nearly every call to cat is not required, but that is a story for another time

Let me know if you do not follow the logic as supplied above?
 
Old 04-19-2011, 01:08 PM   #15
masavini
Member
 
Registered: Jun 2008
Posts: 269

Original Poster
Rep: Reputation: 6
Quote:
I know I have not given the answer exactly, but I am happy to help you learn (seriously not trying to treat you like at school ).
I personally find it much more gratifying to come across the solution myself.
i find learning gratifying as well, but if gratifications are balanced or exceeded by frustrations, it's not a deal: the question i posted is just one of the dozens issues i'm finding writing this awk script, so i guess i'd better give up...

many, many thanks for you support
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
merge columns from multiple files vijay_babu1981 Linux - Newbie 21 06-24-2014 07:59 AM
[SOLVED] Comparing files in bash script brainlesseinstein Linux - Newbie 9 09-02-2010 01:41 PM
how to merge files in different folder using bash script kkpal Programming 5 04-21-2008 03:35 AM
awk command to merge columns from two separate files into single file? johnpaulodonnell Linux - Newbie 4 01-23-2007 11:10 AM
Is there a command to merge two files as two columns of one file? davee Linux - General 2 07-19-2005 11:52 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 06:48 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration