LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   bash: sort lines in 2 files so that equal lines are at the same line number... (http://www.linuxquestions.org/questions/programming-9/bash-sort-lines-in-2-files-so-that-equal-lines-are-at-the-same-line-number-4175412056/)

masavini 06-18-2012 09:47 AM

bash: sort lines in 2 files so that equal lines are at the same line number...
 
hi,
i have two files like these:

file1.csv:
Code:

23382;gigi;2
23312;gaga;3
244442;bubu;aa
233332;dede;1

file2.csv:
Code:

3;2;baba
gg;2;gigi
a;23;dede
b;3;mimi

i'd sort the first file so that lines in which 2nd field is present in 3rd field of 2nd file are in the same position...
in this example the first line (2nd field = gigi) should be printed out in 2nd line and 4th line (dede) should be printed as the 3rd line.
order of non matching lines does not matter...

example output:
Code:

23312;gaga;2
23382;gigi;3
233332;dede;aa
244442;bubu;1

many thanks!

pixellany 06-18-2012 10:07 AM

I've just read thru this 3 times, and I cannot see what the rule is. Perhaps a before and after example with more lines in each file.

Possibly more important, what overall problem are you trying to solve?

danielbmartin 06-18-2012 10:27 AM

Quote:

Originally Posted by masavini (Post 4706121)
i have two files like these ...

Please review your post and make corrections if needed. You offer two sample input files but they both have the same name, file1.csv. CSV usually means comma separated values but your sample files have semicolons rather than commas. The words in your post seem to not match the data in your sample files.

I have a hunch that join will perform most (possibly all) of the desired function.

Daniel B. Martin

masavini 06-18-2012 05:08 PM

Quote:

Originally Posted by danielbmartin (Post 4706151)
Please review your post and make corrections if needed.

sorry... this problem has been twisting in my mind for a week and it's so clear to me that i didn't guess it would have been so hard to explain.

let say we have 2 text files... they contain fields separated by ;
one field of the first file and one field of the 2nd file MAY contain the same value:

Code:

$ cat file1
random number2;interesting value2;randomnumber2
random n3;interesting value3;randomnumb3
random number1;value1;randomnumber1
random n4;value4;randomnumb4

Code:

$ cat file2
string1;number1;val1
str2;random number2;interesting value2
some string3;random n3;interesting value3
strr4;randomn4;some value4

please notice that the 2nd field of the first file and the 3rd field of second file contain some common values: "interesting value2" and "interesting value3"

processing file1 line by line:
the 2nd field of the first line is "interesting value2". this value is present in last column of second line of file2, so this line of file1 ("random number2;interesting value2;randomnumber2") should be printed as the second line of the output.
the 2nd field of the 2nd line ("interesting value3") is present in the last field of the 3rd line of file2. so the line "random n3;interesting value3;randomnumb3" of file1 should be printed as the 3rd line of the output.
for both the 3rd and the 4th lines of file1, the values of the 2nd fields are not present in the last column of file2. so i don't mind the row number of these lines in the output.

i.e.:
Code:

$ sort-with-template.sh file1 -field=2 file2 -field=3
random number1;value1;randomnumber1
random number2;interesting value2;randomnumber2
random n3;interesting value3;randomnumb3
random n4;value4;randomnumb4

or:
Code:

$ sort-with-template.sh file1 -field=2 file2 -field=3
random n4;value4;randomnumb4
random number2;interesting value2;randomnumber2
random n3;interesting value3;randomnumb3
random number1;value1;randomnumber1

in this example there are 2 possible outputs, since the 1st and the 4th lines of input can be printed in both the 1st and the last lines of output... it does not matter.

thank you for your patience...

masavini 06-18-2012 05:42 PM

i can try and post some code...

Code:

#!/bin/bash

awk -F ';' '{print $2}' /tmp/file1 > interestingValues1
awk -F ';' '{print $3}' /tmp/file2 > interestingValues2

i=0
while read value; do
        check=$(grep "$value" interestingValues1)
        if [[ ! -z $check ]]; then
                grep "$value" /tmp/file1 >> out
                sed -i /"$value"/d interestingValues1
        else
                echo "mismatch $i" >> out
                let i+=1
        fi
done < interestingValues2

if [[ -s interestingValues1 ]]; then
        i=0
        while read value; do
                line=$(grep "$value" /tmp/file1)
                sed -i s/'mismatch '"$i"/"$line"/ out
                let i+=1
        done < interestingValues1
fi

cat out

this works, but i guess it's quite ugly...

Nominal Animal 06-18-2012 06:20 PM

Here is my suggestion as an awk script. Save the script, then run it, specifying the files in reverse order, i.e. file2.csv file1.csv.
Code:

#!/usr/bin/awk -f
BEGIN {
    # Each line is a separate record. Allow any newline convention.
    RS = "(\r\n|\n\r|\r|\n)"

    # Fields within a record are separated by semicolons.
    FS = ";"

    # For output, use newlines. (OFS is not used by this script.)
    ORS = "\n"

    # Which input file (1 = first, 2 = second) currently being processed?
    file = 0

    # Number of unused entries?
    unuseds = 0

    # Maximum line number for the used entries?
    lines = 0
}

(FNR == 1) {
    # This is the first record of an input file, so increase the file counter.
    file++
}

(file == 1 && NF >= 3) {
    # First input file, and record contains at least three fields.
    # Check for duplicate entries, then save the line number for this key.
    if ($3 in lineno)
        printf("%s: Line %d: Repeated \"%s\", first seen on line %d.\n", FILENAME, FNR, $3, lineno[$3]) > "/dev/stderr"
    else
        lineno[$3] = FNR
}

(file == 2 && NF >= 2) {
    # Second input file, and record contains at least two fields.
    # Check if there is a saved line number for this key.
    if ($2 in lineno) {
        # Store this record in the "used" array.
        used[lineno[$2]] = $0
        # Save the largest line number seen in "lines".
        if (lineno[$2] > lines)
            lines = lineno[$2]
    } else {
        # No, this is one of the filler lines.
        unused[++unuseds] = $0
    }
}

END {
    # No filler lines used yet.
    unline = 0

    # We know that we must produce at least lines lines,
    # to correctly populate each output line.
    # So, loop over output lines in order.
    for (line = 1; line <= lines; line++) {
        if (line in used) {
            # This line is reserved for one of the sved lines.
            printf("%s%s", used[line], ORS)
        } else
        if (unline < unuseds) {
            # No matching line, but filler lines left, so use a filler line.
            printf("%s%s", unused[++unline], ORS)
        } else {
            # No filler lines left. Umm.. this should not happen, I think.
            # Well, just output an empty line, then.
            printf("%s", ORS)
        }
    }

    # Output any filler lines still left.
    while (unline < unuseds)
        printf("%s%s", unused[++unline], ORS)
}

The script first scans file2.csv for the identifiers, and save the line number each identifier was seen on. This is why this file needs to be listed first.

The script then scans file1.csv. If there is a line number for the identifier in each record, then the current record is saved in the used array, indexed using the line number used for output. It also keeps the largest line number used for output in lines, since that is the minimum number of lines the script must output.

Records with an identifier that has no specified output line number, are saved in the unused array, with a monotonically increasing index unuseds.

The END rule is processed once after all records have been processed. Here, we have two loops. The first loop goes from 1 to lines, and outputs the record from the used array for that line if there is one. Otherwise, it will pick the next filler line from the unused array. (Note that this means the unused lines will be used in the order they were seen, not in random order. I think this should be most useful for you.) The second loop in the END rule just makes sure all filler lines have been output.

This model is quite efficient wrt. disk I/O. Associative arrays in awk are quite fast, too: most awks use hashing and other efficient access algorithms. The only downside is that all records from file1.csv are read into memory first. The overall memory use is somewhat larger than that file. Hopefully that won't be an issue. (If you have a 64-bit distribution, you can easily handle much larger datasets than can fit in memory using most awk variants; the script will just cause a lot of swapping ("trashing"), and be quite slow. But it should work even then.)

Questions? Comments? Suggestions?

masavini 06-19-2012 08:28 AM

it works!
thank you...

danielbmartin 06-21-2012 06:07 AM

This is an interesting problem. The data in File1 must be reordered according to key matches which may be found in File2. Non-trivial.

LQ Senior Member Nominal Animal contributed an awk solution. I respect and admire Nominal's skill and have no wish to compete with him. However, my coding philosophy avoids explicit loops wherever possible on the premise that loops (especially loops in interpreted languages) are slower than one-pass commands. [Candid admission: this is not always true.]

With this post I offer another proposed solution, one which does not use explicit loops. My solution has the disadvantage of freely using temporary files. Consequently the efficiency of loopless code may be offset by the I/O involved with those work files.

I ask that OP masavini run my code and report execution times with further posts to this thread. We may all learn something from those results.

File1 contents may be characterized this way: each line contains
Code:

baggage;key;baggage
I find it convenient to work with a derivative version of this file in which each line is prefixed with the key and its line number, as:
Code:

key;seqnum;baggage;key;baggage
To use my code you will have to change the file identifiers to suit your own computing environment.

Without further ado, this is my program in its entirety:
Code:

#!/bin/bash
#  Daniel B. Martin  Jun12
#
#  To execute this program, launch a terminal sesson and enter:
#  bash /home/daniel/Desktop/LQfiles/dbm398.bin
#
# This program inspired by ...
# http://www.linuxquestions.org/questions/programming-9/
# bash-sort-lines-in-2-files-so-that-equal-lines-are-at-the-same-line-number-4175412056/

# file1 (key value in 2nd field) ...
# random number2;interesting value2;randomnumber2
# random n3;interesting value3;randomnumb3
# random number1;value1;randomnumber1
# random n4;value4;randomnumb4

# file2 (key value in 3rd field) ...
# string1;number1;val1
# str2;random number2;interesting value2
# some string3;random n3;interesting value3
# strr4;randomn4;some value4

# Note: interesting value2 and interesting value3
#  are key field values which appear in both input files.

# An acceptable output file ...
# random number1;value1;randomnumber1
# random number2;interesting value2;randomnumber2
# random n3;interesting value3;randomnumb3
# random n4;value4;randomnumb4

# Another acceptable output file ...
# random n4;value4;randomnumb4
# random number2;interesting value2;randomnumber2
# random n3;interesting value3;randomnumb3
# random number1;value1;randomnumber1


# File identification
InFile1='/home/daniel/Desktop/LQfiles/dbm398inp1.txt'
InFile2='/home/daniel/Desktop/LQfiles/dbm398inp2.txt'
OutFile='/home/daniel/Desktop/LQfiles/dbm398out.txt'
Work01='/home/daniel/Desktop/LQfiles/dbm398w01.txt'
Work02='/home/daniel/Desktop/LQfiles/dbm398w02.txt'
Work03='/home/daniel/Desktop/LQfiles/dbm398w03.txt'
Work04='/home/daniel/Desktop/LQfiles/dbm398w04.txt'
Work05='/home/daniel/Desktop/LQfiles/dbm398w05.txt'
Work06='/home/daniel/Desktop/LQfiles/dbm398w06.txt'
Work07='/home/daniel/Desktop/LQfiles/dbm398w07.txt'
Work08='/home/daniel/Desktop/LQfiles/dbm398w08.txt'
Work09='/home/daniel/Desktop/LQfiles/dbm398w09.txt'
Work10='/home/daniel/Desktop/LQfiles/dbm398w10.txt'
Work11='/home/daniel/Desktop/LQfiles/dbm398w11.txt'
Work12='/home/daniel/Desktop/LQfiles/dbm398w12.txt'


echo "Prefix lines in InFile1 with key value and sequence number."
echo "Save key values, seperately."
awk -F";" '{printf "%s;%06d;%s\n", $2,NR,$0}' < $InFile1 \
|tee $Work11                    \
|cut -d ';' -f1                \
|sort                          \
> $Work01

echo "Prefix lines in InFile2 with key value and sequence number."
echo "Save key values, seperately."
awk -F";" '{printf "%s;%06d;%s\n", $3,NR,$0}' < $InFile2 \
|tee $Work12                    \
|cut -d ';' -f1                  \
|sort                            \
> $Work02

echo "Find key values common to both Work files."
echo "Prefix each string with ^ (starting with) for later use with grep -f."
comm -12 $Work01 $Work02        \
|sed 's/^/\^/g'                  \
> $Work03

echo "Identify lines in InFile2 WITH key-value matches."
echo "Carve out the content of those lines."
echo "This will become part of the final output file."
grep -f $Work03 < $Work12        \
|join -1 1 -2 1 -t';' - $Work11  \
|cut -d ';' -f2,7-              \
> $Work04

echo "Identify lines in InFile1 WITHOUT key-value matches."
echo "Carve out lines which will become part of the final output file."
grep -v -f $Work03 < $Work11    \
|cut -d ';' -f3-                \
> $Work05

echo "Determine the output file line numbers for these lines."
echo "Those positions will be whatever was not already assigned."
echo "Assign sequence numbers to the 'no match' content lines from InFile1."
cut -d ';' -f1 $Work04 > $Work07
cut -d ';' -f2 $Work11 > $Work08
comm -3 $Work07 $Work08          \
|sed 's/^[ \t]*//'              \
|paste -d ";" - $Work05 > $Work10

echo "Merge partial result files and trim position values."
sort -m $Work10 $Work04          \
|cut -d ";" -f2-                \
> $OutFile


#
# "An acceptable output file ..."
# "random number1;value1;randomnumber1"
# "random number2;interesting value2;randomnumber2"
# "random n3;interesting value3;randomnumb3"
# "random n4;value4;randomnumb4"
#
# "Another acceptable output file ..."
# "random n4;value4;randomnumb4"
# "random number2;interesting value2;randomnumber2"
# "random n3;interesting value3;randomnumb3"
# "random number1;value1;randomnumber1"

#
# "Contents of OutFile (the finished product.)"
# cat $OutFile

echo
echo "Our work is done.  Punch out.  Go home."
echo 'Normal end of job.'
echo 'Execution ended.'
echo

exit

Daniel B. Martin

Nominal Animal 06-21-2012 08:27 AM

Quote:

Originally Posted by danielbmartin (Post 4708455)
Nominal's skill

:p Blrblblblbl, I make more than my share of mistakes and errors.

Personally, different approaches to solving the same problem is one of the reasons I'm a member here. I very much appreciate seeing others' solutions; doubly so when the approach/methodology is discussed or explained. Simply put, "competing" solutions are extremely valuable, in my opinion. I don't think of this as a competition, more like a friendly discussion regarding different approaches.

That said, I'd like to suggest a small change with respect to temporary files: use an automatically deleted temporary directory to house all the temporary files:
Code:

Work="$(mktemp -d)" || exit $?
trap "rm -rf '$Work'" EXIT

The above creates a safe (not accessible to other users) temporary directory (or aborts the script if that proves impossible), then sets a trap that removes the directory and all its contents automatically when the script exits. The trap works even if the script dies due to an error. The quoting is such that $Work is evaluated immediately, so even if you change the value of Work later, it will not affect the trap. It is quite safe to use.

Instead of using $Work01 in your script, you'd use "$Work/01" for example. (Initially, $Work will always be a pristine, empty directory, so you can freely choose any file names you wish for the temporary files.)

masavini 06-21-2012 09:12 AM

i'll be honest... i'm still using my old script version...

the problem with nominal animal solution was that "real" file1 has VERY long and "complex" lines... the last field of each line contains a long html page with javascripts and MANY special characters...

while using nominal animal script, it happened that even if file1 and file2 had same row number, the output was shorter... a few lines were always missing and i had no time for proper debugging...

this routine is in the middle of one of my most important scripts, so i need VERY stable code (even if it's a bit slower)...

here is my actual code:

Code:

        awk -F ';' '{print $2}' file1 > titles1
        awk -F ';' '{print $3}' file2 > titles2
       
        i=0
        while read title; do
                check=$(grep "$title" titles1)
                if [[ ! -z $check ]]; then
                        grep -i "$title" file1 >> scratch
                        sed -i /"$title"/d titles1
                else
                        echo "mismatch" >> scratch
                        let i+=1
                fi
        done < titles2
       
        if [[ -s titles1 ]]; then
                i=0
                while read line; do
                        if [[ $line == "mismatch" ]]; then
                                title=$(head -1 titles1)
                                grep -i "$title" file1 >> output.csv
                                sed -i '1d' titles1
                        else
                                echo "$line" >> output.csv
                        fi
                        let i+=1
                done < scratch
        fi

        cat output.csv


danielbmartin 06-21-2012 01:58 PM

Quote:

Originally Posted by Nominal Animal (Post 4708529)
... I'd like to suggest a small change with respect to temporary files: use an automatically deleted temporary directory to house all the temporary files ...

Yes, of course. Production code should be "well behaved" and clean up after itself. For program development I leave the work files on disk for post-execution inspection. They assure me that all the interim steps worked as designed... and if not, they show where things went awry.

With regard to execution time: the small sample files provided by OP run in zero time with my code and probably yours too. Could you generate test files with 50,000 lines for a real horse race?

Daniel B. Martin


All times are GMT -5. The time now is 02:27 AM.