LinuxQuestions.org
Old 06-18-2012, 09:47 AM   #1
masavini
Member
 
Registered: Jun 2008
Posts: 193

Rep: Reputation: 6
bash: sort lines in 2 files so that equal lines are at the same line number...


hi,
i have two files like these:

file1.csv:
Code:
23382;gigi;2
23312;gaga;3
244442;bubu;aa
233332;dede;1
file2.csv:
Code:
3;2;baba
gg;2;gigi
a;23;dede
b;3;mimi
i'd like to sort the first file so that each line whose 2nd field appears in the 3rd field of some line of file2 ends up at that line's position...
in this example, the first line of file1 (2nd field = gigi) should be printed as the 2nd line of the output, and the 4th line (dede) as the 3rd line.
the order of the non-matching lines does not matter...

example output:
Code:
23312;gaga;2
23382;gigi;3
233332;dede;aa
244442;bubu;1
many thanks!

Last edited by masavini; 06-18-2012 at 04:42 PM.
 
Old 06-18-2012, 10:07 AM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
I've just read through this 3 times, and I cannot see what the rule is. Perhaps a before-and-after example with more lines in each file would help.

Possibly more important, what overall problem are you trying to solve?
 
Old 06-18-2012, 10:27 AM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,136

Rep: Reputation: 299
Quote:
Originally Posted by masavini View Post
i have two files like these ...
Please review your post and make corrections if needed. You offer two sample input files but they both have the same name, file1.csv. CSV usually means comma-separated values, but your sample files have semicolons rather than commas. The words in your post seem not to match the data in your sample files.

I have a hunch that join will perform most (possibly all) of the desired function.
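A hedged sketch of that idea, using the sample files from the first post (it assumes ';' never occurs inside a field, and sorts both sides on the key as join requires):

```shell
#!/bin/sh
# Prefix each record with its key, then sort on the key for join.
awk -F';' '{print $2 ";" $0}' file1.csv | sort -t';' -k1,1 > keys1   # key;record
awk -F';' '{print $3 ";" NR}' file2.csv | sort -t';' -k1,1 > keys2   # key;lineno

# For every key present in both files, emit: key;record-fields...;lineno
join -t';' keys1 keys2
```

The trailing lineno field says where each matched record belongs in the output; the unmatched records would still need separate handling.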

Daniel B. Martin
 
Old 06-18-2012, 05:08 PM   #4
masavini
Member
 
Registered: Jun 2008
Posts: 193

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by danielbmartin View Post
Please review your post and make corrections if needed.
sorry... this problem has been twisting in my mind for a week and it's so clear to me that i didn't expect it would be so hard to explain.

let's say we have 2 text files... they contain fields separated by ;
one field of the first file and one field of the 2nd file MAY contain the same value:

Code:
$ cat file1
random number2;interesting value2;randomnumber2
random n3;interesting value3;randomnumb3
random number1;value1;randomnumber1
random n4;value4;randomnumb4
Code:
$ cat file2
string1;number1;val1
str2;random number2;interesting value2
some string3;random n3;interesting value3
strr4;randomn4;some value4
please notice that the 2nd field of the first file and the 3rd field of the second file contain some common values: "interesting value2" and "interesting value3"

processing file1 line by line:
the 2nd field of the first line is "interesting value2". this value is present in the last column of the second line of file2, so this line of file1 ("random number2;interesting value2;randomnumber2") should be printed as the second line of the output.
the 2nd field of the 2nd line ("interesting value3") is present in the last field of the 3rd line of file2, so the line "random n3;interesting value3;randomnumb3" of file1 should be printed as the 3rd line of the output.
for both the 3rd and the 4th lines of file1, the values of the 2nd field are not present in the last column of file2, so i don't mind the row number of these lines in the output.

i.e.:
Code:
$ sort-with-template.sh file1 -field=2 file2 -field=3
random number1;value1;randomnumber1
random number2;interesting value2;randomnumber2
random n3;interesting value3;randomnumb3
random n4;value4;randomnumb4
or:
Code:
$ sort-with-template.sh file1 -field=2 file2 -field=3
random n4;value4;randomnumb4
random number2;interesting value2;randomnumber2
random n3;interesting value3;randomnumb3
random number1;value1;randomnumber1
in this example there are 2 possible outputs, since the 1st and the 4th lines of the input can be printed as either the 1st or the last line of the output... it does not matter.

thank you for your patience...

Last edited by masavini; 06-18-2012 at 05:10 PM.
 
Old 06-18-2012, 05:42 PM   #5
masavini
Member
 
Registered: Jun 2008
Posts: 193

Original Poster
Rep: Reputation: 6
i can try and post some code...

Code:
#!/bin/bash

# collect the key fields: 2nd field of file1, 3rd field of file2
awk -F ';' '{print $2}' /tmp/file1 > interestingValues1
awk -F ';' '{print $3}' /tmp/file2 > interestingValues2

# walk file2's keys in order: emit the matching file1 line at that
# position, or a numbered placeholder to be filled in later
i=0
while read -r value; do
	check=$(grep -F -x "$value" interestingValues1)
	if [[ -n $check ]]; then
		grep -F "$value" /tmp/file1 >> out
		sed -i /"$value"/d interestingValues1
	else
		echo "mismatch $i" >> out
		let i+=1
	fi
done < interestingValues2

# replace each placeholder with one of the leftover file1 lines
# (anchored, so "mismatch 1" cannot match inside "mismatch 10";
# still fragile if a line contains '/')
if [[ -s interestingValues1 ]]; then
	i=0
	while read -r value; do
		line=$(grep -F "$value" /tmp/file1)
		sed -i "s/^mismatch $i\$/$line/" out
		let i+=1
	done < interestingValues1
fi

cat out
this works, but i guess it's quite ugly...
 
Old 06-18-2012, 06:20 PM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942
Here is my suggestion as an awk script. Save the script, then run it, specifying the files in reverse order, i.e. file2.csv file1.csv.
Code:
#!/usr/bin/awk -f
BEGIN {
    # Each line is a separate record. Allow any newline convention.
    RS = "(\r\n|\n\r|\r|\n)"

    # Fields within a record are separated by semicolons.
    FS = ";"

    # For output, use newlines. (OFS is not used by this script.)
    ORS = "\n"

    # Which input file (1 = first, 2 = second) currently being processed?
    file = 0

    # Number of unused entries?
    unuseds = 0

    # Maximum line number for the used entries?
    lines = 0
}

(FNR == 1) {
    # This is the first record of an input file, so increase the file counter.
    file++
}

(file == 1 && NF >= 3) {
    # First input file, and record contains at least three fields.
    # Check for duplicate entries, then save the line number for this key.
    if ($3 in lineno)
        printf("%s: Line %d: Repeated \"%s\", first seen on line %d.\n", FILENAME, FNR, $3, lineno[$3]) > "/dev/stderr"
    else
        lineno[$3] = FNR
}

(file == 2 && NF >= 2) {
    # Second input file, and record contains at least two fields.
    # Check if there is a saved line number for this key.
    if ($2 in lineno) {
        # Store this record in the "used" array.
        used[lineno[$2]] = $0
        # Save the largest line number seen in "lines".
        if (lineno[$2] > lines)
            lines = lineno[$2]
    } else {
        # No, this is one of the filler lines.
        unused[++unuseds] = $0
    }
}

END {
    # No filler lines used yet.
    unline = 0

    # We know that we must produce at least lines lines,
    # to correctly populate each output line.
    # So, loop over output lines in order.
    for (line = 1; line <= lines; line++) {
        if (line in used) {
            # This line is reserved for one of the saved lines.
            printf("%s%s", used[line], ORS)
        } else
        if (unline < unuseds) {
            # No matching line, but filler lines left, so use a filler line.
            printf("%s%s", unused[++unline], ORS)
        } else {
            # No filler lines left. Umm.. this should not happen, I think.
            # Well, just output an empty line, then.
            printf("%s", ORS)
        }
    }

    # Output any filler lines still left.
    while (unline < unuseds)
        printf("%s%s", unused[++unline], ORS)
}
The script first scans file2.csv for the identifiers, and saves the line number each identifier was seen on. This is why that file needs to be listed first.

The script then scans file1.csv. If the identifier in the current record has a saved line number, the record is stored in the used array, indexed by the line number to be used for output. The script also keeps the largest output line number seen in lines, since that is the minimum number of lines it must output.

Records with an identifier that has no assigned output line number are saved in the unused array, with a monotonically increasing index unuseds.

The END rule is processed once after all records have been processed. Here, we have two loops. The first loop goes from 1 to lines, and outputs the record from the used array for that line if there is one. Otherwise, it will pick the next filler line from the unused array. (Note that this means the unused lines will be used in the order they were seen, not in random order. I think this should be most useful for you.) The second loop in the END rule just makes sure all filler lines have been output.

This model is quite efficient with respect to disk I/O. Associative arrays in awk are quite fast, too: most awks use hashing and other efficient access algorithms. The only downside is that all records from file1.csv are read into memory first, so the overall memory use is somewhat larger than the size of that file. Hopefully that won't be an issue. (If you have a 64-bit distribution, you can easily handle much larger datasets than can fit in memory using most awk variants; the script will just cause a lot of swapping ("thrashing"), and be quite slow. But it should work even then.)

Questions? Comments? Suggestions?
 
Old 06-19-2012, 08:28 AM   #7
masavini
Member
 
Registered: Jun 2008
Posts: 193

Original Poster
Rep: Reputation: 6
it works!
thank you...
 
Old 06-21-2012, 06:07 AM   #8
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,136

Rep: Reputation: 299
This is an interesting problem. The data in File1 must be reordered according to key matches which may be found in File2. Non-trivial.

LQ Senior Member Nominal Animal contributed an awk solution. I respect and admire Nominal's skill and have no wish to compete with him. However, my coding philosophy avoids explicit loops wherever possible on the premise that loops (especially loops in interpreted languages) are slower than one-pass commands. [Candid admission: this is not always true.]

With this post I offer another proposed solution, one which does not use explicit loops. My solution has the disadvantage of freely using temporary files. Consequently the efficiency of loopless code may be offset by the I/O involved with those work files.

I ask that OP masavini run my code and report execution times with further posts to this thread. We may all learn something from those results.

File1 contents may be characterized this way: each line contains
Code:
baggage;key;baggage
I find it convenient to work with a derivative version of this file in which each line is prefixed with the key and its line number, as:
Code:
key;seqnum;baggage;key;baggage
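That derivative line can be produced in one awk pass; the zero-padded sequence number keeps a plain text sort in numeric order (the file name here is hypothetical):

```shell
#!/bin/sh
# Prefix each record with its key (field 2) and a zero-padded line number.
awk -F';' '{printf "%s;%06d;%s\n", $2, NR, $0}' file1.csv
```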
To use my code you will have to change the file identifiers to suit your own computing environment.

Without further ado, this is my program in its entirety:
Code:
#!/bin/bash
#   Daniel B. Martin   Jun12
#
#   To execute this program, launch a terminal session and enter:
#   bash /home/daniel/Desktop/LQfiles/dbm398.bin
#
# This program inspired by ...
# http://www.linuxquestions.org/questions/programming-9/
# bash-sort-lines-in-2-files-so-that-equal-lines-are-at-the-same-line-number-4175412056/

# file1 (key value in 2nd field) ...
# random number2;interesting value2;randomnumber2
# random n3;interesting value3;randomnumb3
# random number1;value1;randomnumber1
# random n4;value4;randomnumb4

# file2 (key value in 3rd field) ...
# string1;number1;val1
# str2;random number2;interesting value2
# some string3;random n3;interesting value3
# strr4;randomn4;some value4

# Note: interesting value2 and interesting value3
#  are key field values which appear in both input files.

# An acceptable output file ...
# random number1;value1;randomnumber1
# random number2;interesting value2;randomnumber2
# random n3;interesting value3;randomnumb3
# random n4;value4;randomnumb4

# Another acceptable output file ...
# random n4;value4;randomnumb4
# random number2;interesting value2;randomnumber2
# random n3;interesting value3;randomnumb3
# random number1;value1;randomnumber1


# File identification
InFile1='/home/daniel/Desktop/LQfiles/dbm398inp1.txt'
InFile2='/home/daniel/Desktop/LQfiles/dbm398inp2.txt'
OutFile='/home/daniel/Desktop/LQfiles/dbm398out.txt'
Work01='/home/daniel/Desktop/LQfiles/dbm398w01.txt'
Work02='/home/daniel/Desktop/LQfiles/dbm398w02.txt'
Work03='/home/daniel/Desktop/LQfiles/dbm398w03.txt'
Work04='/home/daniel/Desktop/LQfiles/dbm398w04.txt'
Work05='/home/daniel/Desktop/LQfiles/dbm398w05.txt'
Work06='/home/daniel/Desktop/LQfiles/dbm398w06.txt'
Work07='/home/daniel/Desktop/LQfiles/dbm398w07.txt'
Work08='/home/daniel/Desktop/LQfiles/dbm398w08.txt'
Work09='/home/daniel/Desktop/LQfiles/dbm398w09.txt'
Work10='/home/daniel/Desktop/LQfiles/dbm398w10.txt'
Work11='/home/daniel/Desktop/LQfiles/dbm398w11.txt'
Work12='/home/daniel/Desktop/LQfiles/dbm398w12.txt'


echo "Prefix lines in InFile1 with key value and sequence number."
echo "Save key values, separately."
awk -F";" '{printf "%s;%06d;%s\n", $2,NR,$0}' < $InFile1 \
|tee $Work11                    \
|cut -d ';' -f1                 \
|sort                           \
> $Work01

echo "Prefix lines in InFile2 with key value and sequence number."
echo "Save key values, separately."
awk -F";" '{printf "%s;%06d;%s\n", $3,NR,$0}' < $InFile2 \
|tee $Work12                     \
|cut -d ';' -f1                  \
|sort                            \
> $Work02

echo "Find key values common to both Work files."
echo "Prefix each string with ^ (starting with) for later use with grep -f."
comm -12 $Work01 $Work02         \
|sed 's/^/\^/g'                  \
> $Work03

echo "Identify lines in InFile2 WITH key-value matches."
echo "Carve out the content of those lines."
echo "This will become part of the final output file."
grep -f $Work03 < $Work12        \
|join -1 1 -2 1 -t';' - $Work11  \
|cut -d ';' -f2,7-               \
> $Work04

echo "Identify lines in InFile1 WITHOUT key-value matches."
echo "Carve out lines which will become part of the final output file."
grep -v -f $Work03 < $Work11     \
|cut -d ';' -f3-                 \
> $Work05

echo "Determine the output file line numbers for these lines."
echo "Those positions will be whatever was not already assigned."
echo "Assign sequence numbers to the 'no match' content lines from InFile1."
cut -d ';' -f1 $Work04 > $Work07
cut -d ';' -f2 $Work11 > $Work08
comm -3 $Work07 $Work08          \
|sed 's/^[ \t]*//'               \
|paste -d ";" - $Work05 > $Work10

echo "Merge partial result files and trim position values."
sort -m $Work10 $Work04          \
|cut -d ";" -f2-                 \
> $OutFile


#
# "An acceptable output file ..."
# "random number1;value1;randomnumber1"
# "random number2;interesting value2;randomnumber2"
# "random n3;interesting value3;randomnumb3"
# "random n4;value4;randomnumb4"
#
# "Another acceptable output file ..."
# "random n4;value4;randomnumb4"
# "random number2;interesting value2;randomnumber2"
# "random n3;interesting value3;randomnumb3"
# "random number1;value1;randomnumber1"

#
# "Contents of OutFile (the finished product.)"
# cat $OutFile

echo
echo "Our work is done.  Punch out.  Go home."
echo 'Normal end of job.'
echo 'Execution ended.'
echo

exit
Daniel B. Martin
 
Old 06-21-2012, 08:27 AM   #9
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942
Quote:
Originally Posted by danielbmartin View Post
Nominal's skill
Blrblblblbl, I make more than my share of mistakes and errors.

Personally, seeing different approaches to solving the same problem is one of the reasons I'm a member here. I very much appreciate seeing others' solutions; doubly so when the approach/methodology is discussed or explained. Simply put, "competing" solutions are extremely valuable, in my opinion. I don't think of this as a competition, more like a friendly discussion of different approaches.

That said, I'd like to suggest a small change with respect to temporary files: use an automatically deleted temporary directory to house all the temporary files:
Code:
Work="$(mktemp -d)" || exit $?
trap "rm -rf '$Work'" EXIT
The above creates a safe (not accessible to other users) temporary directory (or aborts the script if that proves impossible), then sets a trap that removes the directory and all its contents automatically when the script exits. The trap works even if the script dies due to an error. The quoting is such that $Work is evaluated immediately, so even if you change the value of Work later, it will not affect the trap. It is quite safe to use.

Instead of using $Work01 in your script, you'd use "$Work/01" for example. (Initially, $Work will always be a pristine, empty directory, so you can freely choose any file names you wish for the temporary files.)
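Applied concretely, the pattern looks like this (the work-file name 01 is just an example):

```shell
#!/bin/sh
# Private temporary directory, removed automatically on any exit.
Work="$(mktemp -d)" || exit $?
trap "rm -rf '$Work'" EXIT

printf 'interim result\n' > "$Work/01"   # instead of $Work01
cat "$Work/01"
```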
 
2 members found this post helpful.
Old 06-21-2012, 09:12 AM   #10
masavini
Member
 
Registered: Jun 2008
Posts: 193

Original Poster
Rep: Reputation: 6
i'll be honest... i'm still using my old script version...

the problem with Nominal Animal's solution was that the "real" file1 has VERY long and "complex" lines... the last field of each line contains a long html page with javascript and MANY special characters...

while using Nominal Animal's script, it happened that even if file1 and file2 had the same number of rows, the output was shorter... a few lines were always missing and i had no time for proper debugging...

this routine is in the middle of one of my most important scripts, so i need VERY stable code (even if it's a bit slower)...

here is my current code:

Code:
	# collect the key fields: 2nd field of file1, 3rd field of file2
	awk -F ';' '{print $2}' file1 > titles1
	awk -F ';' '{print $3}' file2 > titles2

	# walk file2's keys: emit the matching file1 line or a placeholder;
	# grep -F matches the title literally, so regex metacharacters are safe
	i=0
	while read -r title; do
		check=$(grep -F -x "$title" titles1)
		if [[ -n $check ]]; then
			grep -i -F "$title" file1 >> scratch
			grep -F -x -v "$title" titles1 > titles1.tmp && mv titles1.tmp titles1
		else
			echo "mismatch" >> scratch
			let i+=1
		fi
	done < scratch_unused 2>/dev/null || true
	done < titles2

	# replace each placeholder with the next leftover file1 line
	# (unconditional, so output.csv is built even with no leftovers)
	i=0
	while read -r line; do
		if [[ $line == "mismatch" ]]; then
			title=$(head -1 titles1)
			grep -i -F "$title" file1 >> output.csv
			sed -i '1d' titles1
		else
			echo "$line" >> output.csv
		fi
		let i+=1
	done < scratch

	cat output.csv
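for the record, the missing-line symptom is consistent with regex metacharacters in the data: plain grep treats its pattern as a regular expression, so characters like [ in a title can make a match fail or error out entirely. a minimal illustration with made-up data (grep -F matches the string literally):

```shell
#!/bin/sh
# A title containing a regex metacharacter, as HTML/JavaScript often does.
title='var x = items['
printf '%s\n' "$title" > demo.txt

# grep "$title" would error out here ("Unmatched [ or [^");
# -F disables regex interpretation, -- guards against leading dashes.
grep -cF -- "$title" demo.txt
```

the same caveat applies to any sed command that interpolates a title into its pattern.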
 
Old 06-21-2012, 01:58 PM   #11
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,136

Rep: Reputation: 299
Quote:
Originally Posted by Nominal Animal View Post
... I'd like to suggest a small change with respect to temporary files: use an automatically deleted temporary directory to house all the temporary files ...
Yes, of course. Production code should be "well behaved" and clean up after itself. For program development I leave the work files on disk for post-execution inspection. They assure me that all the interim steps worked as designed... and if not, they show where things went awry.

With regard to execution time: the small sample files provided by OP run in zero time with my code and probably yours too. Could you generate test files with 50,000 lines for a real horse race?
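For anyone wanting to run that race, here is one way to fabricate inputs in the sample format (sizes, field layout, and the fraction of matching keys are all assumptions):

```shell
#!/bin/sh
# Generate two semicolon-delimited test files: big1.csv carries its key in
# field 2, big2.csv in field 3; roughly half of big2's keys match big1.
N=50000

awk -v n="$N" 'BEGIN {
    srand(42)
    for (i = 1; i <= n; i++)
        printf "rand%d;key%d;payload%d\n", int(rand() * 1000000), i, i
}' > big1.csv

awk -v n="$N" 'BEGIN {
    srand(7)
    for (i = 1; i <= n; i++) {
        if (rand() < 0.5)
            k = "key" (int(rand() * n) + 1)   # matches some big1 key
        else
            k = "nokey" i                     # matches nothing
        printf "str%d;filler%d;%s\n", i, i, k
    }
}' > big2.csv

wc -l big1.csv big2.csv
```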

Daniel B. Martin
 