bash: sort lines in 2 files so that equal lines are at the same line number...
hi,
i have two files like these:

file1.csv:
Code:
23382;gigi;2
file2.csv:
Code:
3;2;baba

in this example the first line (2nd field = gigi) should be printed out as the 2nd line, and the 4th line (dede) should be printed as the 3rd line. order of non matching lines does not matter...

example output:
Code:
23312;gaga;2 |
I've just read thru this 3 times, and I cannot see what the rule is. Perhaps a before and after example with more lines in each file.
Possibly more important, what overall problem are you trying to solve? |
Quote:
I have a hunch that join will perform most (possibly all) of the desired function. Daniel B. Martin |
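As an illustration of that hunch (a sketch only, not code from this thread), a join-based attempt might look like the following, assuming the key is field 2 of file1 and field 3 of file2 as the OP describes below; join pairs up the matching lines, but it does not by itself place them on particular line numbers:
Code:
# sketch only: join needs both inputs sorted on the join field
join -t ';' -1 2 -2 3 <(sort -t ';' -k2,2 file1) <(sort -t ';' -k3,3 file2)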
Quote:
let's say we have 2 text files... they contain fields separated by ; and one field of the first file and one field of the 2nd file MAY contain the same value:
Code:
$ cat file1
Code:
$ cat file2

processing file1 line by line: the 2nd field of the first line is "interesting value2". this value is present in the last column of the second line of file2, so this line of file1 ("random number2;interesting value2;randomnumber2") should be printed as the second line of the output. the 2nd field of the 2nd line ("interesting value3") is present in the last field of the 3rd line of file2, so the line "random n3;interesting value3;randomnumb3" of file1 should be printed as the 3rd line of the output. for both the 3rd and the 4th lines of file1, the values of the 2nd fields are not present in the last column of file2, so i don't mind the row number of these lines in the output. i.e.:
Code:
$ sort-with-template.sh file1 -field=2 file2 -field=3

thank you for your patience... |
i can try and post some code...
Code:
#!/bin/bash |
Here is my suggestion as an awk script. Save the script, then run it, specifying the files in reverse order, i.e. file2.csv file1.csv.
Code:
#!/usr/bin/awk -f

While reading file2.csv (the first file on the command line), the script records the output line number for each identifier found in its last field. The script then scans file1.csv. For each record, if its identifier has an output line number, the record is saved in the used array, indexed by that line number. The script also keeps the largest output line number seen in lines, since that is the minimum number of lines it must output. Records whose identifier has no specified output line number are saved in the unused array, with a monotonically increasing index unuseds.

The END rule is processed once after all records have been processed. Here we have two loops. The first loop goes from 1 to lines, and outputs the record from the used array for that line if there is one; otherwise it picks the next filler line from the unused array. (Note that this means the unused lines are used in the order they were seen, not in random order. I think this should be most useful for you.) The second loop in the END rule just makes sure all remaining filler lines have been output.

This model is quite efficient with respect to disk I/O. Associative arrays in awk are quite fast, too: most awks use hashing and other efficient access algorithms. The only downside is that all records from file1.csv are read into memory first, so the overall memory use is somewhat larger than that file. Hopefully that won't be an issue. (If you have a 64-bit distribution, most awk variants can easily handle datasets much larger than can fit in memory; the script will just cause a lot of swapping ("thrashing") and be quite slow. But it should work even then.)

Questions? Comments? Suggestions? |
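For reference, here is a minimal sketch of an awk script that follows the description above. It is a reconstruction under stated assumptions (the identifier is the last field of file2.csv and the 2nd field of file1.csv, fields are ';'-separated), not the original script:
Code:
#!/usr/bin/awk -f
# Usage (files in reverse order): awk -f thisscript.awk file2.csv file1.csv
# Sketch reconstructed from the description; field numbers are assumptions.
BEGIN { FS = ";" }

# First file (file2.csv): remember which output line number each
# identifier (its last field) should end up on.
NR == FNR { line[$NF] = FNR; next }

# Second file (file1.csv): records whose identifier has an assigned
# output line go into used[]; the rest become fillers in unused[].
{
    if ($2 in line) {
        used[line[$2]] = $0
        if (line[$2] > lines) lines = line[$2]
    } else {
        unused[++unuseds] = $0
    }
}

END {
    # Emit lines 1..lines, filling unassigned slots from the filler queue
    # in the order the fillers were seen.
    for (i = 1; i <= lines; i++) {
        if (i in used)
            print used[i]
        else if (next_filler < unuseds)
            print unused[++next_filler]
        else
            print ""    # out of fillers; keep line numbers aligned
    }
    # Emit any fillers that are still left over.
    while (next_filler < unuseds)
        print unused[++next_filler]
}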
it works!
thank you... |
This is an interesting problem. The data in File1 must be reordered according to key matches which may be found in File2. Non-trivial.
LQ Senior Member Nominal Animal contributed an awk solution. I respect and admire Nominal's skill and have no wish to compete with him. However, my coding philosophy avoids explicit loops wherever possible, on the premise that loops (especially loops in interpreted languages) are slower than one-pass commands. [Candid admission: this is not always true.] With this post I offer another proposed solution, one which does not use explicit loops.

My solution has the disadvantage of freely using temporary files. Consequently, the efficiency of loopless code may be offset by the I/O involved with those work files. I ask that OP masavini run my code and report execution times with further posts to this thread. We may all learn something from those results.

File1 contents may be characterized this way: each line contains
Code:
baggage;key;baggage

The work files hold each File1 line prefixed with its key and the sequence number of the matching File2 line:
Code:
key;seqnum;baggage;key;baggage

Without further ado, this is my program in its entirety:
Code:
#!/bin/bash |
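A rough sketch of how such a loopless pipeline could be put together with sort, join and cut (an assumption for illustration, not the actual program above; the field layout follows the OP's description, and non-matching lines are simply appended after the matched ones rather than slotted into unused line numbers):
Code:
#!/bin/bash
# Hypothetical loopless sketch using sort/join/cut; not the program above.
# Assumed layout: key = field 2 of file1.csv, key = field 3 of file2.csv.
export LC_ALL=C                       # keep sort and join collation consistent

Work="$(mktemp -d)" || exit $?
trap 'rm -rf "$Work"' EXIT

# key;seqnum pairs: the last field of each file2 line plus its line number
awk -F';' '{ print $3 ";" NR }' file2.csv | sort -t';' -k1,1 > "$Work/template"

# key;<original line> records from file1
awk -F';' '{ print $2 ";" $0 }' file1.csv | sort -t';' -k1,1 > "$Work/data"

# matched records take the form key;seqnum;baggage;key;baggage;
# order them by seqnum, then strip the prefix again
join -t';' "$Work/template" "$Work/data" \
    | sort -t';' -k2,2n | cut -d';' -f3- > "$Work/matched"

# file1 lines whose key never appears in file2
join -t';' -v2 "$Work/template" "$Work/data" | cut -d';' -f2- > "$Work/unmatched"

cat "$Work/matched" "$Work/unmatched"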
Quote:
Personally, seeing different approaches to solving the same problem is one of the reasons I'm a member here. I very much appreciate seeing others' solutions; doubly so when the approach/methodology is discussed or explained. Simply put, "competing" solutions are extremely valuable, in my opinion. I don't think of this as a competition, more like a friendly discussion regarding different approaches.

That said, I'd like to suggest a small change with respect to temporary files: use an automatically deleted temporary directory to house all the temporary files:
Code:
Work="$(mktemp -d)" || exit $?

Instead of using $Work01 in your script, you'd use "$Work/01" for example. (Initially, $Work will always be a pristine, empty directory, so you can freely choose any file names you wish for the temporary files.) |
i'll be honest... i'm still using my old script version...
the problem with Nominal Animal's solution was that the "real" file1 has VERY long and "complex" lines... the last field of each line contains a long html page with javascript and MANY special characters... while using Nominal Animal's script, it happened that even if file1 and file2 had the same row count, the output was shorter... a few lines were always missing and i had no time for proper debugging... this routine is in the middle of one of my most important scripts, so i need VERY stable code (even if it's a bit slower)...

here is my actual code:
Code:
awk -F ';' '{print $2}' file1 > titles1 |
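One hypothetical way to narrow down why lines went missing (an assumption, not something tried in the thread): with the line-number bookkeeping described earlier, duplicate keys in file1 overwrite each other's slot in the used array, so checking the key field for duplicates and odd field counts may explain the shorter output:
Code:
# hypothetical diagnostics; assumes the key is field 2 of file1 and the
# reordered result is in a file named output (name assumed)
wc -l file1 output                                        # do the counts differ?
awk -F';' 'seen[$2]++ { print NR ": duplicate key " $2 }' file1
awk -F';' 'NF < 3     { print NR ": unexpected field count " NF }' file1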
Quote:
With regard to execution time: the small sample files provided by OP run in zero time with my code and probably yours too. Could you generate test files with 50,000 lines for a real horse race? Daniel B. Martin |
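One possible way to generate such test files (an assumed sketch, not part of the thread): build two 50,000-line files in the ';'-separated layout discussed above, with roughly half of the keys present in both files:
Code:
awk 'BEGIN { srand(); for (i = 1; i <= 50000; i++)
        printf "%d;key%d;%d\n", int(rand()*1e6), i, int(rand()*1e6) }' > file1.csv
awk 'BEGIN { srand(); for (i = 1; i <= 50000; i++)
        printf "%d;%d;key%d\n", int(rand()*1e6), int(rand()*1e6), 2*i }' > file2.csv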