For large files, bash is slow and awk would be a better choice -- both for performance and because the language is naturally suited to sequential reading.
Assuming the CSV files are sorted by the third and fourth fields ascending, then this awk script should be pretty fast:
Code:
awk -v "file=file1.csv" '
    BEGIN {
        RS="[\r\n]+"
        FS=","
        OFS=","
        have = split("", field)
        while ((getline line < file) > 0) {
            have = split(line, field)
            if (field[3] == "" || field[4] == "") {
                print line
                continue
            }
            break
        }
    }
    ($3 == "" || $4 == "" || have < 1) {
        print $0
        next
    }
    {
        if (field[3] "," field[4] < $3 "," $4)
            while (1) {
                print line
                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }
                have = split(line, field)
                if (field[3] == "" || field[4] == "")
                    continue
                if (field[3] "," field[4] >= $3 "," $4)
                    break
            }
        if ($3 == field[3] && $4 == field[4]) {
            if ($1 == "") $1 = field[1]
            if ($2 == "") $2 = field[2]
            print $0
            while (1) {
                if ((getline line < file) < 1) {
                    have = split("", field)
                    break
                }
                have = split(line, field)
                if (field[3] == "" || field[4] == "") {
                    print line
                    continue
                }
                break
            }
        } else
            print $0
    }
    END {
        while ((getline line < file) > 0)
            print line
    }
' "file2.csv"
The script reads the two CSV files in parallel. Any record (line) in either file with an empty third or fourth field is output as is. For each record of the primary input file, it first reads records from the secondary file until the two are in sync. (This is why the files need to be sorted by the third and fourth fields. You can use sort -t , -k 3 fileN.csv > sortedN.csv to sort the CSV files.) If both files contain the record, the merged record is output, and another record is then read from the secondary file. Otherwise the primary input record is printed. Finally, if the primary file runs out first, the END section prints out the remaining secondary file records.
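A sketch of the pre-sorting step, in case it helps. It assumes the keys should be compared as plain byte strings, so it forces the C locale. The helper name sort_by_key is hypothetical, and -k 3,4 (fields three through four only) is my substitution for the -k 3 shown above, not something the script requires.

```shell
#!/bin/sh
# Hypothetical helper: sort a comma-separated file on fields 3 and 4.
# LC_ALL=C is an assumption: it makes sort compare bytes rather than
# follow the collation rules of the current locale.
sort_by_key() {
    LC_ALL=C sort -t , -k 3,4 "$1"
}

# Example use (file names as in the post above):
#   sort_by_key file1.csv > sorted1.csv
#   sort_by_key file2.csv > sorted2.csv
```

If the awk in use follows the locale when comparing strings, running it under LC_ALL=C as well should keep both tools in agreement about the ordering.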
For me, this seems to reproduce your desired output.
Well, try to digest what is happening in each script. The general idea should be fairly clear though, i.e. one file's data is being looked at via the fields ($3, $4 in the examples) and the other file is being split into an array (using my example that would be f[3] and f[4]).
Using this philosophy you can easily see which fields and array items you should compare for your most recent question.
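To make the idea concrete, here is a minimal sketch, not the merge script itself. It simply pairs the two files line by line: the main input is seen through $3 and $4, while the other file is read with getline and split into the array f. The function name compare_keys and the output wording are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: $1 is the file read via getline, $2 is the main input.
compare_keys() {
    awk -v "file=$1" -F, '
    {
        # Read the next line of the other file and split it into f[].
        if ((getline line < file) > 0) {
            split(line, f, ",")
            verdict = ($3 == f[3] && $4 == f[4]) ? "match" : "differ"
            # Show the main-input key first, then the key from the other file.
            print verdict, $3 "," $4, f[3] "," f[4]
        }
    }' "$2"
}

# Example use:
#   compare_keys file1.csv file2.csv
```

For a different pair of key columns, only the field numbers on both sides of the comparison need to change.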
If you follow your code through line by line, the scenario is:
On the second pass through scratch2 you will hit the following code:
Code:
if ($1 == field[1] && $2 == field[2]) {
    if ($3 == "") $3 = field[3]
    print $0    # this will print the second line from scratch2
    while (1) {
        if ((getline line < file) < 1) {    # This will get the last line from scratch1
            have = split("", field)
            break
        }
        have = split(line, field)    # split entries into field
        if (field[1] == "" || field[2] == "") {
            print line
            continue
        }
        break    # break out of loop
    }
}
After the break above, awk looks for the next record in scratch2, and there is none.
So now it goes into the END section:
Code:
END {
    while ((getline line < file) > 0)    # getline returns 0 here, as the last line was already read earlier
        print line
}
As you can see, the entry already retrieved from scratch1 is discarded: the getline in END finds nothing left to read, so that entry is never tested and therefore never printed.
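The effect can be reproduced in isolation. The sketch below is only a demonstration of the mechanism, with made-up data and a hypothetical function name (lookahead_demo): once a line has been read with getline, a later getline loop in END no longer sees it.

```shell
#!/bin/sh
# Hypothetical demo: $1 is a file with at least two lines.
lookahead_demo() {
    awk -v "file=$1" '
    BEGIN {
        getline line < file    # reads the first line
        getline line < file    # reads the second line; it now lives only in the variable
    }
    END {
        n = 0
        while ((getline line < file) > 0)    # nothing is left if the file had two lines
            n++
        print "records left for END:", n
        print "pending record never printed:", line
    }' /dev/null
}

tmp=$(mktemp -d)
printf '%s\n' 'first' 'last' > "$tmp/scratch1"
lookahead_demo "$tmp/scratch1"
rm -r "$tmp"
```

With the two-line sample file, END finds zero records to read, while the variable still holds the text "last".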
i mean, the "problematic" line is the last of scratch1:
product01;code02;
Code:
{
    if ($1 == field[1] && $2 == field[2]) {    # <- this condition is not satisfied, so it should skip down
        if ($3 == "") $3 = field[3]
        print $0
        while (1) {
            if ((getline line < file) < 1) {
                have = split("", field)
                break
            }
            have = split(line, field)
            if (field[1] == "" || field[2] == "") {
                print line
                continue
            }
            break
        }
    } else
        print $0    # <-- HERE and print the line as is
}
i feel like i'm back at school again... and as if i didn't study enough...
that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file. in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)...
what am i missing?
btw, i'm about to drop it and do it with good old bash:
Code:
while read LINE
do
    CHECK=$(echo "$LINE" | awk -F ';' '{print $1";"$2}')
    rm scratch
    cat file1 | grep "^$CHECK;" > scratch
    if [ -e scratch ] && [ "$(cat scratch | awk -F ';' '{print $3}' | head -1)" != "" ]; then
        cat scratch >> new-file
        echo "$LINE" >> new-file
        sed -i /"^$CHECK;"/d file1
    else
        echo "$LINE" >> new-file
    fi
done < file2
cat new-file | sed -n '/[^;]*;[^;]*;[^;]/p' > new-file2
cat file1 new-file2 | sort | uniq > output
Quote:
that line should check if the fields #1 and #2 in the 1st file are equal to the fields #1 and #2 in the 2nd file
This part is correct.
Quote:
in the case of the line that gives me the problem, it should return false and go directly in the "else" instruction (print $0)
But this is not. You are not keeping track of where you are up to in each file.
So let us step through the code:
1. The BEGIN - apart from setting a few variables, the only thing this gives us is the first line of scratch1 split into the field array
2. Testing for empty fields $1 and $2 - this never triggers in our particular example
3. So the following section of code is our focus:
Code:
{
    if ($1 == field[1] && $2 == field[2]) {
        if ($3 == "") $3 = field[3]
        print $0
        while (1) {
            if ((getline line < file) < 1) {
                have = split("", field)
                break
            }
            have = split(line, field)
            if (field[1] == "" || field[2] == "") {
                print line
                continue
            }
            break
        }
    } else
        print $0
}
3a. Are first and second fields from both files equal - for the first line in our file this is true and $3 from scratch2 is not empty so the line below is printed:
Code:
product01;code01;brand1
3b. Enter the while loop - get the next line from scratch1, which does exist, so the first if is false.
- split into the field array and test if either field is empty - they are not, so the if is false and break is executed (ie we leave the while loop)
3c. As per point 2 above, the test is false. Test if items from both files are equal (remembering that the field array was set on the previous pass) - again they are both equal and $3 in scratch2 is not empty, so print the line below:
Code:
product01;code01;brand2
3d. This is a copy of all steps performed in 3b above
Here is where you seem to be getting lost
3e. There are now no more lines in scratch2 to be read so we now jump to END (notice how this differs from 3c, ie we never reach the test you were thinking of). So END says:
Code:
END {
    while ((getline line < file) > 0)
        print line
}
So it requests the next line from scratch1, but we already retrieved the last line at 3d above. The getline test therefore fails, as the result is not greater than zero, so the while loop is exited immediately and the script ends.
I know I have not given the answer exactly, but I am happy to help you learn (seriously, not trying to treat you like you are at school).
I personally find it much more gratifying to come across the solution myself.
I can also tell you that there are plenty of changes that can be made to your bash, awk, sed, grep and cat script. The main one being that the use of so many different commands is almost never required. The first would be that nearly every call to cat is not required, but that is a story for another time.
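As a small taste of that, here is a sketch of how the "cat file1 | grep ... > scratch" step and the later "cat scratch | awk ... | head -1" test could collapse into a single awk call that reads the file directly. The function name third_field_of_first_match is hypothetical, and this covers only that one lookup, not the whole loop.

```shell
#!/bin/sh
# Hypothetical helper.
#   $1 = key such as "product01;code01"
#   $2 = semicolon-separated file to search
# Prints the third field of the first line whose first two fields match the key.
third_field_of_first_match() {
    awk -F ';' -v key="$1" '($1 ";" $2) == key { print $3; exit }' "$2"
}

# Example use inside the loop:
#   BRAND=$(third_field_of_first_match "$CHECK" file1)
#   if [ "$BRAND" != "" ]; then ...; fi
```

No cat, grep, head or scratch file is needed for the test, because awk opens the file itself and stops at the first match.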
Let me know if you do not follow the logic as supplied above.
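For anyone who would rather see one possible way out than work it through, here is a cut-down sketch. It is not the full script: the loops that pass through records with empty key fields are left out, the function names merge_keep_pending and advance are hypothetical, and the sample data is invented because the real scratch files are not shown in the thread. The only point is the extra test in END, which prints the record that was read ahead but not yet output.

```shell
#!/bin/sh
# Sketch only. $1 = file read via getline (scratch1), $2 = main input (scratch2).
merge_keep_pending() {
    awk -v "file=$1" '
    # Read the next getline-file record into line/field; have is 1 on success.
    function advance() {
        have = ((getline line < file) > 0)
        if (have) split(line, field, ";")
    }
    BEGIN { FS = ";"; OFS = ";"; advance() }
    {
        if (have && $1 == field[1] && $2 == field[2]) {
            if ($3 == "") $3 = field[3]
            print $0
            advance()
        } else
            print $0
    }
    END {
        if (have) print line    # the record read ahead but not yet output
        while ((getline line < file) > 0)
            print line
    }' "$2"
}

# Invented sample data, for illustration only.
tmp=$(mktemp -d)
printf '%s\n' 'product01;code01;brand1' 'product01;code02;' > "$tmp/scratch1"
printf '%s\n' 'product01;code01;' > "$tmp/scratch2"
merge_keep_pending "$tmp/scratch1" "$tmp/scratch2"
rm -r "$tmp"
```

With that sample data, the merged line product01;code01;brand1 is printed first and the look-ahead record product01;code02; follows it instead of being lost.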
Quote:
I know I have not given the answer exactly, but I am happy to help you learn (seriously, not trying to treat you like you are at school).
I personally find it much more gratifying to come across the solution myself.
i find learning gratifying as well, but if gratifications are balanced or exceeded by frustrations, it's not a good deal: the question i posted is just one of the dozens of issues i'm running into while writing this awk script, so i guess i'd better give up...