LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   AWK parsing a CVS file with a seperate list file (https://www.linuxquestions.org/questions/programming-9/awk-parsing-a-cvs-file-with-a-seperate-list-file-935724/)

alexwely 03-21-2012 01:00 PM

AWK parsing a CVS file with a seperate list file
 
I am trying to find a way to parse a csv file, pulling out lines where the fourth field matches any value in a list file. I know you can use something like "Fgrep -f list input.csv" and that will pull out the lines matching any instance from the list but in my particular case i specifically need to match only field four... What i am currently doing is using a loop to cut out the fourth field passing it through grep again then printing the line to a file if it matches... I think there is just an easier way to do it in awk or maybe perl. Also performance is crucial here since the source file can have over 100K lines and the pattern list can have about 1000 lines.


So my code is:

Code:

echo -e "Do Run1"
  while read LINE
      do
CID=`echo -e $LINE | cut -d"," -f4`
echo -e ",${CID}," | /bin/grep -f CIDIDs.txt
LASTRet=$?

if [ ${LASTRet} -eq 0 ]; then

echo -e ${LINE} >> results.csv
fi
fi
done < input.csv


As you can see this can take forever through a large file



Thanks!

grail 03-21-2012 01:16 PM

Please show the format and some data for the 2 files?

Would you also please explain the concept behind the following line:
Code:

echo -e ",${CID}," | /bin/grep -f CIDIDs.txt

alexwely 03-21-2012 01:30 PM

Quote:

Originally Posted by grail (Post 4632773)
Please show the format and some data for the 2 files?

Would you also please explain the concept behind the following line:
Code:

echo -e ",${CID}," | /bin/grep -f CIDIDs.txt



Lets say the input csv file is composed of the folliwing data

Code:

20120315,152638,0010000119,224,UT01,foobar,NVLS,D,0.00,3000,3000,0,48.4091,,,20120315886
20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
20120315,075103,0020000220,4678,A998,OS,JYF,XX,32.5,2000,0,0,,ALGO,xas,159873-1312-42,WM


And the pattern list has the following:

Code:

4589
1455
2236
325
4678

Basically I just want the output to be:

Code:

20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
20120315,075103,0020000220,4678,A998,OS,JYF,XX,32.5,2000,0,0,,ALGO,xas,159873-1312-42,WM


In my real world example due to some preprocessing the pattern list has numbers with the commas in them like:

Code:

,4589,
,1455,
,2236,
,325,
,4678,

So it just adds them to the comparison so they can match

grail 03-21-2012 02:08 PM

Well using the real world example I would do something like:
Code:

awk -F, 'FNR==NR{list[$2];next}$4 in list' pattern input.csv

alexwely 03-21-2012 02:32 PM

Quote:

Originally Posted by grail (Post 4632828)
Well using the real world example I would do something like:
Code:

awk -F, 'FNR==NR{list[$2];next}$4 in list' pattern input.csv

That doesn't exactly work because lets take the example of

20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,

when you compare it to the pattern list it will match patterns that are say like:

5325
3259

In my real world pattern list i actually have the commas in place there because it limits the pattern matching to exactly the pattern

For example... the previous line should only match a pattern of

,325,

and not

,3256,

Is there a way to do that in the one liner? To actually include the commas and field four and have it evaluate as ,325, and not just 325?

grail 03-21-2012 02:43 PM

Did you test it? The array 'list' has indexes equal to exactly each value in the pattern, therefore, as neither 5325 or 3259 is equal to 325 it will not be in the array and hence not printed.

Tinkster 03-21-2012 03:06 PM

Indeed. grail's approach isn't using pattern matching. :}

alexwely 03-22-2012 08:51 AM

Quote:

Originally Posted by grail (Post 4632864)
Did you test it? The array 'list' has indexes equal to exactly each value in the pattern, therefore, as neither 5325 or 3259 is equal to 325 it will not be in the array and hence not printed.

This actually worked with a minor modification in the pattern list... Thanks a lot totally saved me so much time... instead of a 2 hour run this finished in about 20 seconds!

Alex :)


All times are GMT -5. The time now is 10:37 PM.