LinuxQuestions.org
03-21-2012, 01:00 PM   #1
alexwely
LQ Newbie
 
Registered: Mar 2012
Posts: 8

Rep: Reputation: Disabled
AWK: parsing a CSV file with a separate list file


I am trying to find a way to parse a CSV file, pulling out the lines where the fourth field matches any value in a list file. I know you can use something like "fgrep -f list input.csv", which pulls out the lines matching any entry from the list, but in my case I need to match only field four. What I am currently doing is looping over the file, cutting out the fourth field, passing it through grep, and then printing the line to a file if it matches. I think there must be an easier way to do it in awk or maybe Perl. Performance is also crucial here, since the source file can have over 100K lines and the pattern list about 1000 lines.


So my code is:

Code:
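echo -e "Do Run1"
while read -r LINE
do
    # Pull out the fourth comma-separated field
    CID=`echo -e "$LINE" | cut -d"," -f4`
    # Wrap it in commas and look for it in the pattern list
    echo -e ",${CID}," | /bin/grep -f CIDIDs.txt
    LASTRet=$?

    # grep returns 0 on a match, so keep the line
    if [ ${LASTRet} -eq 0 ]; then
        echo -e "${LINE}" >> results.csv
    fi
done < input.csv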

As you can see, this can take forever on a large file, because it starts cut and grep once for every input line.



Thanks!
 
03-21-2012, 01:16 PM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
Please show the format and some data for the 2 files?

Would you also please explain the concept behind the following line:
Code:
echo -e ",${CID}," | /bin/grep -f CIDIDs.txt
 
03-21-2012, 01:30 PM   #3
alexwely
LQ Newbie
 
Registered: Mar 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by grail
Please show the format and some data for the 2 files?

Would you also please explain the concept behind the following line:
Code:
echo -e ",${CID}," | /bin/grep -f CIDIDs.txt


Let's say the input CSV file contains the following data:

Code:
20120315,152638,0010000119,224,UT01,foobar,NVLS,D,0.00,3000,3000,0,48.4091,,,20120315886
20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
20120315,075103,0020000220,4678,A998,OS,JYF,XX,32.5,2000,0,0,,ALGO,xas,159873-1312-42,WM

And the pattern list has the following:

Code:
4589
1455
2236
325
4678
Basically I just want the output to be:

Code:
20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
20120315,075103,0020000220,4678,A998,OS,JYF,XX,32.5,2000,0,0,,ALGO,xas,159873-1312-42,WM

In my real-world case, due to some preprocessing, the pattern list has the numbers wrapped in commas, like this:

Code:
,4589,
,1455,
,2236,
,325,
,4678,
So the echo line just wraps field four in commas so that it can match those patterns.
 
03-21-2012, 02:08 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
Well using the real world example I would do something like:
Code:
awk -F, 'FNR==NR{list[$2];next}$4 in list' pattern input.csv
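
For anyone unfamiliar with the idiom, the one-liner above can be unpacked roughly as in the sketch below. It builds throwaway copies of the sample files from post #3 in a temporary directory; the file names and the `result` variable are only illustrative.

```shell
# Build throwaway sample files (data copied from post #3).
tmp=$(mktemp -d)
printf '%s\n' ',4589,' ',1455,' ',2236,' ',325,' ',4678,' > "$tmp/pattern"
cat > "$tmp/input.csv" <<'EOF'
20120315,152638,0010000119,224,UT01,foobar,NVLS,D,0.00,3000,3000,0,48.4091,,,20120315886
20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
20120315,075103,0020000220,4678,A998,OS,JYF,XX,32.5,2000,0,0,,ALGO,xas,159873-1312-42,WM
EOF

result=$(awk -F, '
    # First file only: FNR (per-file line number) equals NR (global line number).
    # A line such as ",325," splits into "", "325", "", so $2 is the bare ID.
    FNR == NR { list[$2]; next }
    # Second file: print the line when field 4 is an exact key of the array.
    $4 in list
' "$tmp/pattern" "$tmp/input.csv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the pattern file is read once into an array and each input line costs a single lookup, no external process is started per line, which is where the loop version loses its time.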
 
03-21-2012, 02:32 PM   #5
alexwely
LQ Newbie
 
Registered: Mar 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by grail
Well using the real world example I would do something like:
Code:
awk -F, 'FNR==NR{list[$2];next}$4 in list' pattern input.csv
That doesn't exactly work. Take this line, for example:

20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,

When you compare it to the pattern list, it will match patterns such as:

5325
3259

In my real-world pattern list I actually have the commas in place, because they limit the match to exactly that value.

For example, the previous line should only match a pattern of

,325,

and not

,3256,

Is there a way to do that in the one-liner? That is, to include the commas around field four so that it is evaluated as ,325, and not just 325?
 
03-21-2012, 02:43 PM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
Did you test it? The array 'list' has indexes exactly equal to each value in the pattern file. As neither 5325 nor 3259 is equal to 325, those entries will not match field four of that line.
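
To make the difference concrete, here is a small sketch contrasting awk's `in` operator (an exact key lookup) with a regular-expression match (`~`). The IDs are the ones discussed above; the variable names `exact` and `loose` are just illustrative.

```shell
# Exact lookup: only the key "325" exists, so 5325 and 3259 are not found.
exact=$(printf '%s\n' 325 5325 3259 | awk '
    BEGIN { list["325"] }
    ($1 in list) { print $1 }
')

# Regex match, for contrast: /325/ matches anywhere inside the field.
loose=$(printf '%s\n' 325 5325 3259 | awk '$1 ~ /325/ { print $1 }')

printf 'exact: %s\n' "$exact"
printf 'loose: %s\n' "$(printf '%s' "$loose" | tr '\n' ' ')"
```

The first command prints only 325, while the second prints all three values, which is the substring behaviour the grep-based loop had to work around with the surrounding commas.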
 
03-21-2012, 03:06 PM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Indeed. grail's approach isn't using pattern matching. :}
 
03-22-2012, 08:51 AM   #8
alexwely
LQ Newbie
 
Registered: Mar 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by grail
Did you test it? The array 'list' has indexes exactly equal to each value in the pattern file. As neither 5325 nor 3259 is equal to 325, those entries will not match field four of that line.
This actually worked with a minor modification to the pattern list. Thanks a lot, this saved me so much time: instead of a 2-hour run, it finished in about 20 seconds!

Alex
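
The thread does not record what the modification to the pattern list was. Purely as an assumption, if the list may contain either bare IDs (`325`) or comma-wrapped ones (`,325,`), a variant along the following lines strips the commas before storing each key, so the list file would not need editing. The file names and sample rows here are hypothetical.

```shell
tmp=$(mktemp -d)
# Mixed pattern list: some entries wrapped in commas, some bare (hypothetical).
printf '%s\n' ',4589,' '325' ',4678,' > "$tmp/pattern"
printf '%s\n' \
    'a,b,c,224,x' \
    'a,b,c,325,x' \
    'a,b,c,3259,x' \
    'a,b,c,4678,x' > "$tmp/input.csv"

result=$(awk -F, '
    FNR == NR {
        key = $0
        gsub(/,/, "", key)          # drop any commas around the ID
        if (key != "") list[key]    # skip blank lines
        next
    }
    $4 in list
' "$tmp/pattern" "$tmp/input.csv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Appending `> results.csv` to the awk command would write the matches to a file, as the original loop did.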
 
  

