Programming: This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
03-21-2012, 01:00 PM | #1
LQ Newbie
Registered: Mar 2012
Posts: 8
AWK: parsing a CSV file with a separate list file
I am trying to find a way to parse a CSV file, pulling out lines where the fourth field matches any value in a list file. I know you can use something like "fgrep -f list input.csv", and that will pull out the lines matching any instance from the list, but in my particular case I specifically need to match only field four. What I am currently doing is using a loop to cut out the fourth field, passing it through grep, and then printing the line to a file if it matches. I think there is an easier way to do it in awk or maybe Perl. Performance is also crucial here, since the source file can have over 100K lines and the pattern list can have about 1000 lines.
So my code is:
Code:
echo -e "Do Run1"
while read LINE
do
    CID=`echo -e $LINE | cut -d"," -f4`
    echo -e ",${CID}," | /bin/grep -f CIDIDs.txt
    LASTRet=$?
    if [ ${LASTRet} -eq 0 ]; then
        echo -e ${LINE} >> results.csv
    fi
done < input.csv
As you can see, this can take forever on a large file.
Thanks!
03-21-2012, 01:16 PM | #2
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,037
Please show the format and some data for the 2 files?
Would you also please explain the concept behind the following line:
Code:
echo -e ",${CID}," | /bin/grep -f CIDIDs.txt
03-21-2012, 01:30 PM | #3
LQ Newbie (Original Poster)
Registered: Mar 2012
Posts: 8
Quote:
Originally Posted by grail
Please show the format and some data for the 2 files?
Would you also please explain the concept behind the following line:
Code:
echo -e ",${CID}," | /bin/grep -f CIDIDs.txt
Let's say the input CSV file is composed of the following data:
Code:
20120315,152638,0010000119,224,UT01,foobar,NVLS,D,0.00,3000,3000,0,48.4091,,,20120315886
20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
20120315,075103,0020000220,4678,A998,OS,JYF,XX,32.5,2000,0,0,,ALGO,xas,159873-1312-42,WM
And the pattern list has the following:
Code:
4589
1455
2236
325
4678
Basically I just want the output to be:
Code:
20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
20120315,075103,0020000220,4678,A998,OS,JYF,XX,32.5,2000,0,0,,ALGO,xas,159873-1312-42,WM
In my real-world example, due to some preprocessing, the pattern list has the numbers with commas around them, like:
Code:
,4589,
,1455,
,2236,
,325,
,4678,
So the commas just get included in the comparison, which makes the matches exact.
03-21-2012, 02:08 PM | #4
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,037
Well, using the real-world example, I would do something like:
Code:
awk -F, 'FNR==NR{list[$2];next}$4 in list' pattern input.csv
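A quick way to try this out with the sample data posted above (the file names `pattern` and `input.csv` here are just illustrative):

```shell
# Recreate the sample files from this thread (names are illustrative).
cat > pattern <<'EOF'
,4589,
,1455,
,2236,
,325,
,4678,
EOF
cat > input.csv <<'EOF'
20120315,152638,0010000119,224,UT01,foobar,NVLS,D,0.00,3000,3000,0,48.4091,,,20120315886
20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
20120315,075103,0020000220,4678,A998,OS,JYF,XX,32.5,2000,0,0,,ALGO,xas,159873-1312-42,WM
EOF

# While reading the first file (FNR==NR), store field 2 -- the value
# between the two commas -- as an array key, then skip to the next line.
# For input.csv, the bare '$4 in list' pattern prints every line whose
# fourth field is exactly one of those keys.
awk -F, 'FNR==NR{list[$2];next} $4 in list' pattern input.csv
```

This reads each file once and does a hash lookup per line, so it should scale roughly linearly with 100K input lines plus 1000 patterns, unlike the per-line grep loop.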
03-21-2012, 02:32 PM | #5
LQ Newbie (Original Poster)
Registered: Mar 2012
Posts: 8
Quote:
Originally Posted by grail
Well, using the real-world example, I would do something like:
Code:
awk -F, 'FNR==NR{list[$2];next}$4 in list' pattern input.csv
That doesn't exactly work. Let's take the example of
20120315,102707,0015000000,325,ESMT,,NWSA,X,20.15,3000,3000,0,20.1200,,,,
When you compare it to the pattern list, it will match patterns like:
5325
3259
In my real-world pattern list I actually have the commas in place, because that limits the matching to exactly the pattern. For example, the previous line should only match a pattern of
,325,
and not
,3256,
Is there a way to do that in the one-liner? To actually include the commas around field four and have it evaluate as ,325, and not just 325?
03-21-2012, 02:43 PM | #6
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,037
Did you test it? The array 'list' has indexes equal to exactly each value in the pattern; therefore, as neither 5325 nor 3259 is equal to 325, it will not be in the array and hence not printed.
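A quick way to convince yourself (the file names here are hypothetical, just for the demonstration): build a list containing only 5325 and 3259, then check a row whose fourth field is 325.

```shell
# Hypothetical throwaway files for the demonstration.
printf '5325\n3259\n' > keys.txt
printf '20120315,102707,0015000000,325,ESMT\n' > row.csv

# Array membership ('$4 in list') is an exact string comparison, not a
# substring or regex match, so neither key 5325 nor 3259 matches 325
# and nothing is printed.
awk -F, 'FNR==NR{list[$1];next} $4 in list' keys.txt row.csv
```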
03-21-2012, 03:06 PM | #7
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Indeed. grail's approach isn't using pattern matching. :}
03-22-2012, 08:51 AM | #8
LQ Newbie (Original Poster)
Registered: Mar 2012
Posts: 8
Quote:
Originally Posted by grail
Did you test it? The array 'list' has indexes equal to exactly each value in the pattern; therefore, as neither 5325 nor 3259 is equal to 325, it will not be in the array and hence not printed.
This actually worked with a minor modification to the pattern list. Thanks a lot, you totally saved me so much time: instead of a 2-hour run, this finished in about 20 seconds!
Alex 