01-19-2011, 07:37 AM
#1
LQ Newbie
Registered: Jan 2011
Posts: 11
Any grep, sed or awk gurus with regex familiarity? I need some help.
Hi guys,
I have an xml file. The file looks like this,
Code:
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson Foods, Inc./Dinner Meats</name>
</manufacturer_data>
It's a huge file [1GB], so I cannot even open it in a text editor. I need to extract the mfr_ids from the file to do some analysis and comparisons.
I am pretty new to text parsing and extraction. I want to know which tool to use and how to extract this info.
Right now, if I use
Code:
cat infile | grep mfr_id
it prints the complete rows, like so:
Code:
<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id>
I want just the numbers. How would I do that?
Also, I want to know how to extract something similar based on a condition, such as the <name> for a given <mfr_id>, for example:
Code:
cat infile | grep <name> WHERE <mfr_id> = xyz
Something like that. I know that's an abomination of grep syntax, but I didn't know how else to explain it, sorry.
Any help would be greatly appreciated, Thanks guys.
01-19-2011, 07:59 AM
#2
LQ Guru
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Hello and Welcome to LinuxQuestions.org,
There are some real grep, sed and awk gurus here at LQ but I'm not one of them. Nevertheless I'm going to take a shot at your problem.
Code:
grep mfr_id <yourfile> | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g' > result
The above will get you all the numbers in a separate file called result. If you just want them printed on screen, delete the > result part.
The same principle could be used for your 'abomination of grep'
Code:
grep mfr_id <yourfile> | grep name | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g' > outputfile
This gets all the lines that contain mfr_id, keeps only those that also contain name, then strips the tags and writes the result to a file.
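If the file really has one tag per line, as in the sample from the first post, a line containing mfr_id never also contains name, so filtering the same line twice would come back empty. A minimal sketch of one way around that, assuming the <name> line always sits two lines below the <mfr_id> line: the temporary file, the second record and its "Example Foods" name are made up for illustration, and -A (print trailing context lines) is a GNU/BSD grep option rather than POSIX.

```shell
# Build a small stand-in for the real file (assumed layout: one tag per line).
# The second record is invented purely for this illustration.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson Foods, Inc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100761</mfr_id>
<local_content>0</local_content>
<name>Example Foods</name>
</manufacturer_data>
EOF

# Find the exact mfr_id line, keep the two lines after it (-A2),
# pick the <name> line out of those, then strip the tags.
name=$(grep -A2 '<mfr_id>100700</mfr_id>' "$tmp" \
  | grep '<name>' \
  | sed -e 's/<name>//' -e 's/<\/name>//')

echo "$name"
rm -f "$tmp"
```

Matching the full <mfr_id>100700</mfr_id> string rather than just the number keeps an id such as 1007001 from matching by accident.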
Have a look at these pretty good tutorials:
Sed
Awk
With these two tools you can perform miracles.
Kind regards,
Eric
01-19-2011, 08:14 AM
#3
LQ Veteran
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
I'm not a sed guru either, but try the following:
Code:
sed -n '/mfr_id/ s/<\/*mfr_id>//gp' infile > output
01-19-2011, 08:17 AM
#4
LQ Guru
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Quote:
Originally Posted by
sycamorex
I'm not a sed guru either, but try the following:
Code:
sed -n '/mfr_id/ s/<\/*mfr_id>//gp' infile > output
Hi sycamorex,
Great! I'm still in the learning process. Your solution is not only shorter but also uses just one tool.
Kind regards,
Eric
01-19-2011, 08:26 AM
#5
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140
Perhaps a little more generic:
Code:
sed -nr '/mfr_id/ s:[^[:digit:]]*([[:digit:]]+).*:\1:p' infile > output
Last edited by syg00; 01-19-2011 at 08:28 AM .
Reason: typos
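To see what the digit-based expression keeps, here is a small sketch that runs it on single lines. It assumes a sed that accepts -E (the more portable spelling of -r) and uses # as the delimiter instead of : only to stay clear of the colons inside [[:digit:]]; the line with an attribute is a made-up example, not something from the original file.

```shell
# Same expression as above, run on one sample line.
line='<mfr_id>100700</mfr_id>'
id=$(printf '%s\n' "$line" \
  | sed -nE '/mfr_id/ s#[^[:digit:]]*([[:digit:]]+).*#\1#p')
echo "$id"

# Caveat: only the FIRST run of digits on the line survives, so a
# (hypothetical) attribute containing a digit would win over the id.
first=$(printf '%s\n' '<mfr_id a="1">100700</mfr_id>' \
  | sed -nE '/mfr_id/ s#[^[:digit:]]*([[:digit:]]+).*#\1#p')
echo "$first"
```

The expression does not care what the tag is called, which is what makes it more generic, but it also means any earlier digits on a matching line are taken instead of the id.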
01-19-2011, 08:28 AM
#6
LQ Veteran
Registered: Sep 2003
Posts: 10,532
Hi,
Question number one, assuming your example is accurate:
Code:
awk -F"[<>]" '/mfr_id/ { print $3 }' infile > outfile
Hope this helps.
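Why $3? A quick sketch, assuming a line shaped exactly like the sample: with the field separator set to the bracket expression [<>], every < or > starts a new field, so the text before the first < becomes an empty first field and the number lands in the third.

```shell
line='<mfr_id>100700</mfr_id>'

# Print every field with its index to show how the line is split.
fields=$(printf '%s\n' "$line" \
  | awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
echo "$fields"

# The same split, keeping only the third field on mfr_id lines.
id=$(printf '%s\n' "$line" | awk -F'[<>]' '/mfr_id/ { print $3 }')
echo "$id"
```

If a line carries leading text or more than one tag, the number moves to a different field, so this relies on the one-tag-per-line layout.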
01-19-2011, 08:29 AM
#7
LQ Veteran
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Actually, it can be done even shorter:
Code:
sed -n '/<\/*mfr_id>/ s///gp' infile > output
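The shortening works because an empty regular expression in s/// reuses the last regular expression that was applied, here the one in the address. A small sketch on three made-up input lines:

```shell
# The address /<\/*mfr_id>/ selects lines with an opening or closing tag
# (\/* means "zero or more slashes"); the empty pattern in s///gp then
# reuses that same expression and deletes every match on the line.
ids=$(printf '%s\n' \
    '<action>INSERT</action>' \
    '<mfr_id>100700</mfr_id>' \
    '<mfr_id>1016</mfr_id>' \
  | sed -n '/<\/*mfr_id>/ s///gp')
echo "$ids"
```

Lines without an mfr_id tag never reach the substitution, and -n with the p flag makes sure only the changed lines are printed.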
01-19-2011, 08:33 AM
#8
LQ Guru
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313
this works:
Code:
[schneidz@hyper temp]$ cat mfr.txt
<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id>
[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt
10
100280
100378
100403
100699
100700
100761
100902
101383
101414
1016
01-19-2011, 08:35 AM
#9
LQ Veteran
Registered: Sep 2003
Posts: 10,532
Hi,
Question number two, again assuming your example is accurate:
Code:
awk ' BEGIN { RS="<manufacturer_data>" ; FS="\n" } { if ( $5 ~ /Tyson /) { gsub(/<[\/]*mfr_id>/,"",$3) ; print $3} }' infile
The "Tyson " part of the pattern is the name you are searching for.
Hope this helps.
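For the lookup in the other direction (the name that belongs to a given mfr_id), here is a line-oriented sketch. It is a variant, not the command above: it avoids a multi-character RS, which POSIX leaves unspecified (gawk and mawk accept it), and it assumes one tag per line with <name> appearing after <mfr_id> inside each record. The temporary file, the id variable and the second record are made up for illustration.

```shell
# Stand-in sample; the layout (one tag per line) is assumed from the first post.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson Foods, Inc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100761</mfr_id>
<local_content>0</local_content>
<name>Example Foods</name>
</manufacturer_data>
EOF

# Split each line on < or >, so $2 is the tag name and $3 its text.
# Remember whether the most recent mfr_id equals the wanted id,
# then print the first name that follows it.
name=$(awk -F'[<>]' -v id=100700 '
  $2 == "mfr_id"      { hit = ($3 == id) }
  $2 == "name" && hit { print $3; hit = 0 }
' "$tmp")

echo "$name"
rm -f "$tmp"
```

This reads the file once, line by line, so the 1GB size should not be a problem; XML entities such as &amp; in a name are printed as they appear in the file.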
01-19-2011, 08:35 AM
#10
LQ Veteran
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Quote:
Originally Posted by
schneidz
this works:
Code:
[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt
The problem is that the OP wants to extract the numbers ONLY from lines containing the pattern mfr_id.
01-19-2011, 08:36 AM
#11
LQ Veteran
Registered: Sep 2003
Posts: 10,532
Quote:
Originally Posted by
schneidz
this works:
Code:
[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt
Yes, but not on the original input......
01-19-2011, 08:38 AM
#12
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008
Maybe something along the lines of:
Code:
awk -vmfr_id=100700 'z{print;x=y=z=0}/^mfr_id$/{x=1}x && $0 ~ mfr_id{y=1}y && /^name$/{z=1}' RS="[<>]" file
For the initial simple case it would be:
Code:
awk 'x{print;x=0}/mfr_id/{x=1}' RS="[<>]" file
01-19-2011, 08:50 AM
#13
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008
Based on sycamorex's sed:
Code:
sed -rn 's@</?mfr_id>@@gp' file
01-19-2011, 08:52 AM
#14
LQ Veteran
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Quote:
Originally Posted by
grail
Based on sycamorex's sed:
Code:
sed -rn 's@</?mfr_id>@@gp' file
... and I thought my sed would be the shortest one, LOL
Nice one!
01-19-2011, 08:57 AM
#15
LQ Newbie
Registered: Jan 2011
Posts: 11
Original Poster
Oh wow, lots of answers. Please give me some time to go through these. Thanks a lot for the help so far. This is a wonderful showcase of the practical use of these tools for me.
All times are GMT -5.