LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Old 01-19-2011, 07:37 AM   #1
bcrawl
LQ Newbie
 
Registered: Jan 2011
Posts: 11

Rep: Reputation: 0
Any grep, sed or awk gurus with regex familiarity? I need some help.


Hi guys,

I have an xml file. The file looks like this,
Code:
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>
It's a huge file [1GB], so I cannot even open it in a text editor. I need to extract the mfr_ids from the file to do some analysis and comparisons.

I am pretty new to text parsing and extraction. I want to know which tool to use and how to extract this info.

Right now, if I use
Code:
cat infile | grep mfr_id
it prints the complete rows, like so:
Code:
<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id>
I want just the numbers. How would I do that?

Also, I want to know how to extract something similar based on a condition, such as the <name> for a given <mfr_id>, something like
Code:
cat infile | grep <name> WHERE <mfr_id>=xyz
or something like that. I know that's an abomination of grep syntax, but I didn't know how else to explain it. Sorry.


Any help would be greatly appreciated, Thanks guys.
 
Old 01-19-2011, 07:59 AM   #2
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1297
Hello and Welcome to LinuxQuestions.org,

There are some real grep, sed and awk gurus here at LQ but I'm not one of them. Nevertheless I'm going to take a shot at your problem.
Code:
grep mfr_id <yourfile> | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g' > result
The above will get you all the numbers in a separate file called result. If you just want them printed on screen delete the > result part.
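
To try this without touching the 1 GB file, here is a rough sketch on a tiny sample. The temp file and the second record (Example Foods) are made up for illustration; only the tag layout comes from the question.

```shell
# Hypothetical two-record sample in the same shape as the question's XML.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>1016</mfr_id>
<local_content>0</local_content>
<name>Example Foods</name>
</manufacturer_data>
EOF

# Keep only the mfr_id lines, then delete the opening and closing tags.
ids=$(grep mfr_id "$sample" | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g')
printf '%s\n' "$ids"

rm -f "$sample"
```

Both tools read the input line by line, so the size of the real file should mostly affect run time rather than memory.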

The same principle could be used for your 'abomination of grep'
Code:
grep mfr_id <yourfile> | grep name | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g' > outputfile
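This selects all the lines that contain mfr_id, then keeps only those that also contain name, strips the mfr_id tags and writes the result to a file. It only finds something when both tags are on the same line; with one tag per line, as in your sample, the second grep matches nothing.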
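
When every tag sits on its own line, as in the sample from the question, one possible alternative is to let grep print the lines that follow the matching mfr_id and pick the name out of those. This is only a sketch: it assumes <name> comes exactly two lines after <mfr_id>, and it relies on the -A option, which GNU and BSD grep have but POSIX does not require. The sample file and its second record are made up.

```shell
# Hypothetical two-record sample; only the tag layout is taken from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>1016</mfr_id>
<local_content>0</local_content>
<name>Example Foods</name>
</manufacturer_data>
EOF

# -A 2 prints the matching line plus the two lines after it.
# The second grep keeps only the <name> line; sed then strips the tags.
name=$(grep -A 2 '<mfr_id>100700</mfr_id>' "$sample" | grep '<name>' |
  sed -e 's/<name>//' -e 's/<\/name>//')
printf '%s\n' "$name"

rm -f "$sample"
```

Matching the whole element, tags included, keeps an id like 1007001 from being picked up by accident.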

Have a look at these pretty good tutorials:
Sed
Awk

With these two tools you can perform miracles.

Kind regards,

Eric
 
Old 01-19-2011, 08:14 AM   #3
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
I'm not a sed guru either, but try the following:

Code:
sed -n '/mfr_id/ s/<\/*mfr_id>//gp' infile > output
 
Old 01-19-2011, 08:17 AM   #4
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1297
Quote:
Originally Posted by sycamorex
I'm not a sed guru either, but try the following:

Code:
sed -n '/mfr_id/ s/<\/*mfr_id>//gp' infile > output
Hi sycamorex,

Great! I'm still in the learning process; your solution is not only shorter but also uses just one tool.

Kind regards,

Eric
 
Old 01-19-2011, 08:26 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4123
Perhaps a little more generic
Code:
sed -nr '/mfr_id/ s:[^[:digit:]]*([[:digit:]]+).*:\1:p' infile > output
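
A small sketch of what "more generic" buys here: the expression no longer cares about the exact tag text or about indentation, it just keeps the first run of digits on any line that mentions mfr_id. The three input lines are invented for the demo, and -E is used in place of -r because both GNU and BSD sed accept it.

```shell
# Made-up input: an indented mfr_id line, a name containing digits, a plain mfr_id line.
out=$(printf '%s\n' \
  '  <mfr_id>100280</mfr_id>' \
  '<name>Route 66 Diner</name>' \
  '<mfr_id>1016</mfr_id>' |
  sed -nE '/mfr_id/ s:[^[:digit:]]*([[:digit:]]+).*:\1:p')
printf '%s\n' "$out"
```

The name line is skipped because of the /mfr_id/ address, even though it contains digits. One caveat: a tag name that itself contained digits would confuse the capture.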

Last edited by syg00; 01-19-2011 at 08:28 AM. Reason: typos
 
Old 01-19-2011, 08:28 AM   #6
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
Hi,

Question number one, assuming your example is accurate:
Code:
awk -F"[<>]" '/mfr_id/ { print $3 }' infile > outfile
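
To see why it is $3, it can help to dump the fields awk produces when both < and > act as separators. A minimal sketch on one line copied from the question:

```shell
line='<mfr_id>100700</mfr_id>'

# Print every field with its index; the line starts with "<", so field 1 is empty.
fields=$(printf '%s\n' "$line" |
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"

# The value therefore sits in field 3.
id=$(printf '%s\n' "$line" | awk -F'[<>]' '{ print $3 }')
printf '%s\n' "$id"
```

Field 2 is the tag name and field 3 the text between the tags, which is what the one-liner prints.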

Hope this helps.
 
Old 01-19-2011, 08:29 AM   #7
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Actually, it can be done even shorter:

Code:
sed -n '/<\/*mfr_id>/ s///gp' infile > output
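
The trick is that an empty regular expression in s/// reuses the last regular expression that was applied, which here is the one in the address. A quick sketch comparing the two forms on three made-up input lines:

```shell
input=$(printf '%s\n' \
  '<action>INSERT</action>' \
  '<mfr_id>100700</mfr_id>' \
  '<mfr_id>1016</mfr_id>')

# Long form: the tag pattern is written out inside s///.
long_form=$(printf '%s\n' "$input" | sed -n '/mfr_id/ s/<\/*mfr_id>//gp')

# Short form: the empty pattern in s/// falls back to the address regex.
short_form=$(printf '%s\n' "$input" | sed -n '/<\/*mfr_id>/ s///gp')

printf '%s\n' "$short_form"
```

Both forms should print the same two numbers for this input.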
 
Old 01-19-2011, 08:33 AM   #8
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
this works:
Code:
[schneidz@hyper temp]$ cat mfr.txt
<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id>  
[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt
10
100280
100378
100403
100699
100700
100761
100902
101383
101414
1016
 
Old 01-19-2011, 08:35 AM   #9
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
Hi,

Question number two, again assuming your example is accurate:
Code:
awk ' BEGIN { RS="<manufacturer_data>" ; FS="\n" } { if ( $5 ~ /Tyson/) { gsub(/<[\/]*mfr_id>/,"",$3) ; print $3} }' infile
The /Tyson/ pattern is the part to replace with your name.
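
For the lookup in the other direction (the name that belongs to a given mfr_id, which is how the question was phrased), a line-by-line variant might look like the sketch below. It assumes one tag per line and that <mfr_id> comes before <name> inside each record; the variable names id and hit, the sample file and its second record are all made up.

```shell
# Hypothetical two-record sample; only the tag layout is taken from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>1016</mfr_id>
<local_content>0</local_content>
<name>Example Foods</name>
</manufacturer_data>
EOF

# With < and > as separators, $2 is the tag name and $3 the text between the tags.
# Remember whether the current record has the wanted id, then print its name.
name=$(awk -F'[<>]' -v id=100700 '
  $2 == "mfr_id"      { hit = ($3 == id) }
  $2 == "name" && hit { print $3; hit = 0 }
' "$sample")
printf '%s\n' "$name"

rm -f "$sample"
```

This avoids a multi-character RS, so it should behave the same in gawk, mawk and other awks.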

Hope this helps.
 
Old 01-19-2011, 08:35 AM   #10
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by schneidz
this works:
Code:
[schneidz@hyper temp]$ cat mfr.txt
<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id>  
[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt
10
100280
100378
100403
100699
100700
100761
100902
101383
101414
1016

The problem is that the OP wants to extract the numbers ONLY from lines containing the pattern mfr_id.
 
Old 01-19-2011, 08:36 AM   #11
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
Quote:
Originally Posted by schneidz
this works:
Code:
[schneidz@hyper temp]$ cat mfr.txt
<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id>  
[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt
10
100280
100378
100403
100699
100700
100761
100902
101383
101414
1016
Yes, but not on the original input......
 
Old 01-19-2011, 08:38 AM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008

Rep: Reputation: 3193
Maybe something along the lines of:
Code:
awk -vmfr_id=100700 'z{print;x=y=z=0}/^mfr_id$/{x=1}x && $0 ~ mfr_id{y=1}y && /^name$/{z=1}' RS="[<>]" file
For the initial simple case it would be:
Code:
awk 'x{print;x=0}/^mfr_id$/{x=1}' RS="[<>]" file
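
A regular expression as RS is an extension (gawk and mawk have it, POSIX only promises a single character). If that is a concern, roughly the same idea can be sketched by letting tr do the splitting first. The sample file and its second record are made up; the pattern is anchored so that the closing tag /mfr_id does not set the flag as well.

```shell
# Hypothetical two-record sample; only the tag layout is taken from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>1016</mfr_id>
<local_content>0</local_content>
<name>Example Foods</name>
</manufacturer_data>
EOF

# Turn every < and > into a newline so tag names and values land on their own lines,
# then print whatever line follows a line that is exactly "mfr_id".
ids=$(tr '<>' '\n\n' < "$sample" | awk 'x { print; x = 0 } /^mfr_id$/ { x = 1 }')
printf '%s\n' "$ids"

rm -f "$sample"
```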
 
Old 01-19-2011, 08:50 AM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008

Rep: Reputation: 3193
Based on sycamorex's sed:
Code:
sed -rn 's@</?mfr_id>@@gp' file
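
A note on the flags, with a tiny sketch: -r is the GNU sed spelling for extended regular expressions, and -E does the same job in both GNU and BSD sed. The ? makes the slash optional, so one expression covers the opening and the closing tag. The two input lines are made up.

```shell
ids=$(printf '%s\n' '<name>x</name>' '<mfr_id>10</mfr_id>' |
  sed -En 's@</?mfr_id>@@gp')
printf '%s\n' "$ids"
```

Lines without an mfr_id tag are never printed, because -n suppresses the default output and p only fires after a successful substitution.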
 
Old 01-19-2011, 08:52 AM   #14
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by grail
Based on sycamorex's sed:
Code:
sed -rn 's@</?mfr_id>@@gp' file
... and I thought my sed would be the shortest one, LOL
Nice one!
 
Old 01-19-2011, 08:57 AM   #15
bcrawl
LQ Newbie
 
Registered: Jan 2011
Posts: 11

Original Poster
Rep: Reputation: 0
Oh wow, lots of answers. Please give me some time to go through these. Thanks a lot for the help so far. This is a wonderful showcase of the practical usage of these tools for me.
 
  

