[SOLVED] Any grep, sed or awk gurus with regex familiarity? I need some help.

bcrawl · 01-19-2011, 07:37 AM

Hi guys,

I have an xml file. The file looks like this,

Code:

<manufacturer_data> 
<action>INSERT</action> 
<mfr_id>100700</mfr_id> 
<local_content>0</local_content> 
<name>Tyson Foods, Inc./Dinner Meats</name> 
</manufacturer_data>

It's a huge file (about 1 GB), so I cannot even open it in a text editor. I need to extract the mfr_ids from the file to do some analysis and comparisons.

I am pretty new to text parsing and extraction, so I would like to know which tool to use and how to extract this information.

Right now if I use

Code:

cat infile | grep mfr_id

it prints the complete lines, like so:

Code:

<mfr_id>10</mfr_id> 
<mfr_id>100280</mfr_id> 
<mfr_id>100378</mfr_id> 
<mfr_id>100403</mfr_id> 
<mfr_id>100699</mfr_id> 
<mfr_id>100700</mfr_id> 
<mfr_id>100761</mfr_id> 
<mfr_id>100902</mfr_id> 
<mfr_id>101383</mfr_id> 
<mfr_id>101414</mfr_id> 
<mfr_id>1016</mfr_id>

I want just the numbers. How would I do that?

Also, I want to know how to extract something similar based on a condition, for example the <name> that belongs to a given <mfr_id>, such as

Code:

cat infile | grep <name> WHERE <mfr_id>=xyz

Something like that. I know that's an abomination of grep syntax, but I didn't know how else to explain it, sorry.

Any help would be greatly appreciated, Thanks guys.

EricTRA · 01-19-2011, 07:59 AM

Hello and Welcome to LinuxQuestions.org,

There are some real grep, sed and awk gurus here at LQ but I'm not one of them. Nevertheless I'm going to take a shot at your problem.

Code:

grep mfr_id yourfile | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g' > result

The above will get you all the numbers in a separate file called result. If you just want them printed on screen delete the > result part.
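
As a quick check of the pipeline above, here is a minimal sketch run against a two-record sample. The temporary file and the second record ("Example Foods", id 1016) are made up for illustration; only the first record comes from the original post.

```shell
# Build a small two-record sample; the second record is invented for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson Foods, Inc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>1016</mfr_id>
<local_content>0</local_content>
<name>Example Foods</name>
</manufacturer_data>
EOF

# Keep only the mfr_id lines, then delete the opening and closing tags.
ids=$(grep mfr_id "$sample" | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g')
printf '%s\n' "$ids"   # prints 100700 and 1016, one per line
rm -f "$sample"
```

grep reads the file line by line, so the 1 GB size should not be a problem here; nothing is loaded into memory as a whole.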

The same principle could be used for your 'abomination of grep'

Code:

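grep -A2 '<mfr_id>100700</mfr_id>' yourfile | grep '<name>' | sed -e 's/<name>//g' -e 's/<\/name>//g' > outputfile

This finds the line holding the mfr_id you are after (100700 here) together with the two lines that follow it, keeps only the <name> line, then strips the tags and writes the result to a file. It relies on <name> sitting two lines below <mfr_id>, as in your example.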

Have a look at these pretty good tutorials:
Sed
Awk

With these two tools you can perform miracles.

Kind regards,

Eric

sycamorex · 01-19-2011, 08:14 AM

I'm not a sed guru either, but try the following:

Code:

sed -n '/mfr_id/ s/<\/*mfr_id>//gp' infile > output

EricTRA · 01-19-2011, 08:17 AM

Quote:

Originally Posted by sycamorex

I'm not a sed guru either, but try the following:

Code:

sed -n '/mfr_id/ s/<\/*mfr_id>//gp' infile > output

Hi sycamorex,

Great! I'm still in the learning process. Your solution is not only shorter but also uses just one tool.

Kind regards,

Eric

syg00 · 01-19-2011, 08:26 AM

Perhaps a little more generic

Code:

sed -nr '/mfr_id/ s:[^[:digit:]]*([[:digit:]]+).*:\1:p' infile > output
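
A small sketch of what "more generic" buys here: the expression skips everything up to the first run of digits, keeps that run, and drops the rest of the line, so indentation or trailing spaces do not matter. The two input lines are made up, and -E is used in place of -r on the assumption that both select extended regular expressions in GNU sed (-E is also accepted by BSD sed).

```shell
# Two made-up input lines: one to skip, one indented line holding the id.
out=$(printf '%s\n' '<action>INSERT</action>' '   <mfr_id>1016</mfr_id> ' |
  sed -nE '/mfr_id/ s:[^[:digit:]]*([[:digit:]]+).*:\1:p')
printf '%s\n' "$out"   # prints 1016
```

Note that only the first number on the line is kept, which is fine as long as each mfr_id line carries a single id.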

druuna · 01-19-2011, 08:28 AM

Hi,

Question number one, assuming your example is accurate:

Code:

awk -F"[<>]" '/mfr_id/ { print $3 }' infile > outfile

Hope this helps.

sycamorex · 01-19-2011, 08:29 AM

Actually, it can be done even shorter:

Code:

sed -n '/<\/*mfr_id>/ s///gp' infile > output
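
For anyone wondering why the empty s/// works: in sed an empty regular expression stands for the last regular expression that was used, which here is the one in the address. A minimal sketch on two made-up input lines:

```shell
# The address /<\/*mfr_id>/ selects the line; the empty pattern in s///
# reuses that same regex, so both the opening and closing tag are removed.
out=$(printf '%s\n' '<action>INSERT</action>' '<mfr_id>100700</mfr_id>' |
  sed -n '/<\/*mfr_id>/ s///gp')
printf '%s\n' "$out"   # prints 100700
```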

schneidz · 01-19-2011, 08:33 AM

this works:

Code:

[schneidz@hyper temp]$ cat mfr.txt
<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id>  
[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt
10
100280
100378
100403
100699
100700
100761
100902
101383
101414
1016

druuna · 01-19-2011, 08:35 AM

Hi,

Question number two, again assuming your example is accurate:

Code:

awk ' BEGIN { RS="<manufacturer_data>" ; FS="\n" } { if ( $5 ~ /Tyson/) { gsub(/<[\/]*mfr_id>/,"",$3) ; print $3} }' infile

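The /Tyson/ pattern is where the name you are searching for goes; the command prints the mfr_id of each record whose name matches.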

Hope this helps.
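
The command above goes from name to mfr_id. For the opposite lookup asked about in the first post (the name for a given mfr_id), one possible line-by-line sketch is shown below. It assumes <mfr_id> always comes before <name> inside a record, as in the example; the variable names id and hit, the temporary file, and the second record are made up for illustration.

```shell
# Two-record sample; the second record is invented for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson Foods, Inc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>1016</mfr_id>
<local_content>0</local_content>
<name>Example Foods</name>
</manufacturer_data>
EOF

# Split each line on < and >, so $2 is the tag name and $3 its content.
name=$(awk -F'[<>]' -v id=100700 '
  $2 == "mfr_id" { hit = ($3 == id) }        # does this record have the wanted id?
  hit && $2 == "name" { print $3; hit = 0 }  # if so, print the name that follows
' "$sample")
printf '%s\n' "$name"   # prints Tyson Foods, Inc./Dinner Meats
rm -f "$sample"
```

Comparing with == rather than a regex match avoids id 10 also matching 100280.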

sycamorex · 01-19-2011, 08:35 AM

Quote:

Originally Posted by schneidz

this works:

Code:

[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt

The problem is that the OP wants to extract the numbers ONLY from lines containing the pattern mfr_id.
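
A quick sketch of the difference, using the single record from the first post: without a pattern, awk prints the third field of every line, not just the ids. The variable names are made up for the demo.

```shell
record='<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson Foods, Inc./Dinner Meats</name>
</manufacturer_data>'

# No pattern: one output line per input line (INSERT, 100700, 0, the name, ...).
all=$(printf '%s\n' "$record" | awk -F '[><]' '{ print $3 }')

# With a pattern: only the mfr_id line is printed.
only=$(printf '%s\n' "$record" | awk -F '[><]' '/<mfr_id>/ { print $3 }')

printf '%s\n' "$only"   # prints 100700
```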

druuna · 01-19-2011, 08:36 AM

Quote:

Originally Posted by schneidz

this works:

Code:

[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt

Yes, but not on the original input......

grail · 01-19-2011, 08:38 AM

Maybe something along the lines of:

Code:

awk -v mfr_id=100700 'z{print;x=y=z=0}x{y=($0==mfr_id);x=0}/^mfr_id$/{x=1}y && /^name$/{z=1}' RS="[<>]" file

For the initial simple case it would be:

Code:

awk 'x{print;x=0}/^mfr_id$/{x=1}' RS="[<>]" file

grail · 01-19-2011, 08:50 AM

Based on sycamorex's sed:

Code:

sed -rn 's@</?mfr_id>@@gp' file

sycamorex · 01-19-2011, 08:52 AM

Quote:

Originally Posted by grail

Based on sycamorex's sed:

Code:

sed -rn 's@</?mfr_id>@@gp' file

... and I thought my sed would be the shortest one, LOL
Nice one!

bcrawl · 01-19-2011, 08:57 AM

Oh wow, lots of answers. Please give me some time to go through these. Thanks a lot for the help so far. This is a wonderful showcase of the practical usage of these tools for me.