LinuxQuestions.org


bcrawl 01-19-2011 07:37 AM

Any grep, sed or awk gurus with regex familiarity? I need some help.
 
Hi guys,

I have an xml file. The file looks like this,
Code:

<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>

It's a huge file (1 GB), so I cannot even open it in a text editor. I need to extract the mfr_ids from the file to do some analysis and comparisons.

I am new to text parsing and extraction, so I would like to know which tool to use and how to extract this info.

Right now, if I use
Code:

cat infile | grep mfr_id

it prints the complete rows, like so:
Code:

<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id>

I want just the numbers. How would I do that?

Also, I want to know how to extract something similar based on a condition, such as the <name> for a given <mfr_id>, for example
Code:

cat infile | grep <name> WHERE <mfr_id>=xyz

Something like that. I know that's an abomination of grep syntax, but I didn't know how else to explain it. Sorry.


Any help would be greatly appreciated, Thanks guys.

EricTRA 01-19-2011 07:59 AM

Hello and Welcome to LinuxQuestions.org,

There are some real grep, sed and awk gurus here at LQ but I'm not one of them. Nevertheless I'm going to take a shot at your problem.
Code:

grep mfr_id <yourfile> | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g' > result
The above will get you all the numbers in a separate file called result. If you just want them printed on screen delete the > result part.

The same principle could be used for your 'abomination of grep'
Code:

grep mfr_id <yourfile> | grep name | sed -e 's/<mfr_id>//g' -e 's/<\/mfr_id>//g' > outputfile
This gets all the lines that have mfr_id in them, then keeps only the ones that also have name in them, then strips the tags and writes the result to a file.
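
One caveat for input shaped like the sample record, where every tag sits on its own line: no line contains both mfr_id and name, so chaining the two greps would likely print nothing. Below is a minimal sketch of one way around that. It assumes <name> always appears within two lines after <mfr_id>, as in the sample, and that grep supports -A (GNU and BSD grep do; it is not in POSIX). The temporary file and the id 100700 are placeholders.

```shell
# Build a small test file shaped like the sample record from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>
EOF

# Chaining two greps finds nothing: no single line holds both tags.
both=$(grep mfr_id "$sample" | grep name || true)

# Take the matching mfr_id line plus the 2 lines after it, keep the
# <name> line, then strip the tags.
name=$(grep -A 2 '<mfr_id>100700</mfr_id>' "$sample" |
       grep '<name>' |
       sed -e 's/<name>//' -e 's/<\/name>//')

echo "chained grep: '$both'"
echo "name: $name"
rm -f "$sample"
```

If the distance between the two tags varies, a fixed -A count will not be reliable and a record-aware tool such as awk is probably the safer choice.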

Have a look at these pretty good tutorials:
Sed
Awk

With these two tools you can perform miracles.

Kind regards,

Eric

sycamorex 01-19-2011 08:14 AM

I'm not a sed guru either, but try the following:

Code:

sed -n '/mfr_id/ s/<\/*mfr_id>//gp' infile > output

EricTRA 01-19-2011 08:17 AM

Quote:

Originally Posted by sycamorex (Post 4230109)
I'm not a sed guru either, but try the following:

Code:

sed -n '/mfr_id/ s/<\/*mfr_id>//gp' infile > output

Hi sycamorex,

Great! I'm still in the learning process. Your solution is not only shorter but also uses a single tool.

Kind regards,

Eric

syg00 01-19-2011 08:26 AM

Perhaps a little more generic
Code:

sed -nr '/mfr_id/ s:[^[:digit:]]*([[:digit:]]+).*:\1:p' infile > output
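
For reference, a small sketch of what that expression does: [^[:digit:]]* consumes everything before the first digit, ([[:digit:]]+) captures the first run of digits, and .* discards the rest, so \1 is all that is left of the line. It is shown here with -E instead of -r. One assumption to keep in mind: it picks up the first digits on the line, so a tag name that itself contains a digit would get in the way.

```shell
# Three lines in the sample's format; the second one is indented.
lines=$(printf '%s\n' \
    '<mfr_id>10</mfr_id>' \
    '   <mfr_id>100280</mfr_id>' \
    '<name>Tyson FoodsInc./Dinner Meats</name>')

# On lines mentioning mfr_id, keep only the first run of digits.
ids=$(printf '%s\n' "$lines" |
      sed -nE '/mfr_id/ s:[^[:digit:]]*([[:digit:]]+).*:\1:p')

printf '%s\n' "$ids"
```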

druuna 01-19-2011 08:28 AM

Hi,

Question number one, assuming your example is accurate:
Code:

awk -F"[<>]" '/mfr_id/ { print $3 }' infile > outfile

Hope this helps.

sycamorex 01-19-2011 08:29 AM

Actually, it can be done even shorter:

Code:

sed -n '/<\/*mfr_id>/ s///gp' infile > output
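
This works because an empty regular expression in s/// makes sed reuse the last regular expression it applied, which here is the address /<\/*mfr_id>/. A quick way to check that, sketched on two lines from the sample:

```shell
# The address selects lines with an opening or closing mfr_id tag;
# the empty pattern in s///gp then reuses that same regex.
short=$(printf '%s\n' '<mfr_id>100700</mfr_id>' '<local_content>0</local_content>' |
        sed -n '/<\/*mfr_id>/ s///gp')

# The same substitution spelled out in full, for comparison.
long=$(printf '%s\n' '<mfr_id>100700</mfr_id>' '<local_content>0</local_content>' |
       sed -n 's/<\/*mfr_id>//gp')

echo "$short"
echo "$long"
```

Both commands should print the bare number and skip the local_content line.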

schneidz 01-19-2011 08:33 AM

this works:
Code:

[schneidz@hyper temp]$ cat mfr.txt
<mfr_id>10</mfr_id>
<mfr_id>100280</mfr_id>
<mfr_id>100378</mfr_id>
<mfr_id>100403</mfr_id>
<mfr_id>100699</mfr_id>
<mfr_id>100700</mfr_id>
<mfr_id>100761</mfr_id>
<mfr_id>100902</mfr_id>
<mfr_id>101383</mfr_id>
<mfr_id>101414</mfr_id>
<mfr_id>1016</mfr_id> 
[schneidz@hyper temp]$ awk -F "[><]" '{print $3}' mfr.txt
10
100280
100378
100403
100699
100700
100761
100902
101383
101414
1016


druuna 01-19-2011 08:35 AM

Hi,

Question number two, again assuming your example is accurate:
Code:

awk ' BEGIN { RS="<manufacturer_data>" ; FS="\n" } { if ( $5 ~ /Tyson/) { gsub(/<[\/]*mfr_id>/,"",$3) ; print $3} }' infile
The Tyson in /Tyson/ is the name you are searching for.
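
The original question also asked for the lookup in the other direction: the <name> belonging to a given <mfr_id>. Here is a minimal line-by-line sketch, assuming every tag sits on its own line and <name> follows <mfr_id> within the same record, as in the sample. The variable names id and hit are made up for illustration, and the second record in the test file is invented purely to show that the match is exact.

```shell
# Test file: the sample record plus one invented record with a similar id.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>1007001</mfr_id>
<local_content>0</local_content>
<name>Some Other Maker</name>
</manufacturer_data>
EOF

# Split each line on < and >, so $2 is the tag name and $3 its value.
name=$(awk -F'[<>]' -v id=100700 '
    $2 == "mfr_id" { hit = ($3 == id) }         # is this the record we want?
    $2 == "name" && hit { print $3; hit = 0 }   # print its name, then reset
' "$sample")

echo "$name"
rm -f "$sample"
```

Because it reads the file one line at a time and keeps only a single flag, this approach should cope with a 1 GB input without loading it into memory.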

Hope this helps.

sycamorex 01-19-2011 08:35 AM

Quote:

Originally Posted by schneidz (Post 4230139)
this works:
Code:

awk -F "[><]" '{print $3}' mfr.txt



The problem is that the OP wants to extract the numbers ONLY from lines containing the pattern mfr_id.
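
To make the difference concrete, here is a small sketch run on the full sample record instead of pre-filtered lines. Without a pattern in front of the action, awk prints the third field of every line; with /mfr_id/ in front, it prints only the ids.

```shell
record='<manufacturer_data>
<action>INSERT</action>
<mfr_id>100700</mfr_id>
<local_content>0</local_content>
<name>Tyson FoodsInc./Dinner Meats</name>
</manufacturer_data>'

# No pattern: the third field of every line is printed.
unfiltered=$(printf '%s\n' "$record" | awk -F '[><]' '{ print $3 }')

# With a pattern: only lines containing mfr_id are considered.
filtered=$(printf '%s\n' "$record" | awk -F '[><]' '/mfr_id/ { print $3 }')

printf 'unfiltered:\n%s\n' "$unfiltered"
printf 'filtered:\n%s\n' "$filtered"
```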

druuna 01-19-2011 08:36 AM

Quote:

Originally Posted by schneidz (Post 4230139)
this works:
Code:

awk -F "[><]" '{print $3}' mfr.txt


Yes, but not on the original input......

grail 01-19-2011 08:38 AM

Maybe something along the lines of:
Code:

awk -vmfr_id=100700 'z{print;x=y=z=0}/^mfr_id$/{x=1}x && $0 ~ mfr_id{y=1}y && /^name$/{z=1}' RS="[<>]" file
For the initial simple case it would be:
Code:

awk 'x{print;x=0}/mfr_id/{x=1}' RS="[<>]" file
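
With RS set to [<>], awk treats every < and > as a record separator, so tag names and values arrive as separate records, and the flag x prints whatever record follows a tag named mfr_id. One thing to watch with this approach: an unanchored /mfr_id/ also matches the closing-tag record /mfr_id, which sets the flag again and prints the newline that follows it. Below is a sketch with the pattern anchored, assuming an awk that accepts a regular expression as RS (gawk and mawk do; POSIX only guarantees a single character).

```shell
sample=$(mktemp)
printf '%s\n' '<mfr_id>10</mfr_id>' '<mfr_id>100280</mfr_id>' > "$sample"

# Records arrive as: "", "mfr_id", "10", "/mfr_id", "\n", "mfr_id", ...
# ^mfr_id$ matches only the opening tag name, so only the value
# right after it is printed.
ids=$(awk 'x { print; x = 0 } /^mfr_id$/ { x = 1 }' RS='[<>]' "$sample")

printf '%s\n' "$ids"
rm -f "$sample"
```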

grail 01-19-2011 08:50 AM

Based on sycamorex's sed:
Code:

sed -rn 's@</?mfr_id>@@gp' file
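
Two details make this one so compact: -r switches sed to extended regular expressions, where /? means an optional slash, and using @ as the delimiter means that slash needs no escaping. Here is a sketch of the same command with -E, which GNU and BSD sed both accept in place of -r:

```shell
# Only the mfr_id line has a substitution made, so only it is printed.
ids=$(printf '%s\n' '<action>INSERT</action>' '<mfr_id>100700</mfr_id>' |
      sed -En 's@</?mfr_id>@@gp')

echo "$ids"
```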

sycamorex 01-19-2011 08:52 AM

Quote:

Originally Posted by grail (Post 4230163)
Based on sycamorex's sed:
Code:

sed -rn 's@</?mfr_id>@@gp' file

... and I thought my sed would be the shortest one, LOL
Nice one!

bcrawl 01-19-2011 08:57 AM

Oh wow, lots of answers. Please give me some time to go through these. Thanks a lot for the help so far. This is a wonderful showcase of the practical use of these tools for me.


All times are GMT -5.