[SOLVED] xml parsing using sed?

bcrawl · 01-20-2011, 02:53 PM

Hey guys,

I have a huge xml file like this...

Code:

<manufacturers>

<manufacturer_data>
<action>UPDATE</action>
<mfr_id>6515951</mfr_id>
<local_content>0</local_content>
<name>Johnsonville Sausage, Llc</name>
</manufacturer_data>

<manufacturer_data>
<action>INSERT</action>
<mfr_id>6594084</mfr_id>
<local_content>0</local_content>
<name>Foodmark</name>
</manufacturer_data>

</manufacturers>

<brands>

<brand_data>
<action>INSERT</action>
<brand_id>6594088</brand_id>
<mfr_id>6594084</mfr_id>
<local_content>0</local_content>
<name>Good Food Made Simple</name>
</brand_data>

<brand_data>
<action>INSERT</action>
<brand_id>6523125</brand_id>
<mfr_id>105873</mfr_id>
<local_content>0</local_content>
<name>Hawaiian(Tm) Kettle Style Potato Chips</name>
</brand_data>
</brands>

Yesterday I asked for assistance to extract mfr_id from the list and I used

Code:

grep mfr_id | sed -rn 's@</?mfr_id>@@gp'

to extract the ids, which I then sorted and de-duplicated for my actual analysis.

Today, I am looking to extract <mfr_id> and <name> from <manufacturer_data>

The issue I am having:
- sed is extracting all instances of <name>, not just the ones inside <manufacturer_data>

So I need to
- tell sed to "hold" the data between <manufacturer_data> tags, strip the <mfr_id> and <name> tags from it, and print the values in columns.

This is a little above my league. Can someone help me out?

Tinkster · 01-20-2011, 06:57 PM

I'm sure this can be done w/ sed, but I'd use awk for this one:

Code:

awk '/<manufacturers>/,/<\/manufacturers>/{if($0~/<name>/){print gensub(/.*>([^<]+)<.*/,"\\1","1")}}' hooga.xml
Johnsonville Sausage, Llc
Foodmark

Btw, the grep statement in your solution above was superfluous.
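
To illustrate why the grep is not needed: with -n, sed prints nothing by default, and the p flag on s only prints lines where a substitution actually happened, so lines without mfr_id are dropped anyway. A minimal sketch, assuming GNU sed (for -r) and a made-up three-line file called ids.xml:

Code:

```shell
# Hypothetical sample file, just for this comparison.
printf '%s\n' '<mfr_id>6515951</mfr_id>' \
              '<name>Foodmark</name>' \
              '<mfr_id>6594084</mfr_id>' > ids.xml

# Original pipeline: grep filters first, then sed strips the tags.
with_grep=$(grep mfr_id ids.xml | sed -rn 's@</?mfr_id>@@gp')

# sed alone: only lines where the substitution succeeds are printed.
sed_only=$(sed -rn 's@</?mfr_id>@@gp' ids.xml)

printf '%s\n' "$sed_only"
# 6515951
# 6594084
```

Both variables end up holding the same two ids.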

Cheers,
Tink

grail · 01-21-2011, 01:34 AM

The sed looks kinda the same:

Code:

sed -rn '/<manufacturers>/,/<\/manufacturers>/s@</?name>@@pg' file
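
Both one-liners above print only the names. To get mfr_id and name side by side for each <manufacturer_data> block, as originally asked, something along these lines might work. It is only a sketch: it assumes every tag sits on its own line as in the posted sample, and the function name extract_mfr and the file sample.xml are made up for the example. It sticks to plain awk (no gensub), so it should not need gawk.

Code:

```shell
# Trimmed copy of the posted sample (hypothetical file name).
cat > sample.xml <<'EOF'
<manufacturers>
<manufacturer_data>
<action>UPDATE</action>
<mfr_id>6515951</mfr_id>
<local_content>0</local_content>
<name>Johnsonville Sausage, Llc</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>6594084</mfr_id>
<local_content>0</local_content>
<name>Foodmark</name>
</manufacturer_data>
</manufacturers>
<brands>
<brand_data>
<action>INSERT</action>
<brand_id>6594088</brand_id>
<mfr_id>6594084</mfr_id>
<local_content>0</local_content>
<name>Good Food Made Simple</name>
</brand_data>
</brands>
EOF

# Inside each <manufacturer_data> block, remember the id and the name,
# then print them tab-separated when the closing tag is reached.
extract_mfr() {
    awk '
    /<manufacturer_data>/,/<\/manufacturer_data>/ {
        if ($0 ~ /<mfr_id>/) { id = $0;   gsub(/[ \t]*<\/?mfr_id>[ \t]*/, "", id) }
        if ($0 ~ /<name>/)   { name = $0; gsub(/[ \t]*<\/?name>[ \t]*/, "", name) }
        if ($0 ~ /<\/manufacturer_data>/) { print id "\t" name; id = name = "" }
    }' "$1"
}

extract_mfr sample.xml
# 6515951	Johnsonville Sausage, Llc
# 6594084	Foodmark
```

The <brand_data> entries are skipped because the range only opens on <manufacturer_data>. For anything less regular than this sample (tags split across lines, attributes, nesting), a real XML parser would be the safer choice.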

bcrawl · 01-24-2011, 01:57 PM

Thanks guys, both commands worked. I thought I had replied to this thread, but while cross-checking it I realized my response never got posted. My apologies. I used the awk example in this case.