LinuxQuestions.org

bcrawl 01-20-2011 02:53 PM

xml parsing using sed?
 
Hey guys,

I have a huge xml file like this...
Code:

<manufacturers>

<manufacturer_data>
<action>UPDATE</action>
<mfr_id>6515951</mfr_id>
<local_content>0</local_content>
<name>Johnsonville Sausage, Llc</name>
</manufacturer_data>

<manufacturer_data>
<action>INSERT</action>
<mfr_id>6594084</mfr_id>
<local_content>0</local_content>
<name>Foodmark</name>
</manufacturer_data>

</manufacturers>

<brands>

<brand_data>
<action>INSERT</action>
<brand_id>6594088</brand_id>
<mfr_id>6594084</mfr_id>
<local_content>0</local_content>
<name>Good Food Made Simple</name>
</brand_data>

<brand_data>
<action>INSERT</action>
<brand_id>6523125</brand_id>
<mfr_id>105873</mfr_id>
<local_content>0</local_content>
<name>Hawaiian(Tm) Kettle Style Potato Chips</name>
</brand_data>
<brand_data>
</brands>

Yesterday I asked for help extracting mfr_id from this file, and I used
Code:

grep mfr_id | sed -rn 's@</?mfr_id>@@gp'
to extract the IDs, which I then sorted and de-duplicated for my actual analysis.

Today, I am looking to extract <mfr_id> and <name> from <manufacturer_data> only.

The issue I am having:
- sed is extracting all instances of <name>, including the ones under <brand_data>

So I need to
- tell sed to "hold" the data between <manufacturer_data> tags, strip the <mfr_id> and <name> tags, and print the two values in columns.

This is a little above my league. Can someone help me out?

Tinkster 01-20-2011 06:57 PM

I'm sure this can be done with sed, but I'd use awk for this one (gawk specifically, since gensub is a GNU extension):
Code:

awk '/<manufacturers>/,/<\/manufacturers>/{if($0~/<name>/){print gensub(/.*>([^<]+)<.*/,"\\1","1")}}' hooga.xml
Johnsonville Sausage, Llc
Foodmark

Btw, the grep statement in your solution above was superfluous.


Cheers,
Tink
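
The command above prints only the names, while the original question asked for mfr_id and name side by side. A minimal sketch of one way to get both columns with plain awk follows. It assumes every element sits on its own line, as in the sample, and the file name sample.xml and the function name extract_mfr are made up for illustration.

```shell
# Hypothetical sample file, trimmed from the data in the first post.
cat > sample.xml <<'EOF'
<manufacturers>
<manufacturer_data>
<action>UPDATE</action>
<mfr_id>6515951</mfr_id>
<local_content>0</local_content>
<name>Johnsonville Sausage, Llc</name>
</manufacturer_data>
<manufacturer_data>
<action>INSERT</action>
<mfr_id>6594084</mfr_id>
<local_content>0</local_content>
<name>Foodmark</name>
</manufacturer_data>
</manufacturers>
<brands>
<brand_data>
<action>INSERT</action>
<brand_id>6594088</brand_id>
<mfr_id>6594084</mfr_id>
<local_content>0</local_content>
<name>Good Food Made Simple</name>
</brand_data>
</brands>
EOF

# Split each line on "<" or ">", so "<mfr_id>6515951</mfr_id>" yields
# $2 = "mfr_id" and $3 = "6515951". Values are remembered only while
# inside a <manufacturer_data> block and printed when the block closes.
extract_mfr() {
  awk -F'[<>]' '
    /<manufacturer_data>/    { inside = 1; id = ""; name = "" }
    inside && $2 == "mfr_id" { id = $3 }
    inside && $2 == "name"   { name = $3 }
    /<\/manufacturer_data>/  { if (inside) print id "\t" name; inside = 0 }
  ' "$1"
}

extract_mfr sample.xml
```

On the sample this prints two tab-separated rows, one per manufacturer, and skips the <name> and <mfr_id> lines under <brand_data>. It is a line-oriented shortcut and not a real XML parser, so it would need rework if tags are nested on one line or values span several lines.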

grail 01-21-2011 01:34 AM

The sed looks kinda the same:
Code:

sed -rn '/<manufacturers>/,/<\/manufacturers>/s@</?name>@@pg' file
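
The one-liner above also prints only the <name> values. Since the question mentioned "holding" data, here is a rough sketch of how sed's hold space could pair each mfr_id with its name. It assumes one element per line and that <mfr_id> comes before <name> inside each block, as in the sample. The file name mfr.xml and the function name extract_mfr_sed are hypothetical.

```shell
# Hypothetical, shortened sample in the same shape as the original data.
cat > mfr.xml <<'EOF'
<manufacturers>
<manufacturer_data>
<mfr_id>6515951</mfr_id>
<name>Johnsonville Sausage, Llc</name>
</manufacturer_data>
<manufacturer_data>
<mfr_id>6594084</mfr_id>
<name>Foodmark</name>
</manufacturer_data>
</manufacturers>
<brands>
<brand_data>
<mfr_id>6594084</mfr_id>
<name>Good Food Made Simple</name>
</brand_data>
</brands>
EOF

extract_mfr_sed() {
  tab=$(printf '\t')   # literal tab, so we do not rely on \t in the replacement
  sed -n '/<manufacturer_data>/,/<\/manufacturer_data>/{
    /<mfr_id>/{
      # strip the tags and stash the bare id in the hold space
      s@</*mfr_id>@@g
      h
    }
    /<name>/{
      # strip the tags, append the name to the held id, swap it back,
      # join the two lines with a tab, and print
      s@</*name>@@g
      H
      x
      s/\n/'"$tab"'/
      p
    }
  }' "$1"
}

extract_mfr_sed mfr.xml
```

The address range limits the commands to lines between <manufacturer_data> and its closing tag, so the entries under <brand_data> are never touched.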

bcrawl 01-24-2011 01:57 PM

Thanks guys, both commands worked. I thought I had replied to this thread, but when I came back to check it I realized my response never got posted. My apologies. I used the awk example in this case.


All times are GMT -5.