[SOLVED] Retrieve results of multiple tags and separate by a delimiter to be parsed by excel

threezerous · 02-21-2013, 04:43 PM

I ran a grep for a string xyz in a bunch of xml files and got results from five files:

/path1/abc1.xml: <description>xyz</description>
/path2/abc2.xml: <genre>xyz</genre>
/path3/abc3.xml: <genre>xyz</genre>
/path4/abc4.xml: <description>xyz</description>
/path5/abc5.xml: <genre>xyz</genre>

Each of these xml files has multiple tags, and I need to retrieve the values of two tags which are not on the same line and attach them to the respective file name.

A sample xml file looks something like

<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>

I need to retrieve the values of the author and title tags and associate them with each file name, so that the output looks something like

/path1/abc1.xml: ~<author>Gambardella, Matthew</author> ~ <title>XML Developer's Guide</title>
/path2/abc2.xml: ~<author>King, Stephen</author> ~ <title>Java Developer's Guide</title>
/path3/abc3.xml: ~<author>Hailey, Arthur</author> ~ <title>CWNA Developer's Guide</title>
... and so on
where ~ is the delimiter I put (don't care what it is)

I am OK with reading through the output of the first grep in a while loop. The perl script below does give output, but it puts the tags on two different lines:

cat /path1/abc1.xml | perl -e 'while (<>) { print $_ if ( $_ =~ /\<(author|title)\>.*\<\/(author|title)\>/ ); last if ($_ =~ /\<\/book\>/) }'
I also tried sed options, but again I could only get as far as the first tag.

Any help or suggestions? Thanks in advance. If somebody wishes to try this I have put sample attachments to this thread for convenience.

allend · 02-21-2013, 06:10 PM

My bash suggestion, to be run from the directory above path1, path2 etc.
If it looks OK, redirect output to a file.

Code:

#!/bin/bash

for file in */*.xml; do
  au=$(grep "<author>" "$file");
  ti=$(grep "<title>" "$file");
  echo "$file, $au, $ti";
done

allend · 02-21-2013, 06:54 PM

If you are wanting to export to Excel, then you may want to have the text strings enclosed in double quotes.

Code:

echo '"'"$file"'", "'"$au"'", "'"$ti"'"';
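One caveat with the quoting: if a value itself contains a double quote, CSV readers such as Excel generally expect it to be doubled. A minimal sketch of a helper that does this, assuming a POSIX shell and sed; csv_field is a made-up name, and the sample values are stand-ins for $file, $au and $ti:

```shell
#!/bin/sh
# csv_field is a hypothetical helper, not part of the script above.
# It doubles any embedded double quote and wraps the result in double quotes.
csv_field() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Made-up sample values standing in for $file, $au and $ti.
file='path1/abc1.xml'
au='<author>Gambardella, Matthew</author>'
ti='<title>The "XML" Guide</title>'

# Print one comma-separated record.
printf '%s,%s,%s\n' "$(csv_field "$file")" "$(csv_field "$au")" "$(csv_field "$ti")"
```

Inside the loop, the echo line would then become the printf line shown at the end.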

chrism01 · 02-22-2013, 12:29 AM

I think we need some clarification; the example input has 2 authors and 2 titles, but the 'desired' output only has one for each file.

threezerous · 02-22-2013, 09:02 AM

Chris,

You are right. The desired output needs the first occurrence of each tag. I should have been more specific. Going to try Allend's suggestion now. Thanks for reading through the long question and for your suggestions.
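For reference, a rough sketch of how allend's loop might be limited to the first occurrence of each tag. It assumes a grep that supports -m (GNU and BSD grep do), and that each tag opens and closes on one line, as author and title do in the sample. first_tag_line is a made-up helper name:

```shell
#!/bin/sh
# first_tag_line is a hypothetical helper: it prints the first line of
# file $2 that contains an opening <$1> tag, with leading blanks removed.
first_tag_line() {
  grep -m 1 "<$1>" "$2" | sed 's/^[[:space:]]*//'
}

# Run from the directory above path1, path2 etc., as in allend's script.
for file in */*.xml; do
  [ -f "$file" ] || continue   # skip the unexpanded pattern if nothing matches
  au=$(first_tag_line author "$file")
  ti=$(first_tag_line title "$file")
  printf '%s: ~%s ~ %s\n' "$file" "$au" "$ti"
done
```

This is still line-based matching, so a tag whose value wraps onto a second line (like description in the sample) would come out truncated.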

David the H. · 02-24-2013, 07:20 PM

Please use [code][/code] tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.

Line and regex based tools like grep/sed/awk are not well suited for parsing xml's nested, free-form structure. It's usually better to use a tool that has a dedicated xml parser.

Perl, which you mentioned, has xml modules available, but I'm not that familiar with it. I can only show you the one I know, which is xmlstarlet.

I got the following to work on the above example (after closing out the catalog tag):

Code:

$ xmlstarlet sel -T -t -f -v 'concat(":<author>",//book[1]/author,"</author>~")' -v 'concat("<title>",//book[1]/title,"</title>")' -n infile.xml
infile.xml:<author>Gambardella, Matthew</author>~<title>XML Developer's Guide</title>

To break it down, sel is the command for extraction. -T outputs plain text, and -t starts the template command. Inside the template, -f prints the filename, the two -v commands print the extracted values, and -n adds a newline at the end.

concat is an xpath function that combines text strings together. "//book[1]/author" extracts the value of the author tag inside the first book tag. Same goes with the title. The text strings on either side reconstruct the tag brackets and the delimiters around them.
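A possible variant for running over all the files at once, offered as a sketch rather than something tested on the real data. It assumes xmlstarlet applies the template to each input file in turn, and it uses -o to print the literal delimiters instead of concat. first_book_summary is a made-up wrapper name, and the output carries the bare values without the tag brackets:

```shell
#!/bin/sh
# first_book_summary is a hypothetical wrapper name.
# For each file given, print: filename: ~author ~ title
# using the author and title of the first book element in the document.
first_book_summary() {
  xmlstarlet sel -T -t \
    -f -o ': ~' \
    -v '(//book)[1]/author' -o ' ~ ' \
    -v '(//book)[1]/title' -n \
    "$@"
}
```

It could then be called as first_book_summary /path1/abc1.xml /path2/abc2.xml, or with a glob, and the output redirected to a file for Excel.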

There may be a way to print the whole entry directly, but I'm not familiar enough with it myself to know how. Also, xmlstarlet insists on well-formed xml input, so you may need to clean up the formatting first.

Or, as another option, try using the pyx command, which converts the xml into a line-based representation that you can more safely parse with sed or awk.

There's also a tool in the html-xml-utils package called hxpipe which can print out a similar line-based format, and it's a bit more robust on the input it can handle.
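To give an idea of what that parsing could look like: in PYX notation every line starts with a type character, "(" for an opening tag, ")" for a closing tag, "A" for an attribute and "-" for text. A minimal awk sketch that reads PYX on stdin and prints the text of the first element with a given name; pyx_first_text is a made-up function name:

```shell
#!/bin/sh
# pyx_first_text is a hypothetical helper. It reads PYX lines on stdin and
# prints the first text line found inside the first element named $1.
pyx_first_text() {
  awk -v tag="$1" '
    $0 == "(" tag                   { grab = 1; next }            # opening tag seen
    grab && $0 == ")" tag           { exit }                      # element was empty
    grab && substr($0, 1, 1) == "-" { print substr($0, 2); exit } # text line
  '
}
```

With xmlstarlet, which has a pyx subcommand, usage would be something like: xmlstarlet pyx infile.xml | pyx_first_text author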