LinuxQuestions.org
Old 02-21-2013, 05:43 PM   #1
threezerous
Member
 
Registered: Jul 2009
Posts: 89

Rep: Reputation: 15
Retrieve results of multiple tags and separate by a delimiter to be parsed by excel


I ran a grep for the string xyz across a bunch of XML files and got matches in five files:

/path1/abc1.xml: <description>xyz</description>
/path2/abc2.xml: <genre>xyz</genre>
/path3/abc3.xml: <genre>xyz</genre>
/path4/abc4.xml: <description>xyz</description>
/path5/abc5.xml: <genre>xyz</genre>

Each of these XML files has multiple tags, and I need to retrieve the values of two tags that are not on the same line and attach them to the respective file name.

A sample xml file looks something like

<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>

I need to retrieve the values of the author and title tags and associate them with each file name, so that the output looks something like

/path1/abc1.xml: ~<author>Gambardella, Matthew</author> ~ <title>XML Developer's Guide</title>
/path2/abc2.xml: ~<author>King, Stephen</author> ~ <title>Java Developer's Guide</title>
/path3/abc3.xml: ~<author>Hailey, Arthur</author> ~ <title>CWNA Developer's Guide</title>
... and so on
where ~ is the delimiter I chose (I don't care what it is).

I am OK with reading through the output of the first grep in a while loop. The perl script below does give output, but it puts the two tags on separate lines:


cat /path1/abc1.xml | perl -e 'while (<>) { print $_ if ( $_ =~ /\<(author|title)\>.*\<\/(author|title)\>/ ); last if ($_ =~ /\<\/book\>/) }'
I also tried sed options, but again I could only get as far as the first tag.

Any help or suggestions? Thanks in advance. If somebody wishes to try this, I have attached sample files to this thread for convenience.
Attached Files
File Type: txt abc1.txt (745 Bytes, 7 views)
File Type: txt abc2.txt (1.5 KB, 6 views)
File Type: txt abc3.txt (1.4 KB, 6 views)

Last edited by threezerous; 02-21-2013 at 05:45 PM.
 
Old 02-21-2013, 07:10 PM   #2
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 3,486

Rep: Reputation: 856
My bash suggestion, to be run from the directory above path1, path2 etc.
If it looks OK, redirect output to a file.
Code:
#!/bin/bash

for file in */*.xml; do
  au=$(grep "<author>" "$file");
  ti=$(grep "<title>" "$file");
  echo "$file, $au, $ti";
done

Last edited by allend; 02-21-2013 at 07:11 PM.
 
Old 02-21-2013, 07:54 PM   #3
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 3,486

Rep: Reputation: 856
If you are wanting to export to Excel, then you may want to have the text strings enclosed in double quotes.
Code:
printf '"%s", "%s", "%s"\n' "$file" "$au" "$ti"
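
Wrapping each value in double quotes only holds up while the values themselves contain no double quotes. A minimal sketch of a more defensive version follows, assuming the usual CSV convention of doubling any embedded double quote. The function names csv_field and csv_row are made up for illustration. It also leaves out the space after each comma, since some CSV readers may treat a space before the opening quote as part of the field.

```shell
#!/bin/bash
# csv_field and csv_row are hypothetical helper names, not part of any tool.

# Quote one field for CSV: double every embedded double quote, then
# wrap the whole field in double quotes.
csv_field() {
  local s=$1 q='"'
  s=${s//$q/$q$q}
  printf '"%s"' "$s"
}

# Join all arguments into one comma-separated CSV row.
csv_row() {
  local out="" field
  for field in "$@"; do
    # ${out:+,} adds a comma only once the row already has a field
    out+="${out:+,}$(csv_field "$field")"
  done
  printf '%s\n' "$out"
}
```

Inside the loop, the echo line would then become: csv_row "$file" "$au" "$ti"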

Last edited by allend; 02-21-2013 at 08:02 PM.
 
Old 02-22-2013, 01:29 AM   #4
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,311

Rep: Reputation: 2040
I think we need some clarification; the example input has 2 authors and 2 titles, but the 'desired' output only has one for each file.
 
Old 02-22-2013, 10:02 AM   #5
threezerous
Member
 
Registered: Jul 2009
Posts: 89

Original Poster
Rep: Reputation: 15
Chris,

You are right. The desired output needs only the first occurrence of each tag. I should have been more specific. I am going to try Allend's suggestion now. Thanks for reading through the long question and for your suggestions.
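
A rough sketch of the first-occurrence requirement, assuming a grep that supports -m (GNU grep does) and that each author and title tag sits on a single line, as in the sample. The function name first_tags is made up for illustration, and ~ is the delimiter from the question.

```shell
#!/bin/bash
# first_tags is a hypothetical helper name.
# For every file given, print: file~<author>...</author>~<title>...</title>
# using only the first matching line for each tag.
first_tags() {
  local file au ti
  for file in "$@"; do
    # -m 1 makes grep stop after the first matching line
    au=$(grep -m 1 '<author>' "$file")
    ti=$(grep -m 1 '<title>' "$file")
    # strip leading whitespace left over from any XML indentation
    au=${au#"${au%%[![:space:]]*}"}
    ti=${ti#"${ti%%[![:space:]]*}"}
    printf '%s~%s~%s\n' "$file" "$au" "$ti"
  done
}
```

Run from the directory above path1, path2 etc., it could be called as: first_tags */*.xml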
 
Old 02-24-2013, 08:20 PM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1950
Please use [code][/code] tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.


Line and regex based tools like grep/sed/awk are not well suited for parsing xml's nested, free-form structure. It's usually better to use a tool that has a dedicated xml parser.

Perl, which you mentioned, has xml modules available, but I'm not that familiar with it. I can only show you the one I know, which is xmlstarlet.

I got the following to work on the above example (after closing out the catalog tag):

Code:
$ xmlstarlet sel -T -t -f -v 'concat(":<author>",//book[1]/author,"</author>~")' -v 'concat("<title>",//book[1]/title,"</title>")' -n infile.xml
infile.xml:<author>Gambardella, Matthew</author>~<title>XML Developer's Guide</title>
To break it down, sel is the command for extraction. -T outputs plain text, and -t starts the template command. Inside the template, -f prints the filename, the two -v commands print the extracted values, and -n adds a newline at the end.

concat is an xpath function that combines text strings together. "//book[1]/author" extracts the value of the author tag inside the first book tag. Same goes with the title. The text strings on either side reconstruct the tag brackets and the delimiters around them.

There may be a way to print the whole entry directly, but I'm not familiar enough with it myself to know how. Also, xmlstarlet insists on well-formed xml input, so you may need to clean up the formatting first.

Or, as another option, try using the pyx command, which converts the xml into a line-based representation that you can more safely parse with sed or awk.

There's also a tool in the html-xml-utils package called hxpipe which can print out a similar line-based format, and it's a bit more robust on the input it can handle.
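
To give an idea of what parsing that line-based form looks like, here is a minimal awk sketch. It assumes the input is already in PYX notation, where a line starting with "(" opens an element, ")" closes it, "-" carries character data and "A" carries an attribute. The function name first_author_title is made up for illustration, and the sketch only handles text that arrives on a single "-" line.

```shell
#!/bin/bash
# first_author_title is a hypothetical helper name.
# Read PYX lines on stdin and print the text of the first author and
# the first title element, joined by "~".
first_author_title() {
  awk '
    /^\(/ { tag = substr($0, 2) }   # "(name" opens an element
    /^\)/ { tag = "" }              # ")name" closes it
    /^-/  {                         # "-text" is character data
      if (tag == "author" && !got_au) { au = substr($0, 2); got_au = 1 }
      if (tag == "title"  && !got_ti) { ti = substr($0, 2); got_ti = 1 }
    }
    END { print au "~" ti }
  '
}
```

Assuming xmlstarlet's pyx subcommand is the converter in use, the call would look like: xmlstarlet pyx infile.xml | first_author_title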
 
  

