LinuxQuestions.org
Download your favorite Linux distribution at LQ ISO.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 06-05-2012, 07:42 AM   #1
neverton
LQ Newbie
 
Registered: Jun 2012
Posts: 4

Rep: Reputation: Disabled
XML to CSV with a resursive element to it


Hi,
I've been using Linux for many years, but my scripting knowledge is pretty much useless (I code using C and/or Java for most of my needs)

I've got a project where I'm being asked to convert a file from XML to CSV then process the CSV file, etc.

I've googled about awk, sed and perl but I thought before I give it much more time I'd ask for some advice to see if its worth my going down these routes.

Here's a sample of the XML file ;
<OrderMessage>
<Header>
<TerminalID>TEST123</TerminalID>
<ProductCodeType>PIP</ProductCodeType>
<OrderType>N</OrderType>
<AccountNumber>012345</AccountNumber>
<DateTimeSent>2007-09-12T09:24:48</DateTimeSent>
</Header>
<Order>
<Line>
<LineNumber>1</LineNumber>
<ProductCode>1003763</ProductCode>
<PackSize>1</PackSize>
<Quantity>1</Quantity>
<StickerPrice>0</StickerPrice>
</Line>
<Line>
<LineNumber>2</LineNumber>
<ProductCode>2551661</ProductCode>
<PackSize>1</PackSize>
<Quantity>1</Quantity>
<StickerPrice>0</StickerPrice>
</Line>
</Order>
</OrderMessage>

The only elements I'm interested in are;
[AccountNumber]
and then
[ProductCode],[Quantity]
[ProductCode],[Quantity]
...

If I can get it to output AccNo=[AccountNumber] then all the better (saves me another job)

All the scripts I've seen don't seem to allow for retreiving multiple elements within a section, so I've got a script to retreive the [ProductCode], or the [Quantity] fields, but I've not seen anything to retrieve [ProductCode],[Quantity]

So basically, can I do this via perl,awk or should I just code something in C and stop looking into scripts?

Regards
Neil
 
Old 06-05-2012, 08:04 AM   #2
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: Disabled
Hi, welcome to LQ,

I'd recommend to use Perl instead of awk. There are modules for Perl to handle XML, you can find such modules at cpan http://search.cpan.org/search?query=xml&mode=all

If you simply want to extract the desired data from an XML file, you could use regular expressions. This is relatively simple with Perl.

I hope that gets you in the right direction. If you want to get more help, please post the code which you've written so far.

Markus

BTW: I've reported this thread and asked it to be moved to the Programming-Forum.

Last edited by markush; 06-05-2012 at 08:09 AM.
 
Old 06-05-2012, 08:18 AM   #3
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,850

Rep: Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309
using awk the trick can be: to set a record separator to fit your needs. As I see <line> looks good (if you do not need the header part)
Use all the non-required chars as separator, so set it as: "</>\n"
next, you will have fields, $1 is linenumber, $2 its value, $3 linenumber again, $4 productcode ....
finally you can format the output as you like.
Code:
awk ' BEGIN { RS="<line>"; FS="[</>\n]*"; { printf "%s=%s, %s=%s, %s=%s\n", ..... } ' filename
this is not tested. The first record will contain account number, you need to handle it differently. something like:
Code:
/AccountNumber/ { printf "....", ....; next }
{ printf "%s=%s, %s=%s, %s=%s\n", .....; }                 ## <- this is the line handling records without account nnumbers
 
Old 06-05-2012, 08:42 AM   #4
neverton
LQ Newbie
 
Registered: Jun 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hi Markush and pan64,
Thanks for both the replies, I'll look into both of them, as I said my coding is pretty much non-existant in awk and perl, so I'll look for some examples and start there.

markush, yeah I noticed there was a programming section after I'd hit submit, I put it in here as other posts on a similar subject where in here as well, my bad.

Regards
Neil
 
Old 06-05-2012, 10:17 AM   #5
neverton
LQ Newbie
 
Registered: Jun 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hi,
I've sorted it, in a very basic sense, but it works for my needs.

Wrote a small script which parses the file twice, once for the account number, then again for the line details.

Here's the content of the script;

echo -n "AccNo=" > test.csv
awk 'BEGIN{RS = "[<>]"}{ORS = /Header/?"\n":""}/Header/,/\/item/{if(/^(AccountNumber)$/ && getline)print}' taylors.xml >> test.csv
echo "" >> test.csv
awk 'BEGIN{RS = "[<>]"}{ORS = /Quantity/?"\n":","}/ProductCode/,/\/item/{if(/^(ProductCode|Quantity)$/ && getline)print}' taylors.xml >> test.csv

First line creates a file with the AccNo wording I need.
Second line parses the file and strips out the accountnumber section
Third line puts a newline after the accountnumber
Fourth line parses and appends the line and quantity details

Thanks to markush and pan64 for your help

Regards
Neil
 
Old 06-05-2012, 10:39 AM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,850

Rep: Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309Reputation: 7309
That is great, I have a much clearer solution:

Code:
awk 'BEGIN { FS = "[<>/]"} /AccountNumber/ { printf "AccNo=%s\n", $3 }      # if you want an additional line, write "AccNo=%s\n\n"                     
                           /ProductCode>/  { printf "%s,", $3 }
                           /Quantity/      { print $3 }               ' taylors.xml > test.csv




__________________________________
Happy with solution ... mark as SOLVED
If someone helps you, or you approve of what's posted, click the "Add to Reputation" button, on the left of the post.
 
Old 06-05-2012, 10:43 AM   #7
neverton
LQ Newbie
 
Registered: Jun 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
pan64,
That's a genius solution, much slicker than my fudge-up

Many thanks
Neil
 
Old 06-07-2012, 09:23 AM   #8
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908
Please note that simple-minded XML parsers are prone to breakage if the 'visual' format of the XML data changes. It is quite valid XML to put the entire file on one line, or to otherwise insert or remove whitespace in many ways. A proper XML parser will be indifferent to this kind of formatting. The following are all equivalent XML:
Code:
<AccountNumber>012345</AccountNumber>
Code:
<AccountNumber>
012345
</AccountNumber>
Code:
<AccountNumber>
    012345
</AccountNumber>
Code:
<AccountNumber>
012345</AccountNumber>
Code:
<AccountNumber>


012345   </AccountNumber>
There can be even more severe differences. The possibility of these is one very good reason to use a standard-compliant tested XML parser module/library. Most common programming languages have bindings to at least a couple of different styles of XML parsers.
--- rod.

Last edited by theNbomr; 06-07-2012 at 09:27 AM.
 
1 members found this post helpful.
Old 06-07-2012, 02:06 PM   #9
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,781

Rep: Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081
Solution using xmlstaret:
Code:
xml sel -T -t -o AccNo= -v /OrderMessage/Header/AccountNumber -n \
  -m /OrderMessage/Order/Line -v ProductCode -o , -v Quantity -n \
  order.xml
@theNbomr: actually whitespace inside a node is significant; unless you pass it a schema saying otherwise, the parser has to assume the node contains arbitrary text.
 
Old 06-07-2012, 07:58 PM   #10
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908Reputation: 908
Quote:
Originally Posted by ntubski View Post
@theNbomr: actually whitespace inside a node is significant; unless you pass it a schema saying otherwise, the parser has to assume the node contains arbitrary text.
I stand corrected on that, then. However, I think it underscores the importance of a proper XML parser. If what you say is true, I think I will have to investigate how well some of my favorite standards do, in that respect.

--- rod.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Awk xml convert to csv frambau Linux - Newbie 15 02-09-2012 07:44 AM
XML to CSV grob115 Programming 11 04-12-2010 11:58 AM
XML Schema element references Omni Programming 0 09-22-2006 02:07 PM
XML question multiple parents for child element pld Programming 1 03-17-2005 02:14 PM
PHP - XML error: junk after document element at line... jimieee Programming 2 04-27-2004 11:16 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 10:03 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration