XML to CSV with a resursive element to it

neverton · 06-05-2012, 07:42 AM

Hi,
I've been using Linux for many years, but my scripting knowledge is pretty much useless (I code using C and/or Java for most of my needs)

I've got a project where I'm being asked to convert a file from XML to CSV then process the CSV file, etc.

I've googled about awk, sed and perl but I thought before I give it much more time I'd ask for some advice to see if its worth my going down these routes.

Here's a sample of the XML file ;
<OrderMessage>
<Header>
<TerminalID>TEST123</TerminalID>
<ProductCodeType>PIP</ProductCodeType>
<OrderType>N</OrderType>
<AccountNumber>012345</AccountNumber>
<DateTimeSent>2007-09-12T09:24:48</DateTimeSent>
</Header>
<Order>
<Line>
<LineNumber>1</LineNumber>
<ProductCode>1003763</ProductCode>
<PackSize>1</PackSize>
<Quantity>1</Quantity>
<StickerPrice>0</StickerPrice>
</Line>
<Line>
<LineNumber>2</LineNumber>
<ProductCode>2551661</ProductCode>
<PackSize>1</PackSize>
<Quantity>1</Quantity>
<StickerPrice>0</StickerPrice>
</Line>
</Order>
</OrderMessage>

The only elements I'm interested in are;
[AccountNumber]
and then
[ProductCode],[Quantity]
[ProductCode],[Quantity]
...

If I can get it to output AccNo=[AccountNumber] then all the better (saves me another job)

All the scripts I've seen don't seem to allow for retreiving multiple elements within a section, so I've got a script to retreive the [ProductCode], or the [Quantity] fields, but I've not seen anything to retrieve [ProductCode],[Quantity]

So basically, can I do this via perl,awk or should I just code something in C and stop looking into scripts?

Regards
Neil

markush · 06-05-2012, 08:04 AM

Hi, welcome to LQ,

I'd recommend to use Perl instead of awk. There are modules for Perl to handle XML, you can find such modules at cpan http://search.cpan.org/search?query=xml&mode=all

If you simply want to extract the desired data from an XML file, you could use regular expressions. This is relatively simple with Perl.

I hope that gets you in the right direction. If you want to get more help, please post the code which you've written so far.

Markus

BTW: I've reported this thread and asked it to be moved to the Programming-Forum.

pan64 · 06-05-2012, 08:18 AM

using awk the trick can be: to set a record separator to fit your needs. As I see <line> looks good (if you do not need the header part)
Use all the non-required chars as separator, so set it as: "</>\n"
next, you will have fields, $1 is linenumber, $2 its value, $3 linenumber again, $4 productcode ....
finally you can format the output as you like.

Code:

awk ' BEGIN { RS="<line>"; FS="[</>\n]*"; { printf "%s=%s, %s=%s, %s=%s\n", ..... } ' filename

this is not tested. The first record will contain account number, you need to handle it differently. something like:

Code:

/AccountNumber/ { printf "....", ....; next }
{ printf "%s=%s, %s=%s, %s=%s\n", .....; }                 ## <- this is the line handling records without account nnumbers

neverton · 06-05-2012, 08:42 AM

Hi Markush and pan64,
Thanks for both the replies, I'll look into both of them, as I said my coding is pretty much non-existant in awk and perl, so I'll look for some examples and start there.

markush, yeah I noticed there was a programming section after I'd hit submit, I put it in here as other posts on a similar subject where in here as well, my bad.

Regards
Neil

neverton · 06-05-2012, 10:17 AM

Hi,
I've sorted it, in a very basic sense, but it works for my needs.

Wrote a small script which parses the file twice, once for the account number, then again for the line details.

Here's the content of the script;

echo -n "AccNo=" > test.csv
awk 'BEGIN{RS = "[<>]"}{ORS = /Header/?"\n":""}/Header/,/\/item/{if(/^(AccountNumber)$/ && getline)print}' taylors.xml >> test.csv
echo "" >> test.csv
awk 'BEGIN{RS = "[<>]"}{ORS = /Quantity/?"\n":","}/ProductCode/,/\/item/{if(/^(ProductCode|Quantity)$/ && getline)print}' taylors.xml >> test.csv

First line creates a file with the AccNo wording I need.
Second line parses the file and strips out the accountnumber section
Third line puts a newline after the accountnumber
Fourth line parses and appends the line and quantity details

Thanks to markush and pan64 for your help

Regards
Neil

pan64 · 06-05-2012, 10:39 AM

That is great, I have a much clearer solution:

Code:

awk 'BEGIN { FS = "[<>/]"} /AccountNumber/ { printf "AccNo=%s\n", $3 }      # if you want an additional line, write "AccNo=%s\n\n"                     
                           /ProductCode>/  { printf "%s,", $3 }
                           /Quantity/      { print $3 }               ' taylors.xml > test.csv

__________________________________
Happy with solution ... mark as SOLVED
If someone helps you, or you approve of what's posted, click the "Add to Reputation" button, on the left of the post.

neverton · 06-05-2012, 10:43 AM

pan64,
That's a genius solution, much slicker than my fudge-up

Many thanks
Neil

theNbomr · 06-07-2012, 09:23 AM

Please note that simple-minded XML parsers are prone to breakage if the 'visual' format of the XML data changes. It is quite valid XML to put the entire file on one line, or to otherwise insert or remove whitespace in many ways. A proper XML parser will be indifferent to this kind of formatting. The following are all equivalent XML:

Code:

<AccountNumber>012345</AccountNumber>

Code:

<AccountNumber>
012345
</AccountNumber>

Code:

<AccountNumber>
    012345
</AccountNumber>

Code:

<AccountNumber>
012345</AccountNumber>

Code:

<AccountNumber>


012345   </AccountNumber>

There can be even more severe differences. The possibility of these is one very good reason to use a standard-compliant tested XML parser module/library. Most common programming languages have bindings to at least a couple of different styles of XML parsers.
--- rod.

ntubski · 06-07-2012, 02:06 PM

Solution using xmlstaret:

Code:

xml sel -T -t -o AccNo= -v /OrderMessage/Header/AccountNumber -n \
  -m /OrderMessage/Order/Line -v ProductCode -o , -v Quantity -n \
  order.xml

@theNbomr: actually whitespace inside a node is significant; unless you pass it a schema saying otherwise, the parser has to assume the node contains arbitrary text.

theNbomr · 06-07-2012, 07:58 PM

Quote:

Originally Posted by ntubski

@theNbomr: actually whitespace inside a node is significant; unless you pass it a schema saying otherwise, the parser has to assume the node contains arbitrary text.

I stand corrected on that, then. However, I think it underscores the importance of a proper XML parser. If what you say is true, I think I will have to investigate how well some of my favorite standards do, in that respect.

--- rod.