Digiot, thank you, but the data was all put on one single line, which would make it very hard to use given that this script is going to have to parse a lot of products. It did clean up the output nicely, though, probably thanks to the sed usage.
Okay - this has nothing to do with your current problem but just to note for general purposes that it was all on one line because I only had a single sample to work with and committed a thinko there. If it worked at all (which it basically doesn't), it would be one line per item if I'd appended a newline to the caption action. '/Caption/{ printf $2"\n" }'
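To illustrate the one-line-per-item point, here is a minimal sketch. The sample input and the ': ' field separator are made up for the example (the real data from earlier in the thread isn't reproduced here). It also passes the caption as an argument to "%s\n" rather than using it as the printf format itself, so a stray % in a caption isn't misread as a format specifier.

```shell
# Hypothetical sample input: one "Caption: text" line per product.
sample='Caption: Red widget
Price: 9.99
Caption: Blue widget, 100% cotton
Price: 4.50'

# Print each caption on its own line. The value is passed as an argument
# to "%s\n" instead of being used as the format string.
captions=$(printf '%s\n' "$sample" |
  awk -F': ' '/^Caption/ { printf "%s\n", $2 }')

printf '%s\n' "$captions"
```

With the sample above this gives two output lines, one per caption.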
Also, a couple of notes - as Tinkster notes, you seem to be using mawk which is a lighter implementation of awk, but it's not like gawk is a real heavyweight compared to most other language interpreters in the first place. I think some distros like Arch used to ship mawk and then changed their minds. Perhaps Ubuntu will too.
And, given the weirdness of the input and the fact that you want csv data, I can't believe there's not some dedicated xml2csv tool out there - or that using an xml parser in some fashion might not be more targeted. If not that, there is an 'xmlgawk' out there which is supposed to be 'gawk with xml extensions' to make this sort of job easier, but I've never used it.
And depending on your use for it, not all CSV is created equal. Some will want string values quoted and not numeric values, while some is literally just 'comma separated'.
But Tinkster's code works fine for me too (currently on a Debian system - which I added gawk to almost immediately upon installing) so hopefully it'll work for you.
gawk was already installed, did an apt-get install gawk and it said i already had the latest version.
It gives the exact same results as awk, and when i remove the perl bit, i still get all the same results.
Very strange. Why would the same program give you different results on your system...There's gotta be something else we're missing here...because if my awk and gawk function this differently than yours, then how could programs that rely on those two ever function correctly?
Huh. Just because you've installed gawk doesn't mean you're using gawk but the 'alternatives' mess of Debianish systems should give priority to gawk. Or if you've specifically tried both, then it's definitely not a mawk issue, either way.
It does seem to be a gensub issue, since it emits the bare quotes and apparently no error messages, but doesn't output anything that passes through gensub. I dunno what the deal is. All I can say is make sure you've specifically run 'gawk foo', because I'm not sure from your post if you did.
How would i specifically run "gawk foo" then? I'm still rather new to gawk/awk/etc, so i'm not sure how to tell what i'm running, other than the fact that i put gawk or awk on the command line.
I had no idea that i could put gawk on the command line but actually be running something else. On your system, how do you verify what you're running?
The alternative is: could Tinkster's code be somehow converted or adapted to work on my ubuntu "true awk" deprived system?
Also, when i change the command to mawk instead of gawk or awk, i get this:
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
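For reference, gensub() is a gawk extension, which is why mawk rejects it. A rough sketch of a portable stand-in follows, assuming the script only needs "replace every match and return the result" (the thread doesn't show what line 26 of test.awk actually does, so the tag-stripping regex and the names strip_tags and replace_all are purely illustrative). It relies on awk passing scalars by value, so gsub() changes only the local copy.

```shell
strip_tags() {
  awk '
    # Return a copy of str with every match of re replaced by repl,
    # leaving the original untouched (gsub alone modifies in place).
    function replace_all(re, repl, str) {
      gsub(re, repl, str)
      return str
    }
    { print replace_all("<[^>]*>", "", $0) }
  '
}

cleaned=$(printf '%s\n' 'A <b>bold</b> caption' | strip_tags)
printf '%s\n' "$cleaned"
```

One limitation: gsub() only supports & for the whole match, so a gensub() call that uses backreferences like \\1 can't be converted this way without further rework.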
That's quite bizarre ... what's your locale, and what is
the file's encoding?
[edit]
Quote:
Originally Posted by xmrkite
The alternative is: could Tinkster's code be somehow converted or adapted to work on my ubuntu "true awk" deprived system?
The fact that you didn't get gensub errors suggests
that you were using gawk in the first place. Did
something maybe go wrong in the copy & paste process
of the script?
[/edit]
Quote:
Originally Posted by digiot
And, given the weirdness of the input and the fact that you want csv data, I can't believe there's not some dedicated xml2csv tool out there - or that using an xml parser in some fashion might not be more targeted. If not that, there is an 'xmlgawk' out there which is supposed to be 'gawk with xml extensions' to make this sort of job easier, but I've never used it.
The issue with his XML is not the XML part as such, awk
is perfectly capable of dealing with that ... the weird
stuff is the HTML markup crammed inside it, and that it's
inconsistent ... plain text for some items, dynamic mark-up
for other bits ...
Cheers,
Tink
Last edited by Tinkster; 02-24-2008 at 11:44 PM.
Reason: [edit]
Quote:
Originally Posted by xmrkite
How would i specifically run "gawk foo" then? I'm still rather new to gawk/awk/etc, so i'm not sure how to tell what i'm running, other than the fact that i put gawk or awk on the command line.
I had no idea that i could put gawk on the command line but actually be running something else.
Sorry I was confusing there. If you type 'gawk' into the command line then that should be what you get. And if you type 'mawk', that's what you should get.[1] I just wasn't clear on whether you'd done that or not. I wasn't sure if you were just typing 'awk', in which case it'd probably be a symlink and you'd just get whatever that pointed to.
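One way to check what the bare 'awk' actually points at is sketched below. It assumes a Linux system with GNU readlink; the variable names are just for illustration. The second half is a feature probe rather than a version check: gensub() is a gawk extension that mawk does not provide, so the probe fails under mawk.

```shell
# Where does the bare "awk" command resolve to? On Debian-style systems
# it is usually a chain of symlinks through /etc/alternatives.
awk_path=$(command -v awk)
awk_real=$(readlink -f "$awk_path")
printf 'awk -> %s\n' "$awk_real"

# Feature probe: try to call gensub(). Under mawk this is a parse error,
# so awk exits non-zero and the else branch runs.
if awk 'BEGIN { exit (gensub(/a/, "b", "g", "a") == "b") ? 0 : 1 }' 2>/dev/null; then
  has_gensub=yes
else
  has_gensub=no
fi
printf 'gensub available: %s\n' "$has_gensub"
```

On Debian and Ubuntu, 'update-alternatives --display awk' should also list the installed candidates and which one is currently selected.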
Quote:
Originally Posted by Tinkster
The issue with his XML is not the XML part as such, awk
is perfectly capable of dealing with that ... the weird
stuff is the HTML markup crammed inside it, and that it's
inconsistent ... plain text for some items, dynamic mark-up
for other bits ...
Yeah, that's true. He is having problems with disappearing output, too, but that's not an awk/xml problem as such, either, (our output is correct) but some other weird issue. So I withdraw that suggestion.
---
[1] 'gawk' or 'mawk' or anything could be a symlink in turn, but I'm leaving that aside to avoid further confusion.
ok, the file encoding was the trick. It was set to ansi.
I never thought to check that. I'm not sure how to change the encoding, so i went to a terminal, did a "touch file.xml" command, opened the ansi file, copied the contents, pasted it into the file.xml file, saved, re-ran the script from Tinkster, and it worked perfectly.
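For anyone hitting the same thing, a command-line conversion might look like the sketch below. It assumes "ansi" here means Windows-1252, which is the usual meaning on Windows but is a guess for this particular file; the file names are stand-ins. Running 'file -i file.xml' beforehand reports the charset that file guesses for the input.

```shell
# Build a small stand-in file containing a Windows-1252 byte
# (0xE9, "e" with an acute accent).
printf 'Caf\351 table\n' > ansi_sample.xml

# Convert to UTF-8, writing to a new file so the original is kept.
iconv -f WINDOWS-1252 -t UTF-8 ansi_sample.xml > utf8_sample.xml

# Inspect the bytes: 0xE9 should now be the two-byte sequence 0xC3 0xA9.
converted_hex=$(od -An -tx1 utf8_sample.xml | tr -d ' \n')
printf '%s\n' "$converted_hex"
```

If the source encoding is something else, 'iconv -l' lists the names iconv accepts for the -f option.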
It's now running on the much bigger xml file.
Thank you so much for all the help. All 3 of you had great suggestions, and of course, the code work was something that saved me hours on end had i tried to figure it out on my own. I had never used awk before, but saw this as a chance to get a start with it, and personally I learn best by examples.
I'll let you all know if i have any further problems/needs regarding this file, as it will take a few hours for the script to parse it (yeah, it's a big xml file).
Ya, we can give up on the availability if it's dynamic.
-Thanks again.
It all looks great, but the only thing i'd like to fix is that the caption does not go into one field in the csv due to the fact that there are commas in the caption. So how can i put quotes around that field so that they all stay in the same field?
Quote:
Originally Posted by xmrkite
It all looks great, but the only thing i'd like to fix is that the caption does not go into one field in the csv due to the fact that there are commas in the caption. So how can i put quotes around that field so that they all stay in the same field?
I once again don't understand ... in all the samples I've seen
the awk / sed combo produces a single line for each product.
In a shell they word-wrap, but they're still a single line.
Though it's all a single line, the text in the "Caption" column spills into multiple cells if you open the file with the OpenOffice spreadsheet. Same with MS Excel. I think this is because there are commas in the caption column, so i think the solution is to put quotes around the entire caption column so that the spreadsheet program sees that it all belongs together in one cell.
OK, I see what you mean, and that's a trivial change ... the problem is a missing
opening ", and it crept in when I stopped parsing the fields and went with digiot's
regex. All you need to do is add a \" in the printf statement for caption.
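As a rough illustration of that change: the actual printf line from the script isn't shown here, so the tab-separated input and the two-field layout below are assumptions made up for the example. The sketch wraps the caption in \" on both sides and also doubles any quote already inside it, which is the usual CSV convention that OpenOffice and Excel follow.

```shell
# Hypothetical record: a product id and a caption containing a comma
# and an embedded double quote.
row=$(printf '%s\t%s\n' '1001' 'Widget, 5" wide' |
  awk -F'\t' '{
    caption = $2
    gsub(/"/, "\"\"", caption)          # CSV convention: double any embedded quote
    printf "%s,\"%s\"\n", $1, caption   # wrap the caption in opening and closing quotes
  }')

printf '%s\n' "$row"
```

If the captions never contain double quotes, the gsub() line can be dropped and only the \" pair in the printf is needed.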