LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.

Old 02-24-2008, 11:02 PM   #16
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Rep: Reputation: Disabled

Glad I was any help at all.

Quote:
Originally Posted by xmrkite
Digiot --thank you, but the data was all put into one single line, which would make it very hard to use given that we have a lot of products that this script is going to have to parse. --but it did clean up the output nicely, probably thanks to the sed usage.
Okay - this has nothing to do with your current problem, but just to note for general purposes: it was all on one line because I only had a single sample to work with and committed a thinko there. If it worked at all (which it basically doesn't), it would be one line per item if I'd appended a newline to the caption action: '/Caption/{ printf "%s\n", $2 }'

Also, a couple of notes - as Tinkster notes, you seem to be using mawk which is a lighter implementation of awk, but it's not like gawk is a real heavyweight compared to most other language interpreters in the first place. I think some distros like Arch used to ship mawk and then changed their minds. Perhaps Ubuntu will too.

And, given the weirdness of the input and the fact that you want csv data, I can't believe there's not some dedicated xml2csv tool out there - or that using an xml parser in some fashion might not be more targeted. If not that, there is an 'xmlgawk' out there which is supposed to be 'gawk with xml extensions' to make this sort of job easier, but I've never used it.

And depending on your use for it, not all CSV is created equal. Some will want string values quoted and not numeric values, while some is literally just 'comma separated'.

But Tinkster's code works fine for me too (currently on a Debian system - which I added gawk to almost immediately upon installing) so hopefully it'll work for you.
 
Old 02-24-2008, 11:03 PM   #17
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 554

Original Poster
Rep: Reputation: 30
gawk was already installed, did an apt-get install gawk and it said i already had the latest version.

It gives the exact same results as awk, and when i remove the perl bit, i still get all the same results.

Very strange. Why would the same program give you different results on your system...There's gotta be something else we're missing here...because if my awk and gawk function this differently than yours, then how could programs that rely on those two ever function correctly?
 
Old 02-24-2008, 11:15 PM   #18
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Rep: Reputation: Disabled
Huh. Just because you've installed gawk doesn't mean you're using gawk but the 'alternatives' mess of Debianish systems should give priority to gawk. Or if you've specifically tried both, then it's definitely not a mawk issue, either way.

It does seem to be a gensub issue, since it emits the bare quotes and apparently no error messages, but doesn't output anything that passes through gensub. I dunno what the deal is. All I can say is make sure you've specifically run 'gawk foo', because I'm not sure from your post if you did.
 
Old 02-24-2008, 11:29 PM   #19
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 554

Original Poster
Rep: Reputation: 30
How would i specifically run "gawk foo" then? I'm still rather new to gawk/awk/etc, so i'm not sure how to tell what i'm running, other than the fact that i put gawk or awk on the command line.

I had no idea that i could put gawk on the command line but actually be running something else. On your system, how do you verify what you're running?

The alternative is: could Tinkster's code be somehow converted or adapted to work on my ubuntu "true awk" deprived system?

-Thanks
 
Old 02-24-2008, 11:31 PM   #20
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 554

Original Poster
Rep: Reputation: 30
Also, when i change the command to mawk instead of gawk or awk, i get this:

mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
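
For reference, gensub() is a gawk extension, which is why mawk rejects it. Below is a minimal sketch of the same caption extraction using only sub(), which POSIX awk provides. The `extract_caption` name and the `<Product>` wrapper in the sample line are made up for illustration, and the sketch does not reproduce the strip() helper from Tinkster's script.

```sh
#!/bin/sh
# Extract the text between <Caption> and the next tag using only POSIX awk
# (no gensub), so it should behave the same under gawk and mawk.
# extract_caption is a hypothetical helper name, not part of the thread's script.
extract_caption() {
    awk '/<Caption>/ {
        s = $0
        sub(/.*<Caption>/, "", s)   # remove everything through the opening tag
        sub(/<.*/, "", s)           # remove the closing tag and anything after it
        print s
    }'
}

# Sample input line (assumed shape, not taken from the real data file).
printf '%s\n' '<Product><Caption>Red widget, large</Caption></Product>' | extract_caption
```

Working on a copy (`s`) leaves `$0` untouched, so later rules in the same script still see the original line.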
 
Old 02-24-2008, 11:41 PM   #21
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
xmrkite,

that's quite bizarre ... what's your locale, and what is
the file's encoding?
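
One rough way to answer both questions from a shell is sketched below. It assumes an `iconv` binary is installed; `is_utf8` is a made-up helper name and the sample bytes are invented. The idea is that `iconv` exits non-zero when its input is not valid in the source encoding, so converting UTF-8 to UTF-8 acts as a validity check.

```sh
#!/bin/sh
# Succeeds only if the file decodes cleanly as UTF-8.
# is_utf8 is a hypothetical helper name used for illustration.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Character encoding of the current locale (skipped if 'locale' is missing).
command -v locale >/dev/null 2>&1 && locale charmap

# Demo file: byte 0xE9 is an accented "e" in Latin-1/Windows-1252,
# but on its own it is not a valid UTF-8 sequence.
tmp=$(mktemp)
printf 'caf\351\n' > "$tmp"
if is_utf8 "$tmp"; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
rm -f "$tmp"
```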

[edit]
Quote:
Originally posted by xmrkite
The alternative is: could Tinkster's code be somehow converted or adapted to work on my ubuntu "true awk" deprived system?
The fact that you didn't get gensub errors suggests
that you were using gawk in the first place. Did
something maybe go wrong in the copy & paste process
of the script?
[/edit]

Quote:
Originally posted by digiot
And, given the weirdness of the input and the fact that you want csv data, I can't believe there's not some dedicated xml2csv tool out there - or that using an xml parser in some fashion might not be more targeted. If not that, there is an 'xmlgawk' out there which is supposed to be 'gawk with xml extensions' to make this sort of job easier, but I've never used it.
The issue with his XML is not the XML part as such, awk
is perfectly capable of dealing with that ... the weird
stuff is the HTML markup crammed inside it, and that it's
inconsistent ... plain text for some items, dynamic mark-up
for other bits ...


Cheers,
Tink

Last edited by Tinkster; 02-24-2008 at 11:44 PM. Reason: [edit]
 
Old 02-25-2008, 12:01 AM   #22
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Rep: Reputation: Disabled
Quote:
Originally Posted by xmrkite
How would i specifically run "gawk foo" then? I'm still rather new to gawk/awk/etc, so i'm not sure how to tell what i'm running, other than the fact that i put gawk or awk on the command line.

I had no idea that i could put gawk on the command line but actually be running something else.
Sorry I was confusing there. If you type 'gawk' into the command line then that should be what you get. And if you type 'mawk', that's what you should get.[1] I just wasn't clear on whether you'd done that or not. I wasn't sure if you were just typing 'awk', in which case it'd probably be a symlink and you'd just get whatever that pointed to.

Quote:
Originally Posted by Tinkster
The issue with his XML is not the XML part as such, awk
is perfectly capable of dealing with that ... the weird
stuff is the HTML markup crammed inside it, and that it's
inconsistent ... plain text for some items, dynamic mark-up
for other bits ...
Yeah, that's true. He is having problems with disappearing output, too, but that's not an awk/xml problem as such, either, (our output is correct) but some other weird issue. So I withdraw that suggestion.

---
[1] 'gawk' or 'mawk' or anything could be a symlink in turn, but I'm leaving that aside to avoid further confusion.
 
Old 02-25-2008, 01:01 AM   #23
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
You should be able to see what awk variants you've got by doing these:

awk --version
gawk --version
mawk --version
nawk --version

and post the results
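
Since not every awk accepts --version, here is a sketch that avoids it: it reports where each name resolves to (following symlinks such as the Debian alternatives links) and probes for gensub() directly. `probe_awk` is a made-up helper name, and `readlink -f` is assumed to be available (it is in GNU coreutils).

```sh
#!/bin/sh
# Report where an awk variant resolves to and whether it provides gensub().
# probe_awk is a hypothetical helper name used for illustration.
probe_awk() {
    name=$1
    path=$(command -v "$name") || { printf '%s: not installed\n' "$name"; return 0; }
    real=$(readlink -f "$path")     # follow symlinks to the real binary
    # A BEGIN-only program reads no input; awks without gensub() fail here.
    if "$name" 'BEGIN { exit (gensub(/a/, "b", "g", "aaa") == "bbb") ? 0 : 1 }' 2>/dev/null; then
        printf '%s -> %s (gensub: yes)\n' "$name" "$real"
    else
        printf '%s -> %s (gensub: no)\n' "$name" "$real"
    fi
}

for a in awk gawk mawk nawk; do
    probe_awk "$a"
done
```

If plain `awk` reports "gensub: no" while `gawk` reports "yes", the script needs to be run with `gawk -f` explicitly.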
 
Old 02-25-2008, 01:26 AM   #24
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 554

Original Poster
Rep: Reputation: 30
ok, the file encoding was the trick. It was set to ansi.

I never thought to check that. I'm not sure how to change the encoding, so i went to a terminal, did a "touch file.xml" command, opened the ansi file, copied the contents, pasted it into the file.xml file, saved, re-ran the script from Tinkster, and it worked perfectly.
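
For anyone who would rather convert than copy and paste, a sketch with `iconv` follows. It assumes "ansi" here means Windows-1252, which is what Windows editors usually mean by it but is not confirmed in this thread; `to_utf8` and the sample content are made up for illustration.

```sh
#!/bin/sh
# Convert a Windows-1252 ("ANSI") file to UTF-8, writing to a new file.
# to_utf8 is a hypothetical helper name; the source encoding is an assumption.
to_utf8() {
    iconv -f WINDOWS-1252 -t UTF-8 "$1" > "$2"
}

# Demo with an invented line containing byte 0xE9 (accented "e" in Windows-1252).
src=$(mktemp)
dst=$(mktemp)
printf '<Caption>Caf\351 table</Caption>\n' > "$src"
to_utf8 "$src" "$dst" && cat "$dst"
rm -f "$src" "$dst"
```

Writing to a separate output file matters: redirecting onto the input file would truncate it before `iconv` reads it. Running the awk script under `LC_ALL=C` might also sidestep the problem, since awk then treats the input as plain bytes, but that was not tried in this thread.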

It's now running on the much bigger xml file.

Thank you so much for all the help. All 3 of you had great suggestions, and of course, the code work was something that saved me hours on end had i tried to figure it out on my own. I had never used awk before, but saw this as a chance to get a start with it, and personally I learn best by examples.

I'll let you all know if i have any further problems/needs regarding this file, as it will take a few hours for the script to parse it (yeah, it's a big xml file).

-Thanks again.
 
Old 02-25-2008, 02:27 AM   #25
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Well that's good news, glad I could help. So you're happy with my
decision to give up on the availability where it's a dynamic field?



Cheers,
Tink
 
Old 02-25-2008, 12:03 PM   #26
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 554

Original Poster
Rep: Reputation: 30
Ya, we can give up on the availability if it's dynamic.

-Thanks again.

It all looks great, but the only thing i'd like to fix is that the caption does not go into one field in the csv due to the fact that there are commas in the caption. So how can i put quotes around that field so that they all stay in the same field?
 
Old 02-25-2008, 04:23 PM   #27
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by xmrkite
It all looks great, but the only thing i'd like to fix is that the caption does not go into one field in the csv due to the fact that there are commas in the caption. So how can i put quotes around that field so that they all stay in the same field?
I once again don't understand ... in all the samples I've seen
the awk / sed combo produces a single line for each product.

In a shell they word-wrap, but they're still a single line.



Cheers,
Tink
 
Old 02-25-2008, 04:37 PM   #28
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 554

Original Poster
Rep: Reputation: 30
Though it's all a single line, the text in the "Caption" column opens up into multiple cells if you open the file with open office spreadsheet. Same with MS Excel. I think this is because there are commas in the caption column, so i think the solution is to put quotes around the entire caption column so that the spreadsheet program sees that those belong all together in one cell.
 
Old 02-25-2008, 05:04 PM   #29
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
OK, I see what you mean, and that's a trivial change ... the problem is a missing
opening ", and it crept in when I stopped parsing the fields but went with digiot's
regex. All you need to do is add a \" in the printf statement for caption.
Code:
/<Caption>/ {
  printf( "%s, ", strip( gensub( /.+Caption>([^<]+).*/, "\\1","g")))
  printf "\"\n"
}
can become
Code:
/<Caption>/ {
  printf( "\"%s\"\n", strip( gensub( /.+Caption>([^<]+).*/, "\\1","g")))
}
Sorry for that - must have been getting tired or something ;}
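
One caveat with plain wrapping: if a caption itself contains a double quote, the field breaks again. A common CSV convention (the one described in RFC 4180) is to double any embedded quote. Below is a minimal sketch of that, separate from the script above; `csv_field` and `csv_quote` are made-up names and the sample caption is invented.

```sh
#!/bin/sh
# Quote each input line as one CSV field: wrap it in double quotes and
# double any embedded double quotes. Uses only POSIX awk features.
# csv_field and csv_quote are hypothetical names used for illustration.
csv_field() {
    awk '
    function csv_quote(s) {
        gsub(/"/, "\"\"", s)        # each embedded quote becomes two quotes
        return "\"" s "\""          # then wrap the whole value in quotes
    }
    { print csv_quote($0) }'
}

# Invented caption with both a comma and an embedded quote.
printf '%s\n' 'Widget, 5" wide, red' | csv_field
```

The `csv_quote` function could be dropped into the existing awk script and called around the stripped caption in place of the hand-written \" characters.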


Cheers,
Tink
 
Old 02-26-2008, 10:38 AM   #30
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 554

Original Poster
Rep: Reputation: 30
Tink, thanks for the code, that worked great. I'm starting to understand it a little better now.
 
  

