Extract certain text info from text file

slakmagik · 02-24-2008, 11:02 PM

Glad I was any help at all.

Quote:

Originally Posted by xmrkite

Digiot --thank you, but the data was all put into one single line, which would make it very hard to use given that we have a lot of products that this script is going to have to parse. --but it did clean up the output nicely, probably thanks to the sed usage.

Okay - this has nothing to do with your current problem, but just to note for general purposes: it was all on one line because I only had a single sample to work with and committed a thinko there. If it worked at all (which it basically doesn't), it would be one line per item if I'd appended a newline to the caption action: '/Caption/{ printf "%s\n", $2 }'

Also, a couple of notes - as Tinkster notes, you seem to be using mawk which is a lighter implementation of awk, but it's not like gawk is a real heavyweight compared to most other language interpreters in the first place.

I think some distros like Arch used to ship mawk and then changed their minds. Perhaps Ubuntu will too.

And, given the weirdness of the input and the fact that you want csv data, I can't believe there's not some dedicated xml2csv tool out there - or that using an xml parser in some fashion might not be more targeted. If not that, there is an 'xmlgawk' out there which is supposed to be 'gawk with xml extensions' to make this sort of job easier, but I've never used it.

And depending on your use for it, not all CSV is created equal. Some consumers want string values quoted but not numeric values, while others take it literally as just 'comma separated'.

But Tinkster's code works fine for me too (currently on a Debian system - which I added gawk to almost immediately upon installing) so hopefully it'll work for you.

xmrkite · 02-24-2008, 11:03 PM

gawk was already installed, did an apt-get install gawk and it said i already had the latest version.

It gives the exact same results as awk, and when i remove the perl bit, i still get all the same results.

Very strange. Why would the same program give you different results on your system...There's gotta be something else we're missing here...because if my awk and gawk function this differently than yours, then how could programs that rely on those two ever function correctly?

slakmagik · 02-24-2008, 11:15 PM

Huh. Just because you've installed gawk doesn't mean you're using gawk, though the 'alternatives' mess of Debianish systems should give priority to gawk. And if you've specifically tried both, then it's definitely not a mawk issue either way.

It does seem to be a gensub issue, since it emits the bare quotes and apparently no error messages, but doesn't output anything that passes through gensub. I dunno what the deal is. All I can say is make sure you've specifically run 'gawk foo', because I'm not sure from your post if you did.

xmrkite · 02-24-2008, 11:29 PM

How would i specifically run "gawk foo" then? I'm still rather new to gawk/awk/etc, so i'm not sure how to tell what i'm running, other than the fact that i put gawk or awk on the command line.

I had no idea that i could put gawk on the command line but actually be running something else. On your system, how do you verify what you're running?

The alternative is: could Tinkster's code be somehow converted or adapted to work on my ubuntu "true awk" deprived system?

-Thanks

xmrkite · 02-24-2008, 11:31 PM

Also, when i change the command to mawk instead of gawk or awk, i get this:

mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
mawk: test.awk: line 26: function gensub never defined
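
For what it's worth, that mawk error is expected: gensub() is a gawk extension, so mawk and other POSIX awks don't define it. Below is a minimal sketch of the same caption extraction using plain sub(), which should run under mawk as well. It assumes each caption sits on a single line as <Caption>text</Caption>, and the name extract_caption is made up for illustration.

```shell
# Hypothetical helper: print the text between <Caption> and the next "<",
# using only POSIX awk features (no gensub), so mawk can run it too.
extract_caption() {
  awk '/<Caption>/ {
    s = $0
    sub(/.*<Caption>/, "", s)   # drop everything up to and including the opening tag
    sub(/<.*/, "", s)           # drop the closing tag and anything after it
    print s
  }'
}
```

Something like `extract_caption < file.xml` would then print one caption per line. It doesn't do the strip() cleanup from Tinkster's script, only the gensub part.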

Tinkster · 02-24-2008, 11:41 PM

xmrkite,

that's quite bizarre ... what's your locale, and what is
the file's encoding?

[edit]

Quote:

Originally Posted by xmrkite
The alternative is: could Tinkster's code be somehow converted or adapted to work on my ubuntu "true awk" deprived system?

The fact that you didn't get gensub errors suggests
that you were using gawk in the first place. Did
something maybe go wrong in the copy & paste process
of the script?
[/edit]

Quote:

Originally Posted by digiot
And, given the weirdness of the input and the fact that you want csv data, I can't believe there's not some dedicated xml2csv tool out there - or that using an xml parser in some fashion might not be more targeted. If not that, there is an 'xmlgawk' out there which is supposed to be 'gawk with xml extensions' to make this sort of job easier, but I've never used it.

The issue with his XML is not the XML part as such, awk
is perfectly capable of dealing with that ... the weird
stuff is the HTML markup crammed inside it, and that it's
inconsistent ... plain text for some items, dynamic mark-up
for other bits ...

Cheers,
Tink

slakmagik · 02-25-2008, 12:01 AM

Quote:

Originally Posted by xmrkite

How would i specifically run "gawk foo" then? I'm still rather new to gawk/awk/etc, so i'm not sure how to tell what i'm running, other than the fact that i put gawk or awk on the command line.

I had no idea that i could put gawk on the command line but actually be running something else.

Sorry I was confusing there. If you type 'gawk' into the command line then that should be what you get. And if you type 'mawk', that's what you should get.[1] I just wasn't clear on whether you'd done that or not. I wasn't sure if you were just typing 'awk', in which case it'd probably be a symlink and you'd just get whatever that pointed to.

Quote:

Originally Posted by Tinkster

The issue with his XML is not the XML part as such, awk
is perfectly capable of dealing with that ... the weird
stuff is the HTML markup crammed inside it, and that it's
inconsistent ... plain text for some items, dynamic mark-up
for other bits ...

Yeah, that's true. He is having problems with disappearing output, too, but that's not an awk/xml problem as such either (our output is correct), just some other weird issue. So I withdraw that suggestion.

---
[1] 'gawk' or 'mawk' or anything could be a symlink in turn, but I'm leaving that aside to avoid further confusion.

chrism01 · 02-25-2008, 01:01 AM

You should be able to see what awk variants you've got by doing these:

awk --version
gawk --version
mawk --version
nawk --version

Then post the results.
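
A rough sketch to go with that: on Debian-style systems plain 'awk' is usually a symlink managed by the alternatives system, so resolving it shows which implementation it really runs. Also, older mawk releases document '-W version' rather than '--version', so try that if mawk rejects the long option. The helper name resolve_cmd is invented for the example, and 'readlink -f' assumes GNU coreutils or something compatible.

```shell
# Hypothetical helper: print the real file behind a command name,
# following symlinks such as /usr/bin/awk -> /etc/alternatives/awk -> ...
resolve_cmd() {
  path=$(command -v "$1") || return 1   # fails if the command is not on PATH
  readlink -f "$path"
}
```

Running `resolve_cmd awk` should then show the binary that plain awk ends up at. On Debian or Ubuntu, `update-alternatives --display awk` lists the candidates and their priorities.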

xmrkite · 02-25-2008, 01:26 AM

ok, the file encoding was the trick. It was set to ansi.

I never thought to check that. I'm not sure how to change the encoding, so i went to a terminal, did a "touch file.xml" command, opened the ansi file, copied the contents, pasted it into the file.xml file, saved, re-ran the script from Tinkster, and it worked perfectly.
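
Copy and paste did the job here, but for a much bigger file iconv can do the conversion directly. This is a sketch under assumptions: 'ansi' in Windows editors usually means Windows-1252, but that is a guess about this particular file, and to_utf8 is only an illustrative name. The copy and paste may also have changed the line endings, so the tr step strips carriage returns in case that was part of the fix.

```shell
# Hypothetical helper: read Windows-1252 text on stdin and write UTF-8
# with Unix line endings on stdout. The source encoding is an assumption.
to_utf8() {
  iconv -f WINDOWS-1252 -t UTF-8 | tr -d '\r'
}
```

Usage would be along the lines of `to_utf8 < ansi-file.xml > file.xml`. On Linux, `file -i file.xml` gives a guess at the resulting charset.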

It's now running on the much bigger xml file.

Thank you so much for all the help. All 3 of you had great suggestions, and of course the code saved me the hours on end it would have taken had I tried to figure it out on my own. I had never used awk before, but saw this as a chance to get a start with it, and personally I learn best by examples.

I'll let you all know if i have any further problems/needs regarding this file, as it will take a few hours for the script to parse it (yeah, it's a big xml file).

-Thanks again.

Tinkster · 02-25-2008, 02:27 AM

Well that's good news, glad I could help. So you're happy with my
decision to give up on the availability where it's a dynamic field?

Cheers,
Tink

xmrkite · 02-25-2008, 12:03 PM

Ya, we can give up on the availability if it's dynamic.

-Thanks again.

It all looks great, but the only thing i'd like to fix is that the caption does not go into one field in the csv due to the fact that there are commas in the caption. So how can i put quotes around that field so that they all stay in the same field?

Tinkster · 02-25-2008, 04:23 PM

Quote:

Originally Posted by xmrkite

It all looks great, but the only thing i'd like to fix is that the caption does not go into one field in the csv due to the fact that there are commas in the caption. So how can i put quotes around that field so that they all stay in the same field?

I once again don't understand ... in all the samples I've seen
the awk / sed combo produces a single line for each product.

In a shell they word-wrap, but they're still a single line.

Cheers,
Tink

xmrkite · 02-25-2008, 04:37 PM

Though it's all a single line, the text in the "Caption" column opens up into multiple cells if you open the file with the OpenOffice spreadsheet. Same with MS Excel. I think this is because there are commas in the caption column, so I think the solution is to put quotes around the entire caption column so that the spreadsheet program sees that those belong all together in one cell.

Tinkster · 02-25-2008, 05:04 PM

OK, I see what you mean, and that's a trivial change ... the problem is a missing
opening ", and it crept in when I stopped parsing the fields but went with digiot's
regex. All you need to do is add a \" in the printf statement for caption.

Code:

/<Caption>/ {
  printf( "%s, ", strip( gensub( /.+Caption>([^<]+).*/, "\\1","g")))
  printf "\"\n"
}

can become

Code:

/<Caption>/ {
  printf( "\"%s\"\n", strip( gensub( /.+Caption>([^<]+).*/, "\\1","g")))
}

Sorry for that - must have been getting tired or something ;}
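
One caveat, sketched under assumptions: wrapping the caption in quotes is enough only while no caption itself contains a double quote. The usual CSV convention, which spreadsheet programs generally follow, is to double any embedded quote. csv_quote is a made-up helper name, and it sticks to POSIX awk so it doesn't depend on gawk.

```shell
# Hypothetical helper: turn each input line into one quoted CSV field,
# doubling embedded double quotes so commas and quotes stay in one cell.
csv_quote() {
  awk '{
    gsub(/"/, "\"\"")          # double each embedded quote character
    printf "\"%s\"\n", $0      # wrap the whole line in quotes
  }'
}
```

Inside the script itself, the same gsub() could be applied to the caption string before the printf. Whether that's needed depends on whether any captions in the real file contain quotes.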

Cheers,
Tink

xmrkite · 02-26-2008, 10:38 AM

Tink, thanks for the code, that worked great. I'm starting to understand it a little better now.