Editing Large Text Files.

truculentknight · 11-07-2009, 01:32 PM

hey everyone,

I'm having problems trying to edit a large text file. The file is the output of software that analyses data.

The data is based on products. Here's an example:

<item>

<title> Callaway Golf Mens RH X-Forged Chrome Approach Wedge (50 Degrees-12 Degrees Bounce) S300 (Stiff) Flex </title>

<category>Sports</category>

<pubDate>Sat, 06 Feb 2010 08:05:13 GMT</pubDate>

<link></link>

<description>

<a href=""><b>Callaway Golf Mens RH X-Forged Chrome Approach Wedge (50 Degrees-12 Degrees Bounce) S300 (Stiff) Flex</b></a><br>

<table>

<tr>

<td><a href=""><img align="left" src="http://shop.callawaygolf.com/images/products/wedges/2008/x-forged-chrome/1.jpg"> </a>

Legendary clubmaker Roger Cleveland raised the bar once again with the new X-Forged Wedges. Designed with input from Tour players, they are constructed from soft 1020 carbon steel for incredible feel. The clubs also feature a tighter heel-toe radius that provides increased versatility from anywhere around the green. </td>

</tr>

<tr>

<td>

Price: $109.00 <a href="">Buy/More Info</a>

</td>

</tr>

</table>

</description>

</item>

I would love to learn how to separate out the important information. I want to determine how many categories there are. I tried using grep, but I couldn't get it to work.

Couldn't grep be used with a wildcard to identify all of the categories within this large file? Something like "<category>*</category>"?

I would also like to identify products that cost less than $100. How can both of these things be done?

Thanks!!

gerryd · 11-07-2009, 02:22 PM

grep displays the lines matching the pattern you indicate. For it to work here, each category would have to be on a single line.
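
Since every <category> tag in the sample above does sit on a single line, something along these lines might do for listing and counting the categories. It is only a sketch: the function names list_categories and count_categories are made up for illustration, and grep -o is a GNU/BSD extension rather than POSIX.

```shell
#!/bin/sh
# list_categories FILE: print each distinct category once, sorted.
# (Hypothetical helper name; assumes open and close tag share one line.)
list_categories() {
    # -o prints only the matched text, one match per output line,
    # so several tags on the same line are still found.
    grep -o '<category>[^<]*</category>' "$1" |
        sed -e 's/<category>//' -e 's/<\/category>//' |
        sort -u
}

# count_categories FILE: print how many distinct categories there are.
count_categories() {
    # tr strips the padding some wc implementations add around the number.
    list_categories "$1" | wc -l | tr -d ' '
}
```

Called as count_categories products.xml (products.xml being a placeholder file name), this should print a single number. A tag split across two lines would be missed, which is one reason a real XML parser is safer.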

markush · 11-07-2009, 02:58 PM

Hello truculentknight and welcome to LQ,

in this case egrep should work for you:

Code:

egrep '\<category\>' *

will print every line containing the category tag. (In grep, \< and \> are word-boundary anchors, so this matches the word "category" rather than the literal angle brackets; closing </category> tags match too.)

Otherwise, this looks strongly like an XML file. I'd suggest using a scripting language like Perl, which comes with packages for scanning XML files (http://xml.coverpages.org/perl-xml-faq11.html). This will help if you have to do something more elaborate than simply finding lines in such a file.

Markus

ghostdog74 · 11-07-2009, 05:45 PM

You are using the wrong tool for the job. Ideally you should be using an HTML or XML parser. But if you want to do it hardcore, use gawk:

Code:

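awk -v RS="</item>" '
# Each record is one <item> block (a multi-character RS is not POSIX; gawk supports it).
/Price:/ {
 gsub(/.*Price:?/,"")   # drop everything up to and including "Price:"
 gsub(/<.*/,"")         # drop everything from the next tag onwards
 print
}
' file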

output

Code:

# ./shell.sh
 $109.00

See here or here for similar examples
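
For the other half of the question (products under $100), the same one-record-per-item trick can be stretched a little. This is a rough sketch, assuming every item carries a "Price: $..." line like the sample and that your awk accepts a multi-character RS (gawk and mawk do; POSIX leaves it unspecified). The name cheap_items is made up for illustration.

```shell
#!/bin/sh
# cheap_items FILE LIMIT: print "price<TAB>title" for each item priced below LIMIT.
# (Hypothetical helper; relies on the "Price: $NNN.NN" layout from the sample.)
cheap_items() {
    awk -v RS='</item>' -v limit="$2" '
    match($0, /Price: *\$[0-9,]+(\.[0-9]+)?/) {
        price = substr($0, RSTART, RLENGTH)
        gsub(/[^0-9.]/, "", price)      # keep only digits and the decimal point
        title = "(no title)"
        if (match($0, /<title>[^<]*<\/title>/)) {
            # skip the 7 characters of <title> and the 8 of </title>
            title = substr($0, RSTART + 7, RLENGTH - 15)
            gsub(/^[ \t\n]+|[ \t\n]+$/, "", title)   # trim surrounding whitespace
        }
        if (price + 0 < limit + 0) print price "\t" title
    }' "$1"
}
```

Usage would be something like cheap_items products.xml 100, where products.xml is a placeholder file name. Items without a recognisable price line are silently skipped.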

truculentknight · 11-08-2009, 08:54 PM

Wow! You guys are really helpful. By any chance, could someone help me figure out how to write a Perl script that will help me with these large data files?

I need a perl script that can determine...

1. How many different categories there are; I need a number. It should also display all the different categories on the terminal, without displaying any category more than once.
2. Extract all of the categories I specify, along with the products associated with those categories (all of the XML), into a separate file.

I'm seriously not a programmer. I've used Linux for years, but I still can't program. I don't think this script would be that hard. Can someone help me with it?
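
Point 2 can be sketched without Perl by reusing the RS="</item>" idea from the awk answer earlier in the thread. This is a hedged illustration, not a tested solution for the real data: extract_category is a made-up name, the category must match the tag text exactly, and a multi-character RS needs gawk or mawk.

```shell
#!/bin/sh
# extract_category FILE CATEGORY: print every complete <item>...</item> block
# whose <category> text equals CATEGORY. (Hypothetical helper name.)
extract_category() {
    awk -v RS='</item>' -v cat="$2" '
    index($0, "<category>" cat "</category>") {
        sub(/^[ \t\n]+/, "")         # drop the blank space left between items
        printf "%s</item>\n", $0     # RS swallowed the closing tag, so put it back
    }' "$1"
}
```

For example, extract_category products.xml Sports > sports.xml would write the matching items to sports.xml (both file names are placeholders).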

Thanks.

ghostdog74 · 11-08-2009, 09:27 PM

Nobody is a programmer at first. All we ever did was read the docs and practice! If you want to program in Perl, read the docs and start to learn it. See my sig for a Perl doc link.

chrism01 · 11-08-2009, 10:46 PM

As per ghostdog, you're best off learning how to program, or you'll be forever asking questions and unable to make the most of the answers, which may take a long time to arrive.
Start with the Perl docs as per his link, then look at search.cpan.org.
Search on XML. XML::Parser http://search.cpan.org/~msergeant/XM...2.36/Parser.pm is comprehensive, but probably overkill. Try XML::Simple http://search.cpan.org/~grantm/XML-S.../XML/Simple.pm or XML::Twig http://search.cpan.org/~mirod/XML-Twig-3.32/Twig.pm

truculentknight · 11-08-2009, 11:24 PM

hey!

I would totally love to do that. Unfortunately, I need the Perl script to run my business, and I don't have time to learn Perl... haha

Instead I will just hire someone to make the script for me. Thanks anyway!