LinuxQuestions.org
Old 02-22-2008, 05:07 PM   #1
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 542

Rep: Reputation: 30
Extract certain text info from text file


Hello, I have a large text file that has certain bits of info. Not every line has info I need, but on the lines that do, I want to extract all of it.

Here's the idea:

The text file contains these lines:
Code:
Item 1
age=(27 days)
random text random text random ID=(2701)
random line that i don't want any text from
random line that i don't want any text from
Item 2
age=(2 days)
random line that i don't want any text from
random line that i don't want any text from
random text random text random ID=(2708 zet)
random line that i don't want any text from
random line that i don't want any text from

So what I want to get out of that file is a CSV that looks like this:

Code:
Item 1, 27 days, 2701
Item 2, 2 days, 2708 zet

The problem is that some of the items look something like this:

Code:
Item 3
age=(3 days)
random text random text random ID=(333) ID=(445 zt) ID=(dft 435 988)

So the line in the csv for that one should be

Code:
Item 3, 3 days, "333, 445 zt, dft 435 988"
I put the random text part in there because the lines with the ID on them begin with different text each time, whereas the Item and age lines always start with Item and age.

Any ideas on how to clean this up?

-Thanks
 
Old 02-22-2008, 06:41 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,986
Blog Entries: 11

Rep: Reputation: 880
A bit of awk ... seems to work with your sample data :}

Code:
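# needs gawk: gensub() is a GNU extension
# Item line: start the CSV row with the line as-is
/^Item/ {
  printf ("%s, ",$0)
}
# age line: keep only the text between the parentheses
/age=/ {
  printf( "%s, ", gensub( /age=\((.+)\)/, "\\1","g"))
}
# ID line: split on "ID=" and print every parenthesised chunk
/ID=/ {
  count=split($0, a, /ID=/);
  printf "\""
  for(i=1;i<=count;i++){
    comma=""
    if( i > 1 && i < count){ comma=","}
    if ( a[i] ~ /[\(\)]/ ) {
      printf ( "%s%s ", gensub(/[\(\)]/, "", "G", strip(a[i])), comma)
    }
  }
  printf "\"\n"
}
# trim leading blanks, then trailing blanks from that result
function strip(string,    value){
  value=gensub( /^ +/, "","1",string)
  value=gensub( / +$/, "","1",value)
  return value
}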
Code:
$ cat split
Item 1
age=(27 days)
random text random text random ID=(2701)
random line that i don't want any text from
random line that i don't want any text from
Item 2
age=(2 days)
random line that i don't want any text from
random line that i don't want any text from
random text random text random ID=(2708 zet)
random line that i don't want any text from
random line that i don't want any text from
Item 3
age=(3 days)
random text random text random ID=(333) ID=(445 zt) ID=(dft 435 988)

$ awk -f awk_script split 
Item 1, 27 days, "2701 "
Item 2, 2 days, "2708 zet "
Item 3, 3 days, "333, 445 zt, dft 435 988 "


Cheers,
Tink
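
For anyone without gawk (the script above relies on gensub(), which is a GNU extension), here is a minimal sketch of the same idea in plain POSIX awk. It buffers each item and only quotes the ID list when there is more than one ID, which is the CSV layout asked for in the first post. The function name items_to_csv is made up for illustration, and the sketch assumes every record starts with a line beginning with "Item".

```shell
# Hypothetical helper: reads the Item/age/ID text from files or stdin
# and writes one CSV-style line per item.
items_to_csv() {
    awk '
    # Print the buffered record; quote the ID list only if it holds several IDs.
    function emit() {
        if (item == "") return
        if (n > 1) ids = "\"" ids "\""
        print item ", " age ", " ids
        item = ""
    }
    # A new Item line closes the previous record and starts a fresh one.
    /^Item/ { emit(); item = $0; age = ""; ids = ""; n = 0; next }
    # Keep only the text between the parentheses of age=(...).
    /^age=\(/ {
        age = $0
        sub(/^age=\(/, "", age)
        sub(/\).*$/, "", age)
        next
    }
    # Peel off one ID=(...) group at a time, whatever text precedes it.
    /ID=\(/ {
        rest = $0
        while (match(rest, /ID=\([^)]*\)/)) {
            id = substr(rest, RSTART + 4, RLENGTH - 5)
            if (n > 0) ids = ids ", "
            ids = ids id
            n++
            rest = substr(rest, RSTART + RLENGTH)
        }
    }
    END { emit() }
    ' "$@"
}
```

Usage would be something like `items_to_csv example.txt > items.csv` (file names are placeholders). A script saved to a file runs the way the transcript above shows: `awk -f awk_script split`.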
 
Old 02-23-2008, 11:19 AM   #3
xmrkite
Original Poster
So how do I run this then? If my filename is example.txt, what command do I run, assuming that I put your code into the file abc.txt?

I'm just not sure how to apply what you gave me.

-thanks
 
Old 02-23-2008, 11:46 AM   #4
xmrkite
Original Poster
OK, hold off on the response to that one, I think I got it. I'll post back when I'm successful. Thank you.
 
Old 02-23-2008, 12:06 PM   #5
xmrkite
Original Poster
OK, I admit, I'm a total noob when it comes to awk. Never used it before. I thought I could take your example and modify it to suit my real file, but I am just lost. Below is part of the actual file I'm trying to get my info from:

Code:
    <Product Id="02k6555">
      <Code>02K6555</Code>
      <Description>IBM Thinkpad Item</Description>
      <Url>http://www.company.com/02k6555.html</Url>
      <Pricing>
        <BasePrice>12.95</BasePrice>
        <LocalizedBasePrice>9.95</LocalizedBasePrice>
        <OrigPrice>16.42</OrigPrice>
        <LocalizedOrigPrice>16.42</LocalizedOrigPrice>
        <SalePrice>11.95</SalePrice>
        <LocalizedSalePrice>11.95</LocalizedSalePrice>
      </Pricing>
      <Availability>Usually ships the same business day.</Availability>
      <Caption>&gt;ThinkPad 240&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;&gt;ThinkPad 240X 1223&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;&gt;ThinkPad 240Z&lt;/a&gt;&lt;/font&gt;&lt;/p&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</Caption>
    </Product>
It's actually a bunch of XML from an old archive file we have. I need the text above to produce this output:

02K6555, IBM Thinkpad Item, 11.95, Usually ships the same business day., ThinkPad 240, ThinkPad 240X 1223, ThinkPad 240Z

Thanks for any help you can provide.
 
Old 02-23-2008, 12:55 PM   #6
Tinkster
OK ... [edit]ooops ... you did. I just didn't scroll to the right
I'll have another look at this later.[/edit]


Code:
/<Code>/ {
  printf( "%s, ", strip( gensub( /.+Code>([^<]+).*/, "\\1","g")))
}
/<Description>/ {
  printf( "%s, ", strip( gensub( /.+Description>([^<]+).*/, "\\1","g")))
}
/<SalePrice>/ {
  printf( "%s, ", strip( gensub( /.+SalePrice>([^<]+).*/, "\\1","g")))
}
/<Availability>/ {
  printf( "%s, ", strip( gensub( /.+Availability>([^<]+).*/, "\\1","g")))
}
/<Caption>/ {
  printf( "%s\n", strip( gensub( /.+Caption>[^;]+;([^&]+)&.*/, "\\1","g")))
}
# trim leading blanks, then trailing blanks from that result
function strip(string,    value){
  value=gensub( /^ +/, "","1",string)
  value=gensub( / +$/, "","1",value)
  return value
}



Cheers,
Tink

Last edited by Tinkster; 02-23-2008 at 02:01 PM. Reason: [edit]
 
Old 02-23-2008, 06:23 PM   #7
Tinkster
Ok, caption revisited ... and man this is ugly :D
because of all the mark-up in the payload of the
XML bit for the Caption.

It mostly works, I think, but you do get an extra comma
after the last item (you can get rid of that afterwards
separately if it bothers you).
Code:
/<Code>/ {
  printf( "%s, ", strip( gensub( /.+Code>([^<]+).*/, "\\1","g")))
}
/<Description>/ {
  printf( "%s, ", strip( gensub( /.+Description>([^<]+).*/, "\\1","g")))
}
/<SalePrice>/ {
  printf( "%s, ", strip( gensub( /.+SalePrice>([^<]+).*/, "\\1","g")))
}
/<Availability>/ {
  printf( "%s, ", strip( gensub( /.+Availability>([^<]+).*/, "\\1","g")))
}
/<Caption>/ {
  count=split($0, a, /[;,]/);
  printf "\""
  for(i=1;i<=count;i++){
    a[i]=strip( a[i] )
    if ( a[i] !~ /(<|\/|=)/ && a[i] ~ /[^\w]+/ && a[i] != ", &lt" && a[i] != "&lt" && a[i] != "&gt" && a[i] != " &lt" ) {
      printf ( "%s, ", gensub(/([^&]+).*/, "\\1", "G", a[i]))
    }
  }
  printf "\"\n"
}
# trim leading blanks, then trailing blanks from that result
function strip(string,    value){
  value=gensub( /^ +/, "","1",string)
  value=gensub( / +$/, "","1",value)
  return value
}
Code:
$ cat split2
    <Product Id="02k6555">
      <Code>02K6555</Code>
      <Description>IBM Thinkpad Item</Description>
      <Url>http://www.company.com/02k6555.html</Url>
      <Pricing>
        <BasePrice>12.95</BasePrice>
        <LocalizedBasePrice>9.95</LocalizedBasePrice>
        <OrigPrice>16.42</OrigPrice>
        <LocalizedOrigPrice>16.42</LocalizedOrigPrice>
        <SalePrice>11.95</SalePrice>
        <LocalizedSalePrice>11.95</LocalizedSalePrice>
      </Pricing>
      <Availability>Usually ships the same business day.</Availability>
      <Caption>&gt;ThinkPad 240&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;&gt;ThinkPad 240X 1223&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;&gt;ThinkPad 240Z&lt;/a&gt;&lt;/font&gt;&lt;/p&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</Caption>
    </Product>
Code:
$ awk -f awk_script split2      
02K6555, IBM Thinkpad Item, 11.95, Usually ships the same business day., "ThinkPad 240, ThinkPad 240X 1223, ThinkPad 240Z, "
Cheers,
Tink
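
The leftover comma mentioned above can be removed in a separate pass. A minimal sketch with sed, assuming every affected line ends in a comma, a space and the closing quote, exactly as in the example output (the function name strip_trailing_comma is hypothetical):

```shell
# Hypothetical post-processing step: turn a trailing `, "` into `"` on each line.
strip_trailing_comma() {
    sed 's/, "$/"/' "$@"
}
```

It could be chained onto the awk call, along the lines of `awk -f awk_script split2 | strip_trailing_comma`. Lines that do not end that way pass through untouched.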

Last edited by Tinkster; 02-23-2008 at 06:26 PM. Reason: added example output
 
Old 02-24-2008, 11:21 AM   #8
xmrkite
Original Poster
OK, that works great. Thank you very much.

The only problem is that for the "products" after that one, your script doesn't work. I think it's because each product has some different things going on, although they all have the same fields I'm trying to get. The main one that does not work is the caption field.

I'm not much of an awk person, but do you know of any good resources for me to read up on how to really use it? I wouldn't be able to post the entire file here as it's too big, so I just need to understand what is going on in your script so that I can modify it as needed.

-Thanks
 
Old 02-24-2008, 11:51 AM   #9
Tinkster
There's the man and info pages, there's the manual (can be viewed
online e.g. at http://www.gnu.org/manual/gawk/html_node/index.html
if your distro didn't install it). The awk newsgroup is a pretty
good source of info, too.

But why don't you just post one or two of the records that didn't
get parsed, or, even better, ask about which part of the script
you didn't understand.



Cheers,
Tink
 
Old 02-24-2008, 12:26 PM   #10
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Rep: Reputation: Disabled
Code:
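#!/bin/bash
# With tags as the field separator, $2 is the text between the opening
# and closing tag. The fixed "%s" format keeps a "%" in the data from
# being read as a printf directive.
awk -v FS='<[^>]*>' '
    /<Code>/{ printf "%s, ", $2 }
    /<Description>/{ printf "%s, ", $2 }
    /<SalePrice>/{ printf "%s, ", $2 }
    /<Availability>/{ printf "%s, ", $2 }
    /<Caption>/{ print $2 }
' "$1" | sed '
    s/&gt;//
    s/&lt;[^;]*&gt;//g'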
I had this a while ago but didn't bother posting it because it seemed pretty fragile, plus it uses sed in the mix. But since you still seem to be having problems, I'll post it on the off chance it'd actually work. I mean, it does for the sample data, but that's probably about it.

As far as documentation goes, there's http://www.gnu.org/software/gawk/manual/, which may already be on your system. I think it's also available as an info page.

-- Huh. Now that I've posted it, it dawns on me that I should have just put
Code:
#!/bin/bash
awk -v FS='<[^>]*>' '
    /<Code>|<Description>|<SalePrice>|<Availability>/{ printf "%s, ", $2 }
    # print (not printf) so that each product ends with a newline
    /Caption/{ print $2 }
' "$1" | sed '
    s/&gt;//
    s/&lt;[^;]*&gt;//g'
Anyway - re-reading the post, I also realize I should clarify that I "didn't bother" *after* Tink was helping you. I don't mean I "didn't bother" to help at all.

Last edited by slakmagik; 02-24-2008 at 12:35 PM. Reason: more concise expression, more clear expression
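
A rough sketch of how the field-separator idea above could be stretched to give one line per product: collect the five fields while reading and print them when `</Product>` comes along. The assumptions here are not from the posters: the function names (products_to_csv, clean, csv) are invented, each element is assumed to sit on a single line, escaped markup is simply thrown away rather than decoded, and fields containing a comma or a double quote get quoted. It is line-oriented pattern matching, not a real XML parser, so treat it as a starting point only.

```shell
# Hypothetical helper: one CSV-style line per <Product>, read from files or stdin.
products_to_csv() {
    awk -F'<[^>]*>' '
    # Drop escaped tags such as &lt;/a&gt;, then any leftover entity, then trim blanks.
    function clean(s) {
        gsub(/&lt;[^;]*&gt;/, "", s)
        gsub(/&[a-z]+;/, "", s)
        gsub(/^[ \t]+|[ \t]+$/, "", s)
        return s
    }
    # Quote a field when it contains a comma or a double quote.
    function csv(s) {
        if (s ~ /[",]/) {
            gsub(/"/, "\"\"", s)
            s = "\"" s "\""
        }
        return s
    }
    # With tags as the separator, $2 is the text inside the element.
    /<Code>/         { code  = clean($2) }
    /<Description>/  { desc  = clean($2) }
    /<SalePrice>/    { price = clean($2) }
    /<Availability>/ { avail = clean($2) }
    /<Caption>/      { cap   = clean($2) }
    # End of a product: write the row and reset the buffers.
    /<\/Product>/ {
        print csv(code) ", " csv(desc) ", " csv(price) ", " csv(avail) ", " csv(cap)
        code = desc = price = avail = cap = ""
    }
    ' "$@"
}
```

Called as `products_to_csv archive.xml > products.csv` (file names are placeholders), this gives one row per product instead of one long line. An element that spans several lines would only have its first line picked up, which is one reason a proper XML tool is the safer choice for anything important.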
 
Old 02-24-2008, 05:30 PM   #11
xmrkite
Original Poster
OK, here is a much larger sample of the data. I'll try reading up on the awk pages you guys posted when I have some more time. In the meantime, hopefully it's just something simple that is needed in the code in order to get this XML file into a usable CSV file (usable for our current software).

Digiot, thank you, but the data was all put onto one single line, which would make it very hard to use given that this script is going to have to parse a lot of products. It did clean up the output nicely though, probably thanks to the sed usage.

---Thanks for the posts.

Code:
<Product Id="00edu">
      <Code>00EDU</Code>
      <Description>Dell Pentium IV 1.3GHz CPU (Processor Module) - 00EDU</Description>
      <Url>http://www.testtest.com/00edu.html</Url>
      <Thumb>&lt;img border=0 width=70 height=46 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989282931&gt;</Thumb>
      <Picture>&lt;img border=0 width=721 height=470 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989284727&gt;</Picture>
      <Weight>0.5</Weight>
      <Orderable>YES</Orderable>
      <Taxable>YES</Taxable>
      <Pricing>
        <BasePrice>229.95</BasePrice>
        <LocalizedBasePrice>229.95</LocalizedBasePrice>
        <OrigPrice>379.42</OrigPrice>
        <LocalizedOrigPrice>379.42</LocalizedOrigPrice>
        <SalePrice>229.95</SalePrice>
        <LocalizedSalePrice>229.95</LocalizedSalePrice>
      </Pricing>
      <Path>
        <ProductRef Id="dellparts" Url="http://www.testtest.com/dellparts.html">Dell Parts</ProductRef>
        <ProductRef Id="dellparts-dell-desktops" Url="http://www.testtest.com/dellparts-dell-desktops.html">Dell Desktop Parts</ProductRef>
        <ProductRef Id="dell-desktop-parts-by-category" Url="http://www.testtest.com/dell-desktop-parts-by-category.html">Dell Desktop Parts by category</ProductRef>
        <ProductRef Id="dell-desktop-parts-by-category-internal-parts-and-assemblies" Url="http://www.testtest.com/dell-desktop-parts-by-category-internal-parts-and-assemblies.html">Internal Parts And Assemblies</ProductRef>
        <ProductRef Id="dell-desktop-parts-by-category-internal-parts-and-assemblies-cpu---p4-processor-boards" Url="http://www.testtest.com/dell-desktop-parts-by-category-internal-parts-and-assemblies-cpu---p4-processor-boards.html">CPU / P4 Processor Boards</ProductRef>
      </Path>
      <Availability>&lt;iframe name="I1" src="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"      border="0" frameborder="0" title="inventory" width="100%" height="19" scrolling="no" marginheight="0" marginwidth="0"&gt;  &lt;a href="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"&gt; get_inventory &lt;/a&gt; &lt;/iframe&gt;</Availability>
      <Caption>&lt;table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#C0C0C0" width="100%" height="54"&gt; &lt;tr&gt;&lt;td width="100%" height="20" bgcolor="#D2E4FF"&gt;&lt;b&gt;Notes: &lt;/b&gt;PRC, 80528, 1.3GHZ, 0K, 400FSB, SK - 1.3GHZ PENTIUM IV SOCKETED&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="100%" height="21" bgcolor="#EAEAEA"&gt;&lt;p align="left"&gt;&lt;b&gt;Reseller Discount &lt;/b&gt;on orders of 5 or more CPU / P4 Processor Boards.&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt;&lt;/table&gt;</Caption>
    </Product>
    <Product Id="00frw">
      <Code>00FRW</Code>
      <Description>Dell Pentium III 600MZH CPU (Processor Module) - 00FRW</Description>
      <Url>http://www.testtest.com/00frw.html</Url>
      <Thumb>&lt;img border=0 width=70 height=46 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989338728&gt;</Thumb>
      <Picture>&lt;img border=0 width=721 height=470 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989341162&gt;</Picture>
      <Weight>0.5</Weight>
      <Orderable>YES</Orderable>
      <Taxable>YES</Taxable>
      <Pricing>
        <BasePrice>69.95</BasePrice>
        <LocalizedBasePrice>69.95</LocalizedBasePrice>
        <OrigPrice>115.42</OrigPrice>
        <LocalizedOrigPrice>115.42</LocalizedOrigPrice>
        <SalePrice>69.95</SalePrice>
        <LocalizedSalePrice>69.95</LocalizedSalePrice>
      </Pricing>
      <Path>
        <ProductRef Id="dellparts" Url="http://www.testtest.com/dellparts.html">Dell Parts</ProductRef>
        <ProductRef Id="dellparts-dell-laptops" Url="http://www.testtest.com/dellparts-dell-laptops.html">Dell Laptop Parts</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category" Url="http://www.testtest.com/dell-laptop-parts-by-category.html">Dell Laptop Parts by category</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category-internal-parts-and-assemblies" Url="http://www.testtest.com/dell-laptop-parts-by-category-internal-parts-and-assemblies.html">Internal Parts And Assemblies</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category-internal-parts-and-assemblies-cpu---p3-processor-boards" Url="http://www.testtest.com/dell-laptop-parts-by-category-internal-parts-and-assemblies-cpu---p3-processor-boards.html">CPU / P3 Processor Boards</ProductRef>
      </Path>
      <Availability>&lt;iframe name="I1" src="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"      border="0" frameborder="0" title="inventory" width="100%" height="19" scrolling="no" marginheight="0" marginwidth="0"&gt;  &lt;a href="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"&gt; get_inventory &lt;/a&gt; &lt;/iframe&gt;</Availability>
      <Caption>&lt;table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#C0C0C0" width="100%" height="54"&gt; &lt;tr&gt;&lt;td width="100%" height="20" bgcolor="#D2E4FF"&gt;&lt;b&gt;Notes: &lt;/b&gt;Processor Module, PIII-CUM, 600, 256K, MMC2, B0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="100%'" height="21" bgcolor="#EAEAEA"&gt;&lt;p align="left"&gt;&lt;b&gt;Compatible Models for Dell DP/N 00FRW, 00000FRW&lt;/b&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td width="100%" height="60"&gt;&lt;p align="left"&gt;&lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-inspiron-3800-parts.html title='Inspiron 3800'&gt;Inspiron 3800&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-inspiron-5000-parts.html title='Inspiron 5000'&gt;Inspiron 5000&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-inspiron-5000e-parts.html title='Inspiron 5000E'&gt;Inspiron 5000E&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-inspiron-7500-parts.html title='Inspiron 7500'&gt;Inspiron 7500&lt;/a&gt;&lt;/font&gt;&lt;/p&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</Caption>
    </Product>
    <Product Id="00gmd">
      <Code>00GMD</Code>
      <Description>Dell 1.44M Floppy Drive/ 8X DVD DVD Combo Unit/ CDR/ Burner - 00GMD</Description>
      <Url>http://www.testtest.com/00gmd.html</Url>
      <Thumb>&lt;img border=0 width=70 height=46 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989454980&gt;</Thumb>
      <Picture>&lt;img border=0 width=721 height=470 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989457233&gt;</Picture>
      <Weight>0.5</Weight>
      <Orderable>YES</Orderable>
      <Taxable>YES</Taxable>
      <Pricing>
        <BasePrice>199.95</BasePrice>
        <LocalizedBasePrice>199.95</LocalizedBasePrice>
        <OrigPrice>329.92</OrigPrice>
        <LocalizedOrigPrice>329.92</LocalizedOrigPrice>
        <SalePrice>199.95</SalePrice>
        <LocalizedSalePrice>199.95</LocalizedSalePrice>
      </Pricing>
      <Path>
        <ProductRef Id="dellparts" Url="http://www.testtest.com/dellparts.html">Dell Parts</ProductRef>
        <ProductRef Id="dellparts-dell-laptops" Url="http://www.testtest.com/dellparts-dell-laptops.html">Dell Laptop Parts</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category" Url="http://www.testtest.com/dell-laptop-parts-by-category.html">Dell Laptop Parts by category</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category-storage" Url="http://www.testtest.com/dell-laptop-parts-by-category-storage.html">Storage</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category-storage-dvd-drives---cd-rw---floppy-drives" Url="http://www.testtest.com/dell-laptop-parts-by-category-storage-dvd-drives---cd-rw---floppy-drives.html">DVD Drives / CD-RW / Floppy Drives</ProductRef>
      </Path>
      <Availability>&lt;iframe name="I1" src="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"      border="0" frameborder="0" title="inventory" width="100%" height="19" scrolling="no" marginheight="0" marginwidth="0"&gt;  &lt;a href="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"&gt; get_inventory &lt;/a&gt; &lt;/iframe&gt;</Availability>
      <Caption>&lt;table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#C0C0C0" width="100%" height="54"&gt; &lt;tr&gt;&lt;td width="100%" height="20" bgcolor="#D2E4FF"&gt;&lt;b&gt;Notes: &lt;/b&gt;Floppy Drive, 8X, COMBO, Toshiba&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="100%'" height="21" bgcolor="#EAEAEA"&gt;&lt;p align="left"&gt;&lt;b&gt;Compatible Models for Dell DP/N 00GMD, 000GMD&lt;/b&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td width="100%" height="60"&gt;&lt;p align="left"&gt;&lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-inspiron-7500-parts.html title='Inspiron 7500'&gt;Inspiron 7500&lt;/a&gt;&lt;/font&gt;&lt;/p&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</Caption>
    </Product>
    <Product Id="00gpm">
      <Code>00GPM</Code>
      <Description>Dell PowerEdge 1400SC PERC2 RAID Controller Card (64MB) - 00GPM</Description>
      <Url>http://www.testtest.com/00gpm.html</Url>
      <Weight>0.5</Weight>
      <Orderable>YES</Orderable>
      <Taxable>YES</Taxable>
      <Pricing>
        <BasePrice>295.00</BasePrice>
        <LocalizedBasePrice>295.00</LocalizedBasePrice>
        <OrigPrice>486.75</OrigPrice>
        <LocalizedOrigPrice>486.75</LocalizedOrigPrice>
        <SalePrice>295.00</SalePrice>
        <LocalizedSalePrice>295.00</LocalizedSalePrice>
      </Pricing>
      <Path>
        <ProductRef Id="dellparts" Url="http://www.testtest.com/dellparts.html">Dell Parts</ProductRef>
        <ProductRef Id="dellparts-dell-servers" Url="http://www.testtest.com/dellparts-dell-servers.html">Dell Server Parts</ProductRef>
        <ProductRef Id="dell-server-parts-by-category" Url="http://www.testtest.com/dell-server-parts-by-category.html">Dell Server Parts by category</ProductRef>
        <ProductRef Id="dell-server-parts-by-category-internal-parts-and-assemblies" Url="http://www.testtest.com/dell-server-parts-by-category-internal-parts-and-assemblies.html">Internal Parts And Assemblies</ProductRef>
        <ProductRef Id="dell-server-parts-by-category-internal-parts-and-assemblies-raid-boards" Url="http://www.testtest.com/dell-server-parts-by-category-internal-parts-and-assemblies-raid-boards.html">Raid Boards</ProductRef>
      </Path>
      <Availability>&lt;iframe name="I1" src="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"      border="0" frameborder="0" title="inventory" width="100%" height="19" scrolling="no" marginheight="0" marginwidth="0"&gt;  &lt;a href="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"&gt; get_inventory &lt;/a&gt; &lt;/iframe&gt;</Availability>
      <Caption>&lt;table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#C0C0C0" width="100%" height="54"&gt; &lt;tr&gt;&lt;td width="100%" height="20" bgcolor="#D2E4FF"&gt;&lt;b&gt;Notes: &lt;/b&gt;PERC2-DC, 64M, LONG&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="100%'" height="21" bgcolor="#EAEAEA"&gt;&lt;p align="left"&gt;&lt;b&gt;Compatible Models for Dell DP/N 00GPM, 000GPM&lt;/b&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td width="100%" height="60"&gt;&lt;p align="left"&gt;&lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-servers-dell-poweredge-1400sc-parts.html title='PowerEdge 1400SC'&gt;PowerEdge 1400SC&lt;/a&gt;&lt;/font&gt;&lt;/p&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</Caption>
    </Product>
    <Product Id="00grh">
      <Code>00GRH</Code>
      <Description>Dell Emi Processor Shield - 00GRH</Description>
      <Url>http://www.testtest.com/00grh.html</Url>
      <Thumb>&lt;img border=0 width=70 height=46 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989516437&gt;</Thumb>
      <Picture>&lt;img border=0 width=721 height=470 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989519463&gt;</Picture>
      <Weight>0.5</Weight>
      <Orderable>YES</Orderable>
      <Taxable>YES</Taxable>
      <Pricing>
        <BasePrice>22.95</BasePrice>
        <LocalizedBasePrice>22.95</LocalizedBasePrice>
        <OrigPrice>37.87</OrigPrice>
        <LocalizedOrigPrice>37.87</LocalizedOrigPrice>
        <SalePrice>22.95</SalePrice>
        <LocalizedSalePrice>22.95</LocalizedSalePrice>
      </Pricing>
      <Path>
        <ProductRef Id="dellparts" Url="http://www.testtest.com/dellparts.html">Dell Parts</ProductRef>
        <ProductRef Id="dellparts-dell-laptops" Url="http://www.testtest.com/dellparts-dell-laptops.html">Dell Laptop Parts</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category" Url="http://www.testtest.com/dell-laptop-parts-by-category.html">Dell Laptop Parts by category</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category-internal-parts-and-assemblies" Url="http://www.testtest.com/dell-laptop-parts-by-category-internal-parts-and-assemblies.html">Internal Parts And Assemblies</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category-internal-parts-and-assemblies-brackets---holders---spacers---plates" Url="http://www.testtest.com/dell-laptop-parts-by-category-internal-parts-and-assemblies-brackets---holders---spacers---plates.html">Brackets / Holders / Spacers / Plates</ProductRef>
      </Path>
      <Availability>&lt;iframe name="I1" src="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"      border="0" frameborder="0" title="inventory" width="100%" height="19" scrolling="no" marginheight="0" marginwidth="0"&gt;  &lt;a href="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"&gt; get_inventory &lt;/a&gt; &lt;/iframe&gt;</Availability>
      <Caption>&lt;table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#C0C0C0" width="100%" height="54"&gt; &lt;tr&gt;&lt;td width="100%" height="20" bgcolor="#D2E4FF"&gt;&lt;b&gt;Notes: &lt;/b&gt;Shield, Shielded, Processor, Metal, CH-ST, V2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="100%'" height="21" bgcolor="#EAEAEA"&gt;&lt;p align="left"&gt;&lt;b&gt;Compatible Models for Dell DP/N 00GRH, 000GRH&lt;/b&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td width="100%" height="60"&gt;&lt;p align="left"&gt;&lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-inspiron-3800-parts.html title='Inspiron 3800'&gt;Inspiron 3800&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-c610-parts.html title='Latitude C610'&gt;Latitude C610&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cpt-s-parts.html title='Latitude CPT S'&gt;Latitude CPT S&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cpx-650gt-parts.html title='Latitude CPX 650GT'&gt;Latitude CPX 650GT&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cpx-j-parts.html title='Latitude CPX J'&gt;Latitude CPX J&lt;/a&gt;&lt;/font&gt;&lt;/p&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</Caption>
    </Product>
    <Product Id="00jkk">
      <Code>00JKK</Code>
      <Description>Dell 60GB Laptop Hard Drive (9.5mm/ 2.5) - 00JKK</Description>
      <Url>http://www.testtest.com/00jkk.html</Url>
      <Thumb>&lt;img border=0 width=70 height=46 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989632201&gt;</Thumb>
      <Picture>&lt;img border=0 width=576 height=375 src=http://us.st11.yimg.com/us.st.yimg.com/I/testtest_1990_3989634989&gt;</Picture>
      <Weight>0.5</Weight>
      <Orderable>YES</Orderable>
      <Taxable>YES</Taxable>
      <Pricing>
        <BasePrice>119.95</BasePrice>
        <LocalizedBasePrice>119.95</LocalizedBasePrice>
        <OrigPrice>197.92</OrigPrice>
        <LocalizedOrigPrice>197.92</LocalizedOrigPrice>
        <SalePrice>119.95</SalePrice>
        <LocalizedSalePrice>119.95</LocalizedSalePrice>
      </Pricing>
      <Path>
        <ProductRef Id="dellparts" Url="http://www.testtest.com/dellparts.html">Dell Parts</ProductRef>
        <ProductRef Id="dellparts-dell-laptops" Url="http://www.testtest.com/dellparts-dell-laptops.html">Dell Laptop Parts</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category" Url="http://www.testtest.com/dell-laptop-parts-by-category.html">Dell Laptop Parts by category</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category-storage" Url="http://www.testtest.com/dell-laptop-parts-by-category-storage.html">Storage</ProductRef>
        <ProductRef Id="dell-laptop-parts-by-category-storage-laptop-hard-drives--9-5mm--2-5--2" Url="http://www.testtest.com/dell-laptop-parts-by-category-storage-laptop-hard-drives--9-5mm--2-5--2.html">Laptop Hard Drives (9.5mm/ 2.5) 2</ProductRef>
      </Path>
      <Availability>&lt;iframe name="I1" src="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"      border="0" frameborder="0" title="inventory" width="100%" height="19" scrolling="no" marginheight="0" marginwidth="0"&gt;  &lt;a href="http://www.test.com/cgi-bin/test_tools/inventorycheck.cgi"&gt; get_inventory &lt;/a&gt; &lt;/iframe&gt;</Availability>
      <Caption>&lt;table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#C0C0C0" width="100%" height="3"&gt; &lt;tr&gt; &lt;td width="132" height="20" bgcolor="#D2E4FF"&gt;&lt;b&gt;Category&lt;/b&gt;&lt;/td&gt; &lt;td width="70%" height="20" bgcolor="#D2E4FF"&gt;Hard Drives&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;  &lt;td width="151" height="26"&gt;&lt;p align="left"&gt;&lt;b&gt;Part Number (s)&lt;/b&gt;&lt;/p&gt; &lt;/td&gt;  &lt;td width="1321" height="26"&gt;00JKK, 000JKK&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td width="151" height="26"&gt; &lt;b&gt;Size&lt;/b&gt;&lt;/td&gt; &lt;td width="1321" height="26"&gt;60GB&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td width="151" height="26"&gt;&lt;b&gt;Height&lt;/b&gt;&lt;/td&gt;  &lt;td width="1321" height="26"&gt; 9.5mm&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td width="151" height="26"&gt;&lt;b&gt;Form Factor&lt;/b&gt;&lt;/td&gt;  &lt;td width="1321" height="26"&gt;2.5&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td width="151" height="26"&gt; &lt;b&gt;RPM&lt;/b&gt;&lt;/td&gt; &lt;td width="1321" height="26"&gt;4.2k&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td width="151" height="26"&gt; &lt;b&gt;Interface&lt;/b&gt;&lt;/td&gt; &lt;td width="1321" height="26"&gt;IDE&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#C0C0C0" width="100%" height="54"&gt; &lt;tr&gt;&lt;td width="100%" height="20" bgcolor="#D2E4FF"&gt;&lt;b&gt;Notes: &lt;/b&gt;HD, 60.0GB, 9.5M, M, IBM, C - 6GB HDD, IBM&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="100%'" height="21" bgcolor="#EAEAEA"&gt;&lt;p align="left"&gt;&lt;b&gt;Compatible Models for Dell DP/N 00JKK, 000JKK&lt;/b&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td width="100%" height="60"&gt;&lt;p align="left"&gt;&lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cp-parts.html title='Latitude CP'&gt;Latitude 
CP&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cpi-366-parts.html title='Latitude CPI 366'&gt;Latitude CPI 366&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cpt--parts.html title='Latitude CPt'&gt;Latitude CPt&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cptc-parts.html title='Latitude CPTC'&gt;Latitude CPTC&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cpx--parts.html title='Latitude CPx'&gt;Latitude CPx&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cpx-650gt-parts.html title='Latitude CPX 650GT'&gt;Latitude CPX 650GT&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-cs-parts.html title='Latitude CS'&gt;Latitude CS&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-csx-h-parts.html title='Latitude CSx H'&gt;Latitude CSx H&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-l400-parts.html title='Latitude L400'&gt;Latitude L400&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-lm-parts.html title='Latitude LM'&gt;Latitude LM&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-lmp-parts.html title='Latitude LMP'&gt;Latitude LMP&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' 
size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-ls-parts.html title='Latitude LS'&gt;Latitude LS&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-lt-parts.html title='Latitude LT'&gt;Latitude LT&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-lx-parts.html title='Latitude LX'&gt;Latitude LX&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-lxp-parts.html title='Latitude LXP'&gt;Latitude LXP&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-xp-parts.html title='Latitude XP'&gt;Latitude XP&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-xpi-parts.html title='Latitude XPI'&gt;Latitude XPI&lt;/a&gt;&lt;/font&gt;, &lt;font face='Times New Roman' size=2&gt;&lt;a href=http://www.testtest.com/dellparts-dell-laptops-dell-latitude-xpi-cd-parts.html title='Latitude XPI CD'&gt;Latitude XPI CD&lt;/a&gt;&lt;/font&gt;&lt;/p&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</Caption>
    </Product>
 
Old 02-24-2008, 07:15 PM   #12
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,986
Blog Entries: 11

Rep: Reputation: 880
Fudge .... :} ... this is even more hideous than I initially thought. It looks
like for these entries the availability data is being fetched on the fly. I don't
think I can trivially coerce awk into going to those web-sites for you ;}

Shall we silently drop them if we find an http: URL (src or href) in them and
replace them with an empty field?

That aside (the ugly dynamic availability, that is), combining the awk script with
digiot's sed idea (done in perl, for its non-greedy regex matching) makes for quite
tidy output.


Modified awk:
Code:
/<Code>/ {
  printf( "%s, ", strip( gensub( /.+Code>([^<]+).*/, "\\1","g")))
}
/<Description>/ {
  printf( "%s, ", strip( gensub( /.+Description>([^<]+).*/, "\\1","g")))
}
/<SalePrice>/ {
  printf( "%s, ", strip( gensub( /.+SalePrice>([^<]+).*/, "\\1","g")))
}
/<Availability>/ {
  if ( $0 !~ /http:/ ) {
    printf( "%s, ", strip( gensub( /.+Availability>([^<]+).*/, "\\1","g")))
  } else {
    printf( "\"\", ")
  }
}
/<Caption>/ {
  printf( "%s, ", strip( gensub( /.+Caption>([^<]+).*/, "\\1","g")))
  printf "\"\n"
}
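# Trim leading and trailing spaces from a field.
function strip(string){
  value=gensub( /^ +/, "","1",string)
  # Work on the already left-trimmed value, not the original string.
  value=gensub( / +$/, "","1",value)
  return value
}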
And for the invocation:
Code:
awk -f test.awk split | perl -pe  's/&lt;[^;]*?&gt;//g'
00EDU, Dell Pentium IV 1.3GHz CPU (Processor Module) - 00EDU, 229.95, "",  Notes: PRC, 80528, 1.3GHZ, 0K, 400FSB, SK - 1.3GHZ PENTIUM IV SOCKETEDReseller Discount on orders of 5 or more CPU / P4 Processor Boards. , "
00FRW, Dell Pentium III 600MZH CPU (Processor Module) - 00FRW, 69.95, "",  Notes: Processor Module, PIII-CUM, 600, 256K, MMC2, B0Compatible Models for Dell DP/N 00FRW, 00000FRW   Inspiron 3800, Inspiron 5000, Inspiron 5000E, Inspiron 7500 , "
00GMD, Dell 1.44M Floppy Drive/ 8X DVD DVD Combo Unit/ CDR/ Burner - 00GMD, 199.95, "",  Notes: Floppy Drive, 8X, COMBO, ToshibaCompatible Models for Dell DP/N 00GMD, 000GMD   Inspiron 7500 , "
00GPM, Dell PowerEdge 1400SC PERC2 RAID Controller Card (64MB) - 00GPM, 295.00, "",  Notes: PERC2-DC, 64M, LONGCompatible Models for Dell DP/N 00GPM, 000GPM   PowerEdge 1400SC , "
00GRH, Dell Emi Processor Shield - 00GRH, 22.95, "",  Notes: Shield, Shielded, Processor, Metal, CH-ST, V2Compatible Models for Dell DP/N 00GRH, 000GRH   Inspiron 3800, Latitude C610, Latitude CPT S, Latitude CPX 650GT, Latitude CPX J , "
00JKK, Dell 60GB Laptop Hard Drive (9.5mm/ 2.5) - 00JKK, 119.95, "",   Category Hard Drives   Part Number (s)   00JKK, 000JKK  Size 60GB Height   9.5mm Form Factor  2.5  RPM 4.2k   Interface IDE Notes: HD, 60.0GB, 9.5M, M, IBM, C - 6GB HDD, IBMCompatible Models for Dell DP/N 00JKK, 000JKK   Latitude CP, Latitude CPI 366, Latitude CPt, Latitude CPTC, Latitude CPx, Latitude CPX 650GT, Latitude CS, Latitude CSx H, Latitude L400, Latitude LM, Latitude LMP, Latitude LS, Latitude LT, Latitude LX, Latitude LXP, Latitude XP, Latitude XPI, Latitude XPI CD , "
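The last field in the output above contains commas of its own, so a strict CSV reader would split it. A possible way around that, sketched here with a hypothetical helper named `csvq` (not part of the script above) and a shortened made-up row, is to wrap such fields in double quotes and double any quotes inside them:

```shell
row=$(awk '
# csvq: hypothetical helper that doubles embedded quotes and wraps the field.
function csvq(s) {
  gsub(/"/, "\"\"", s)
  return "\"" s "\""
}
BEGIN {
  # Shortened sample row; only the last field needs quoting.
  print "00JKK, 119.95, " csvq("Notes: HD, 60.0GB, 9.5M")
}')

printf '%s\n' "$row"
```

This uses only POSIX awk features, so it should not depend on which awk is installed.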


Cheers,
Tink
 
Old 02-24-2008, 07:30 PM   #13
xmrkite
Member
 
Registered: Oct 2006
Location: California, USA
Distribution: Mint 16, Lubuntu 14.04, Mythbuntu 14.04, Kubuntu 13.10, Xubuntu 10.04
Posts: 542

Original Poster
Rep: Reputation: 30
OK, so from the command line I run:

awk -f try1 split | perl -pe 's/&lt;[^;]*?&gt;//g'

try1 is the file where I put your code, and split is the file that contains the XML data... is that right?

Because it returns this for me:

, "",
, "",
, "",
, "",
, "",
, "",


I'm using Ubuntu 7.04.

---Please advise
-Thanks
 
Old 02-24-2008, 07:44 PM   #14
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,986
Blog Entries: 11

Rep: Reputation: 880
Unnnf ... what happens when you omit the perl bit?
The perl part is just there to clean up the HTML tags...

I'm running Slackware 12 here, and it works a treat ;}


Cheers,
Tink
 
Old 02-24-2008, 07:56 PM   #15
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,986
Blog Entries: 11

Rep: Reputation: 880
Ok ... if you're going with the Ubuntu default awk (mawk, which has no gensub;
that function is a gawk extension) you should have seen an error message of sorts.

Code:
awk -f test.awk split | perl -pe  's/&lt;[^;]*?&gt;//g'
awk: test.awk: line 36: function gensub never defined
awk: test.awk: line 36: function gensub never defined
awk: test.awk: line 36: function gensub never defined
awk: test.awk: line 36: function gensub never defined
awk: test.awk: line 36: function gensub never defined
Try apt-get install gawk and then run
Code:
gawk -f test.awk split | perl -pe  's/&lt;[^;]*?&gt;//g'
00EDU, Dell Pentium IV 1.3GHz CPU (Processor Module) - 00EDU, 229.95, "",  Notes: PRC, 80528, 1.3GHZ, 0K, 400FSB, SK - 1.3GHZ PENTIUM IV SOCKETEDReseller Discount on orders of 5 or more CPU / P4 Processor Boards. , "
00FRW, Dell Pentium III 600MZH CPU (Processor Module) - 00FRW, 69.95, "",  Notes: Processor Module, PIII-CUM, 600, 256K, MMC2, B0Compatible Models for Dell DP/N 00FRW, 00000FRW   Inspiron 3800, Inspiron 5000, Inspiron 5000E, Inspiron 7500 , "
00GMD, Dell 1.44M Floppy Drive/ 8X DVD DVD Combo Unit/ CDR/ Burner - 00GMD, 199.95, "",  Notes: Floppy Drive, 8X, COMBO, ToshibaCompatible Models for Dell DP/N 00GMD, 000GMD   Inspiron 7500 , "
00GPM, Dell PowerEdge 1400SC PERC2 RAID Controller Card (64MB) - 00GPM, 295.00, "",  Notes: PERC2-DC, 64M, LONGCompatible Models for Dell DP/N 00GPM, 000GPM   PowerEdge 1400SC , "
00GRH, Dell Emi Processor Shield - 00GRH, 22.95, "",  Notes: Shield, Shielded, Processor, Metal, CH-ST, V2Compatible Models for Dell DP/N 00GRH, 000GRH   Inspiron 3800, Latitude C610, Latitude CPT S, Latitude CPX 650GT, Latitude CPX J , "
00JKK, Dell 60GB Laptop Hard Drive (9.5mm/ 2.5) - 00JKK, 119.95, "",   Category Hard Drives   Part Number (s)   00JKK, 000JKK  Size 60GB Height   9.5mm Form Factor  2.5  RPM 4.2k   Interface IDE Notes: HD, 60.0GB, 9.5M, M, IBM, C - 6GB HDD, IBMCompatible Models for Dell DP/N 00JKK, 000JKK   Latitude CP, Latitude CPI 366, Latitude CPt, Latitude CPTC, Latitude CPx, Latitude CPX 650GT, Latitude CS, Latitude CSx H, Latitude L400, Latitude LM, Latitude LMP, Latitude LS, Latitude LT, Latitude LX, Latitude LXP, Latitude XP, Latitude XPI, Latitude XPI CD , "
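If installing gawk is not an option, the extraction can probably be done without gensub at all. The following is only a sketch under assumptions: the two-line sample stands in for the real split file, and `field` is a hypothetical helper name. It relies on match(), substr(), sub() and gsub(), which POSIX awk (and so mawk) provides:

```shell
# Made-up two-line sample standing in for the real "split" file.
sample='      <Code>00JKK</Code>
      <SalePrice> 119.95 </SalePrice>'

result=$(printf '%s\n' "$sample" | awk '
# field: hypothetical helper returning the text that follows <tag>.
function field(tag, line,    otag) {
  otag = "<" tag ">"
  if (!match(line, otag)) return ""
  line = substr(line, RSTART + RLENGTH)   # keep what follows the opening tag
  sub(/<.*/, "", line)                    # cut at the next "<"
  gsub(/^ +| +$/, "", line)               # trim surrounding spaces
  return line
}
/<Code>/      { printf "%s, ", field("Code", $0) }
/<SalePrice>/ { printf "%s\n", field("SalePrice", $0) }
')

printf '%s\n' "$result"
```

The other elements would follow the same pattern, one rule per tag.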

Cheers,
Tink
 
  

