LinuxQuestions.org


DIMonS 10-15-2007 11:08 AM

Extract "itemTitle" from ebay web page
 
I am trying to extract the itemTitle of an ebay item from the html page that I have saved and use it in another part of a script.

I have come as far as:

cat ebayISAPI.dll\......html | grep class\=\"itemTitle\"\> which gives me the block of text containing the one instance of the phrase itemTitle that I want to use ....

but I can't seem to get | sed -n '/itemTitle/,/h1/p' to work at all, despite going almost blind reading the man pages and examples that I have found.

I think I am going along the right lines, but confirmation would be helpful so I can continue my research.

The intention is to then rename the html file to the title of the item, which I think I have sussed.

TY

DIMonS 10-16-2007 02:52 AM

It is sort of working
 
A little research has revealed that sed -n '/itemTitle/,/h1/p' is working, in that it prints the whole line that includes the start expression. So it is doing the same as grep class\=\"itemTitle\"\>. Any pointers on getting just the text between the start and end expressions?

TY
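
A note on the behaviour described above: a sed address range such as /itemTitle/,/h1/ selects whole lines, so when both patterns sit on the same long line, that entire line is printed. One possible way to keep only the text between the tags is a substitution with a capture group. This is a minimal sketch, not a general HTML parser; the function name item_title and the sample line are made up for illustration, and it assumes the title contains no "<" and sits on one line.

```shell
#!/bin/sh
# Print only the text inside a non-empty <h1 class="itemTitle"> element.
# s/.../\1/ replaces the whole line with the captured title; -n together
# with the trailing p prints only lines where the substitution succeeded.
# [^<][^<]* requires at least one character, so an empty <h1></h1> pair
# is skipped.
item_title() {
    sed -n 's/.*<h1 class="itemTitle">\([^<][^<]*\)<\/h1>.*/\1/p'
}

# Made-up one-line sample in the same shape as a saved page fragment.
printf '%s\n' '<td><h1 class="itemTitle"></h1></td><td><h1 class="itemTitle">Example item title</h1></td>' | item_title
```

Run as written, this prints "Example item title". If a line held several non-empty headings, only the last one on that line would be printed, because the leading .* is greedy.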

ghostdog74 10-16-2007 04:10 AM

Give a sample of that html page, as well as the things you want to get out of it.

DIMonS 10-16-2007 05:50 AM

This is the last part of the output from the grep

....imgsrc="http://pics.ebaystatic.com/aw/pics/globalAssets/ltCurve.gif" width="8" height="8"></td><td></td><td class="titlePadding"><h1 class="itemTitle"></h1></td><td width="100%" class="titlePadding"><h1 class="itemTitle">WW2 RAF Spitfire secret signalling transmitter</h1></td><td align="right" nowrap>

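It is just the bold part (the text inside the second <h1 class="itemTitle"> element, here "WW2 RAF Spitfire secret signalling transmitter") that I want. It obviously changes with each new file, and I want to be able to use it to rename the same file in another script that I found on this web site. Awesome resource, don't you think!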

DIMonS 11-05-2007 07:27 AM

Nearly there.

So .... awk 'NR>1&&$0=RS$1$2$3' RS="itemTitle\">" filename works a treat and gives the result WW2RAFSpitfire

Adding a $4 appends the next space-delimited word, e.g. WW2RAFSpitfiresecret.

But for the gold plated version .... what can I do to add all the words up to the </h1>, or should I cut my losses and go with what I have?

DIMonS
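
For anyone wondering why the words run together in the output above: awk splits each record on whitespace by default, and writing fields side by side ($1$2$3) concatenates them with nothing in between. A small sketch of the difference, with made-up function names and a sample string taken from the title in this thread:

```shell
#!/bin/sh
# Adjacent fields with no comma are concatenated directly.
joined() { awk '{ print $1 $2 $3 }'; }

# A comma between fields inserts the output field separator,
# which is a single space unless OFS is changed.
spaced() { awk '{ print $1, $2, $3 }'; }

printf '%s\n' 'WW2 RAF Spitfire secret' | joined
printf '%s\n' 'WW2 RAF Spitfire secret' | spaced
```

The first call prints WW2RAFSpitfire and the second prints WW2 RAF Spitfire. Either way the number of words has to be known in advance, which is why splitting on something other than whitespace is the more practical route for a title of unknown length.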

angrybanana 11-05-2007 11:29 AM

AWK
Just use '<' as the field separator and grab the first field.
Code:

awk -F'<' 'NR>1&&$0=$1' RS='<h1 class="itemTitle">'
perl or sed might be better for this.
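
One caveat on the awk line above: it relies on a multi-character RS, which gawk and mawk treat as a regular expression but which POSIX leaves unspecified, so other awks may behave differently. A possible alternative that sticks to index() and substr() is sketched below. The function name item_titles is invented for this example, and it assumes each heading opens and closes on the same line.

```shell
#!/bin/sh
# Print every non-empty itemTitle heading, one per line, without
# relying on a multi-character record separator.
item_titles() {
    awk '
    {
        line = $0
        otag = "<h1 class=\"itemTitle\">"
        ctag = "</h1>"
        # Walk through each opening tag found on the line.
        while ((start = index(line, otag)) > 0) {
            # Drop everything up to and including the opening tag.
            line = substr(line, start + length(otag))
            stop = index(line, ctag)
            if (stop == 0) break            # no closing tag on this line
            if (stop > 1) print substr(line, 1, stop - 1)
            # Continue scanning after the closing tag.
            line = substr(line, stop + length(ctag))
        }
    }' "$@"
}
```

Usage would be along the lines of item_titles saved_page.html, or piping a page into item_titles with no arguments.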

edit:
Perl
Code:

perl -lne 'print for m{<h1 class="itemTitle">(.*?)</h1>}g'
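
Since the stated goal of the thread is to rename the saved file after its title, here is a rough sketch of how the extraction could feed a rename. Everything in it is an assumption for illustration rather than the script the original poster used: the function names, the choice of sed for extraction, the rule for which characters are safe in a file name, and the .html suffix.

```shell
#!/bin/sh
# Sketch: rename a saved page after its item title.
# All function names here are illustrative.

# First non-empty itemTitle heading in the file named by $1.
extract_title() {
    sed -n 's/.*<h1 class="itemTitle">\([^<][^<]*\)<\/h1>.*/\1/p' "$1" | head -n 1
}

# Replace anything outside letters, digits, dot, underscore and
# hyphen with "_" so the title is safe to use as a file name.
safe_name() {
    printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

rename_by_title() {
    file=$1
    title=$(extract_title "$file")
    if [ -z "$title" ]; then
        echo "no itemTitle found in $file" >&2
        return 1
    fi
    target=$(dirname -- "$file")/$(safe_name "$title").html
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    mv -- "$file" "$target"
}

# Usage: sh rename.sh saved_page.html
if [ "$#" -gt 0 ]; then
    rename_by_title "$1"
fi
```

With the sample line posted earlier in the thread, the file would end up named WW2_RAF_Spitfire_secret_signalling_transmitter.html in the same directory. The sketch refuses to overwrite an existing file instead of guessing at a unique name.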

DIMonS 11-06-2007 06:36 AM

TY V much all.

DIMonS


All times are GMT -5.