How to strip a specific part of text from a larger file?

pepsi_max2k · 03-26-2009, 10:31 AM

Hey all, I'm trying to wget a file, strip out a part of it, and paste it in to another file. I can do the wget bit easy, but the stripping is kinda harder

File will look a little like:

Code:

blah blah blah ablahs fergerg
rgedrgjdfhgblah blah blah <table>unique text blah blah
blah blah blah unique text</table> blah blah
blah blah

And I want to strip that down to just "<table>unique text ... unique text</table>" and I'm not sure where to start.

Using bash, I've tried grep, which is no good as it just gets me a single line and I can't say "keep going until you get to this line". head and tail are no good as the line counts in the rest of the source file will change, as will the line count of everything between the unique text. awk, sed don't seem like they'll be much help either, or split, or anything I've seen so far.

I have access to php if that helps. Actually, I eventually want to paste all the text into a php file, but that shouldn't be too hard I hope. Anyway, thanks for any pointers.

Hko · 03-26-2009, 11:14 AM

This very similar thread may help.

pepsi_max2k · 03-26-2009, 11:52 AM

very close, but it doesn't deal with line breaks. ie.

bleh1 start bleh2 end bleh3

gives "bleh2". but

bleh1
start bleh2
end bleh3

doesn't give anything

is there an easy way to solve that? thanks for the tip though

EDIT: getting closer...

Quote:

cat text | sed -n -e ":a" -e "$ s/\n//gp;N;b a" | sed -n 's/^.*<table>unique\(.*\)text<\/table>.*$/\1/p'

sed'ing multiple lines looked hard, so i got rid of the lines instead

not exactly sure what it does, and it could mess up in some cases (if there's no white space then words become joined), but in my case it's pretty much ok as anything before and after a line break is html

EDIT 2: meh, that first sed command is stripping a lot more than linebreaks, half the file text is missing

halp! please?

EDIT 3:

Quote:

while read line; do echo -n "$line "; done <infile >outfile

helpful, but when I cat the resulting text file there's still lots of bits missing, and sed doesn't see any more than cat
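A minimal sketch of the "get rid of the line breaks" idea, assuming the sample text from the first post. `tr` turns every newline into a space (so words on either side of a break don't get joined), and a single-line sed substitution then keeps only the table. The variable names are made up for illustration.

```shell
# Sample input from the first post (assumed; a real page will be messier).
sample='blah blah blah ablahs fergerg
rgedrgjdfhgblah blah blah <table>unique text blah blah
blah blah blah unique text</table> blah blah
blah blah'

# tr replaces every newline with a space, so the text becomes one long line.
# sed then keeps everything from the last "<table>" through the last "</table>".
joined=$(printf '%s\n' "$sample" | tr '\n' ' ' |
    sed -n 's/^.*\(<table>.*<\/table>\).*$/\1/p')

printf '%s\n' "$joined"
```

Because `.*` is greedy, this only behaves if the text contains a single table; with several tables it would grab from the last `<table>` to the last `</table>`.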

David1357 · 03-26-2009, 01:02 PM

Quote:

Originally Posted by pepsi_max2k

I have access to php if that helps.

If you have PHP, it should be easy to do the whole thing in there. If you really want to use sed, here is a script

Code:

/<table>/,/<\/table>/{
    N
    s/.*\(<table>\)/\1/
    s/\(<\/table>\).*/\1/
    s/\n/ /
    P
}

This says: for the lines starting at the one containing "<table>" and ending at the one containing "</table>", suck up the next line (N). Remove any text before the "<table>" and after the "</table>". Replace the newline between the two lines with a space. Print the result to stdout.

Name this pepsi.sed or something equally meaningful. Assuming your aforementioned text is in pepsi.txt, you can use

Code:

[machine:~]:sed -n -f pepsi.sed < pepsi.txt

to produce

Code:

<table>unique text blah blah blah blah blah unique text</table>

The "-n" option to sed tells it to suppress the automatic printing of every line, so only what the script explicitly prints (here with P) shows up.
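To see what `-n` changes, here is a tiny sketch with three invented input lines. Without `-n`, sed prints every line automatically and `p` prints the matching line a second time; with `-n`, only the explicit `p` output appears.

```shell
# Three made-up input lines.
input='one
two
three'

# Without -n: every line is auto-printed, and "two" is printed again by p.
without_n=$(printf '%s\n' "$input" | sed '/two/p')

# With -n: auto-print is off, so only the line printed by p appears.
with_n=$(printf '%s\n' "$input" | sed -n '/two/p')

printf 'without -n:\n%s\n\nwith -n:\n%s\n' "$without_n" "$with_n"
```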

pepsi_max2k · 03-26-2009, 01:42 PM

Thanks, I'll give it a go in a bit, looks good though

fwiw, if I were to use a purely php method, what sort of things should I be looking at? I just tried something with an array and preg_match_all but I wasn't really getting anywhere, and wasn't sure if it was the right way to go about it anyway so gave up after an hour looking for clues

it's actually to grab some html from one web page and stick it on a different web page, so it'd be simple to do it all in php in that one page if I knew how.

pixellany · 03-26-2009, 02:07 PM

Code:

sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p}' filename

Note: Those are not capital Vs---the "/" in "</table>" has to be escaped because "/" is also the sed s delimiter.
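The escaping in the note above can be avoided by picking a different delimiter. A sketch, assuming the sample text from the first post: `s` accepts other delimiter characters, and an address regex can use one too if it is introduced with a backslash (`\|regex|`). The variable names are only for illustration.

```shell
# Sample text from the first post (assumed input).
sample='blah blah blah ablahs fergerg
rgedrgjdfhgblah blah blah <table>unique text blah blah
blah blah blah unique text</table> blah blah
blah blah'

# Original form: "/" is the delimiter, so the "/" in </table> must be escaped.
escaped=$(printf '%s\n' "$sample" |
    sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p;}')

# Same script with "|" as the delimiter: no escaping needed.
# An address regex with a custom delimiter starts with a backslash: \|regex|
piped=$(printf '%s\n' "$sample" |
    sed -n '\|<table>|,\|</table>|{s|^.*<table>|<table>|p; s|</table>.*$|</table>|p;}')

printf '%s\n' "$piped"
[ "$escaped" = "$piped" ] && echo "both forms print the same lines"
```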

pepsi_max2k · 03-26-2009, 03:17 PM

Thanks for the help guys, can't seem to get any method to work properly yet but getting closer. I guess I should put the actual code I'm working with below, so excuse the mess.

Running pixellany's code:

Quote:

sed -n '/SINGSTAR/,/singing/{s/^.*SINGSTAR/SINGSTAR/p; s/singing.*$/singing/p}' log

outputs:

Quote:

SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
SINGSTAR POPWORLD PlayStation 2 Game singing

Should just be "SINGSTAR POPWORLD PlayStation 2 Game singing". And trying to do multiple lines is even worse, missing out loads of text.

Quote:

sed -n '/SINGSTAR/,/NARUTO/{s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log

Quote:

SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
<td>NARUTO

David1357's:

Quote:

/SINGSTAR/,/singing/{
N
s/.*\(SINGSTAR\)/\1/
s/\(singing\).*/\1/
s/\n/ /
P
}

Quote:

SINGSTAR POPWORLD PlayStation 2 Game singing
<tr bgcolor="">
<td align="left">19-Mar-09</td>/cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=220380924436">220380924436</a></td>
<td align="right">Â£3.99</td>
<td align="left">E NINJNo Bids Yettatio</td>ame (PS2 PS3)</td>
<tr bgcolor="">
and proceeds to output the rest of the file (inc weird character additions, as above)

:| I'm stumped.

Quote:

Custom text.....<br /><br />

<a name="sales"></a><center>
To protect bidder privacy, when the price or highest bid on an item reaches or exceeds a certain level, User IDs will be displayed as anonymous names. For auction items, a bold price means at least one bid has been received. <br><br><b>Note:</b> Anonymous names may appear more than once and may represent different bidders.<br><br><table border="1" width="
concat($Attributes/tablewidth,'%')
" cellpadding="2">

<caption></caption>
<tr>
<th bgcolor="#cccccc" align="center">
Item
</th>
<th bgcolor="#cccccc" align="center">
Start
</th>
<th bgcolor="#cccccc" align="center">
End
</th>
<th bgcolor="#cccccc" align="center">
Price
</th>

<th bgcolor="#cccccc" align="center">
Title
</th>
<th bgcolor="#cccccc" align="center">
High Bidder / Status
</th>
</tr>
</tr>

<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=220380910274">220380910274</a></td>
<td align="left">19-Mar-09</td>
<td>29-Mar-09 19:45:34</td>
<td align="right"><b>£3.20</b></td>
<td>SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
<td align="left"><a href="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidderProfile&mode=1&item=220380910274&aid=2&eu=EKfHyOJV8ns5rY4sRZ 0Lacr3gfQXfj%2Ft">Bidder 2</a><img alt="Feedback score is 500 to 999" title="Feedback score is 500 to 999" src="http://pics.ebaystatic.com/aw/pics/uk/icon/iconPurpleStar_25x25.gif" width="25" align="absmiddle" border="0" height="25"><span> <img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"></span></td>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=220380924436">220380924436</a></td>

<td align="left">19-Mar-09</td>
<td>29-Mar-09 20:01:51</td>
<td align="right">£3.99</td>
<td>NARUTO ULTIMATE NINJA 2 - PlayStation 2 Game (PS2 PS3)</td>
<td align="left"> No Bids Yet </td>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=220380940639">220380940639</a></td>
<td align="left">19-Mar-09</td>

<td>29-Mar-09 20:20:28</td>
<td align="right"><b>£1.99</b></td>
<td>MONSTER HUNTER PlayStation 2 Game capcom online RPG PS2</td>
<td align="left"><a href="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidderProfile&mode=1&item=220380940639&aid=1&eu=In1%2BJ5pznJvqvcIM IUfUlI3g%2F5YHMy%2F2">Bidder 1</a><span> <img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"><img src="http://pics.ebaystatic.com/aw/pics/uk/icon/iconNewId_16x16.gif" alt="New eBay Member (less than 30 days)" width="16" border="0" height="16"><img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"></span></td>
</tr>

</table><br>Go see all current <a href="http://cgi6.ebay.co.uk/ws/eBayISAPI.dll?ViewSellersOtherItems&userid=inaudible-games&include=0&sort=3&rows=25&since=-1&rd=1">items for sale</a> by this member.<br><br></center><br /><br />

<a name="rate"></a>Custom text here....

I'm trying to get the code in between the <table> after the first paragraph, up to the closing </table> tag for it. There's loads of other text and code before and after it too, in case you wondered.

EDIT:

Quote:

sed -n '/SINGSTAR/,/NARUTO/{s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log

is actually almost right. I'm just losing all the lines in between the start and end values, any way to keep 'em?

EDIT 2: ohhh I love it when things are this simple

thanks David1357 and pixellany - pixel missed out an N so...

Quote:

sed -n '/SINGSTAR/,/NARUTO/{N s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log

seems to work a treat

Now to get this all echoed in to a php file properly... tomorrow maybe

EDIT: spoke too soon. trying to grab longer bits just messes up, with still more lines being missed off and still weird characters appearing

pixellany · 03-26-2009, 03:46 PM

My code worked for the first sample you gave (blah blah blah).

To help you deal with this, have you dissected the code to see what each piece does?

Here is a quick overview:

Code:

sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p}' filename

-n Don't print unless told to

/<table>/,/<\/table>/ address range--do the following for every line in this range

{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p} two commands grouped--for every line in the range, do both of these

s/^.*<table>/<table>/p replace everything from the beginning of the line up to and including the LAST "<table>" with "<table>", then print

s/<\/table>.*$/<\/table>/p replace everything from the first "</table>" to the end of the line with "</table>", then print

Thus, the lines where no substitution is done will not print!!

Try this:

Code:

sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/; s/<\/table>.*$/<\/table>/; p}' filename

Now it should print every line in the range, regardless of whether it was modified.
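A quick way to see the difference between the two commands is to run both on input that has an untouched line between the tags. This is a sketch with made-up sample text; the variable names are only for illustration.

```shell
# Made-up input: the middle row has neither tag, so no substitution touches it.
sample='junk before <table>first row
middle row
last row</table> junk after
more junk'

# p attached to each s: only lines where a substitution happened are printed.
only_changed=$(printf '%s\n' "$sample" |
    sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p;}')

# One p at the end of the group: every line in the range is printed.
whole_range=$(printf '%s\n' "$sample" |
    sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/; s/<\/table>.*$/<\/table>/; p;}')

printf 'p on each s:\n%s\n\nsingle p:\n%s\n' "$only_changed" "$whole_range"
```

One assumption worth stating: the start and end patterns are on different lines. The end of a range is only checked starting with the line after the one that opened it, so if both patterns sit on the same line the range runs on to a later match (or to the end of the file).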

pepsi_max2k · 03-26-2009, 04:06 PM

yay, works. i think. hopefully.

thanks very much. and sorry, should have made the original sample more obvious that there was extra text (and lines) inbetween the stuff I wanted to rip.

And I've got a basic understanding of some basic commands, just didn't really know what to start changing (started with N, quickly figured that was wrong). obviously it was just a case of moving a p around. would have taken me days to figure that out though, so thanks for the help

edit:

Quote:

sed -n '/<\/caption>/,/<\/table><br>Go see/{s/^.*<\/caption>/<table>/; s/<\/table><br>Go see.*$/<\/table>/; p}' log

Is perfect for the main code I put above, just needs an Â sed'ing out next to the prices and I'm good to go

Thanks again all.

edit 2: awk it is then.

Quote:

cat log | awk '{ gsub(/Â/, ""); print }' > log2

ghostdog74 · 03-26-2009, 07:34 PM

Quote:

Originally Posted by pepsi_max2k

fwiw, if I were to use a purely php method, what sort of things should I be looking at?

with PHP

Code:

preg_match("/<table>(.*?)<\/table>/smi",$string,$s);

pepsi_max2k · 03-27-2009, 04:00 AM

Cheers Ghost, finally ended up with:

PHP Code:



<?php

// Get HTML from url.
$url = "http://members.ebay.co.uk/ws/eBayISAPI.dll?ViewUserPage&userid=MY_USER_ID";
// Double quotes so that $url is interpolated into the error message.
$input = @file_get_contents($url) or die("Could not access file: $url");

// Match section of text in HTML and store in array.
preg_match("/bidders.<br><br><table(.*?)<\/table><br>Go see/smi",$input,$s);

// Remove un-needed text and images.
$preoutput = $s[0];
$remove = array('/Â/', '/bidders.<br><br>/', '/<br>Go see/', '/<img(.*?)>/');
$output = preg_replace($remove, '', $preoutput);

// Print text.
echo $output;

?>

Which works a treat. Now I just have to decide if I want this running on every page load, or use the bash way to run via cron once a day :|
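For the cron route, a rough sketch of what the daily script might look like. Everything named here is a placeholder (the `extract_table` function, the output path, the schedule), and the fetch itself is left as a comment so the snippet can be tried offline on a saved or made-up copy of the page.

```shell
#!/bin/sh
# extract_table: read HTML on stdin and print each block of lines that runs
# from a line containing "<table" to the next line containing "</table>",
# trimming anything before the opening tag and after the closing tag.
# (Hypothetical helper; the patterns would need adjusting for the real page.)
extract_table() {
    sed -n '/<table/,/<\/table>/{s/^.*<table/<table/; s/<\/table>.*$/<\/table>/; p;}'
}

# Offline try-out on a made-up page fragment.
page='intro text<br><table border="1">
<tr><td>row</td></tr>
</table><br>Go see all current items
footer'
snippet=$(printf '%s\n' "$page" | extract_table)
printf '%s\n' "$snippet"

# In the real script the input would come from wget instead, e.g.
#   wget -q -O - "$url" | extract_table > /path/to/table.inc
# and a crontab entry such as
#   0 6 * * * /path/to/fetch-table.sh
# would run it once a day at 06:00.
```

The PHP page would then only need to include the saved file, so the remote fetch happens once a day rather than on every page load.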