LinuxQuestions.org


pepsi_max2k 03-26-2009 10:31 AM

How to strip a specific part of text from a larger file?
 
Hey all, I'm trying to wget a file, strip out a part of it, and paste it into another file. I can do the wget bit easily, but the stripping is kinda harder :(

File will look a little like:

Code:

blah blah blah ablahs fergerg
rgedrgjdfhgblah blah blah <table>unique text blah blah
blah blah blah unique text</table> blah blah
blah blah

And I want to strip that down to just "<table>unique text ... unique text</table>" and I'm not sure where to start.

Using bash, I've tried grep, which is no good as it just gets me a single line and I can't say "keep going until you get to this line". head and tail are no good either, as the line counts in the rest of the source file will change, as will the line count of everything between the unique text. awk and sed don't seem like they'll be much help either, nor split, nor anything else I've seen so far.

I have access to php if that helps. Actually, I eventually want to paste all the text into a php file, but that shouldn't be too hard I hope. Anyway, thanks for any pointers.

Hko 03-26-2009 11:14 AM

This very similar thread may help.

pepsi_max2k 03-26-2009 11:52 AM

very close, but it doesn't deal with line breaks. ie.

bleh1 start bleh2 end bleh3

gives "bleh2". but

bleh1
start bleh2
end bleh3

doesn't give anything :( is there an easy way to solve that? thanks for the tip though :)


EDIT: getting closer...


Quote:

cat text | sed -n -e ":a" -e "$ s/\n//gp;N;b a" | sed -n 's/^.*<table>unique\(.*\)text<\/table>.*$/\1/p'

sed'ing multiple lines looked hard, so I got rid of the line breaks instead ;) Not exactly sure what it does, and it could mess up in some cases (if there's no white space then words become joined), but in my case it's pretty much ok as anything before and after a line break is html :)

EDIT 2: meh, that first sed command is stripping a lot more than line breaks, half the file text is missing :( halp! please?

EDIT 3:

Quote:

while read line; do echo -n "$line "; done <infile >outfile
helpful, but when I cat the resulting text file there's still lots of bits missing, and sed doesn't see any more than cat :(
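
A possible explanation for the "missing" text, offered as a guess since nothing in this thread confirms it: pages fetched from the web often use CRLF line endings, and a carriage return left inside a joined line makes the terminal jump back to column 0 and overwrite what it already printed. A minimal sketch that drops carriage returns before joining the lines, using the bleh sample from above as stand-in input (the CRLF endings are an assumption):

```shell
# Stand-in input: the "bleh" sample with assumed Windows-style (CRLF) line endings.
# tr -d '\r'   removes the carriage returns
# tr '\n' ' '  joins the lines, putting a space where each line break was
# sed          then extracts the text between "start" and "end" on the single joined line
result=$(printf 'bleh1\r\nstart bleh2\r\nend bleh3\r\n' \
  | tr -d '\r' \
  | tr '\n' ' ' \
  | sed -n 's/^.*start \(.*\) end.*$/\1/p')
printf '%s\n' "$result"
```

Because tr swaps each newline for a space, words on either side of a line break are not glued together.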

David1357 03-26-2009 01:02 PM

Quote:

Originally Posted by pepsi_max2k (Post 3488515)
I have access to php if that helps.

If you have PHP, it should be easy to do the whole thing in there. If you really want to use sed, here is a script
Code:

/<table>/,/<\/table>/{
    N
    s/.*\(<table>\)/\1/
    s/\(<\/table>\).*/\1/
    s/\n/ /
    P
}

This says match across lines starting with "<table>" and ending with "</table>". Suck up the next line. Remove any text before the "<table>" and after the "</table>". Remove any newlines in the match. Print the result to stdout.

Name this pepsi.sed or something equally meaningful. Assuming your aforementioned text is in pepsi.txt, you can use
Code:

[machine:~]:sed -n -f pepsi.sed < pepsi.txt
to produce
Code:

<table>unique text blah blah blah blah blah unique text</table>
The "-n" option tells sed to suppress its automatic printing of every line, so only what the script prints explicitly (here with "P") reaches stdout.

pepsi_max2k 03-26-2009 01:42 PM

Thanks, I'll give it a go in a bit, looks good though :)

fwiw, if I were to use a purely php method, what sort of things should I be looking at? I just tried something with an array and preg_match_all but I wasn't really getting anywhere, and wasn't sure if it was the right way to go about it anyway so gave up after an hour looking for clues :(

it's actually to grab some html from one web page and stick it on a different web page, so it'd be simple to do it all in php in that one page if I knew how.

pixellany 03-26-2009 02:07 PM

Code:

sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p}' filename
Note: Those are not capital Vs. Each "\/" is an escaped "/", which is needed because "/" is also the delimiter for sed addresses and the s command.
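
If the backslashes get hard to read, sed also accepts other delimiters: the s command uses whatever character follows the s, and an address can use a custom delimiter when it is introduced with a backslash. A small sketch with # as the delimiter, run on made-up two-line input (the sample text is an assumption, not the real page):

```shell
# Same command as above, but with '#' as the delimiter so "</table>" needs no escaping.
# \#regex# is the custom-delimiter form of an address; s#...#...#p is the s command.
result=$(printf 'junk <table>first row\nlast row</table> junk\n' |
  sed -n '\#<table>#,\#</table>#{s#^.*<table>#<table>#p; s#</table>.*$#</table>#p;}')
printf '%s\n' "$result"
```

Any character that does not appear in the patterns works as the delimiter; # is just a convenient choice for HTML.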

pepsi_max2k 03-26-2009 03:17 PM

Thanks for the help guys, can't seem to get any method to work properly yet but getting closer. I guess I should put the actual code I'm working with below, so excuse the mess.

Running pixellany's code:

Quote:

sed -n '/SINGSTAR/,/singing/{s/^.*SINGSTAR/SINGSTAR/p; s/singing.*$/singing/p}' log
outputs:

Quote:

SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
SINGSTAR POPWORLD PlayStation 2 Game singing
Should just be "SINGSTAR POPWORLD PlayStation 2 Game singing". And trying to do multiple lines is even worse, missing out loads of text.

Quote:

sed -n '/SINGSTAR/,/NARUTO/{s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log
Quote:

SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
<td>NARUTO







David1357's:

Quote:

/SINGSTAR/,/singing/{
N
s/.*\(SINGSTAR\)/\1/
s/\(singing\).*/\1/
s/\n/ /
P
}

Quote:

SINGSTAR POPWORLD PlayStation 2 Game singing
<tr bgcolor="">
<td align="left">19-Mar-09</td>/cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&amp;item=220380924436">220380924436</a></td>
<td align="right">£3.99</td>
<td align="left">E NINJNo Bids Yettatio</td>ame (PS2 PS3)</td>
<tr bgcolor="">
and proceeds to output the rest of the file (inc weird character additions, as above)

:| I'm stumped.



Quote:

Custom text.....<br /><br />

<a name="sales"></a><center>
To protect bidder privacy, when the price or highest bid on an item reaches or exceeds a certain level, User IDs will be displayed as anonymous names. For auction items, a bold price means at least one bid has been received. <br><br><b>Note:</b> Anonymous names may appear more than once and may represent different bidders.<br><br><table border="1" width="
concat($Attributes/tablewidth,'%')
" cellpadding="2">

<caption></caption>
<tr>
<th bgcolor="#cccccc" align="center">
Item
</th>
<th bgcolor="#cccccc" align="center">
Start
</th>
<th bgcolor="#cccccc" align="center">
End
</th>
<th bgcolor="#cccccc" align="center">
Price
</th>

<th bgcolor="#cccccc" align="center">
Title
</th>
<th bgcolor="#cccccc" align="center">
High Bidder / Status
</th>
</tr>
</tr>

<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&amp;item=220380910274">220380910274</a></td>
<td align="left">19-Mar-09</td>
<td>29-Mar-09 19:45:34</td>
<td align="right"><b>£3.20</b></td>
<td>SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
<td align="left"><a href="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidderProfile&amp;mode=1&amp;item=220380910274&amp;aid=2&amp;eu=EKfHyOJV8ns5rY4sRZ 0Lacr3gfQXfj%2Ft">Bidder 2</a><img alt="Feedback score is 500 to 999" title="Feedback score is 500 to 999" src="http://pics.ebaystatic.com/aw/pics/uk/icon/iconPurpleStar_25x25.gif" width="25" align="absmiddle" border="0" height="25"><span> <img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"></span></td>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&amp;item=220380924436">220380924436</a></td>

<td align="left">19-Mar-09</td>
<td>29-Mar-09 20:01:51</td>
<td align="right">£3.99</td>
<td>NARUTO ULTIMATE NINJA 2 - PlayStation 2 Game (PS2 PS3)</td>
<td align="left"> No Bids Yet </td>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&amp;item=220380940639">220380940639</a></td>
<td align="left">19-Mar-09</td>

<td>29-Mar-09 20:20:28</td>
<td align="right"><b>£1.99</b></td>
<td>MONSTER HUNTER PlayStation 2 Game capcom online RPG PS2</td>
<td align="left"><a href="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidderProfile&amp;mode=1&amp;item=220380940639&amp;aid=1&amp;eu=In1%2BJ5pznJvqvcIM IUfUlI3g%2F5YHMy%2F2">Bidder 1</a><span> <img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"><img src="http://pics.ebaystatic.com/aw/pics/uk/icon/iconNewId_16x16.gif" alt="New eBay Member (less than 30 days)" width="16" border="0" height="16"><img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"></span></td>
</tr>

</table><br>Go see all current <a href="http://cgi6.ebay.co.uk/ws/eBayISAPI.dll?ViewSellersOtherItems&amp;userid=inaudible-games&amp;include=0&amp;sort=3&amp;rows=25&amp;since=-1&amp;rd=1">items for sale</a> by this member.<br><br></center><br /><br />

<a name="rate"></a>Custom text here....


I'm trying to get the code in between the <table> after the first paragraph, up to the closing </table> tag for it. There's loads of other text and code before and after it too, in case you wondered.



EDIT:
Quote:

sed -n '/SINGSTAR/,/NARUTO/{s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log
is actually almost right. I'm just losing all the lines in between the start and end values, any way to keep 'em?

EDIT 2: ohhh I love it when things are this simple :) thanks David1357 and pixellany - pixel missed out an N so...

Quote:

sed -n '/SINGSTAR/,/NARUTO/{N s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log

seems to work a treat :) Now to get this all echoed in to a php file properly... tomorrow maybe :)

EDIT: spoke too soon. trying to grab longer bits just messes up, with still more lines being missed off and still weird characters appearing :(

pixellany 03-26-2009 03:46 PM

My code worked for the first sample you gave (blah blah blah).

To help you deal with this, have you dissected the code to see what each piece does?

Here is a quick overview:
Code:

sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p}' filename
-n Don't print unless told to

/<table>/,/<\/table>/ address range--do the following for every line in this range

{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p} two commands grouped--for every line in the range, do both of these

s/^.*<table>/<table>/p replace everything from the beginning of the line up to and including the LAST "<table>"---with "<table>", then print

s/<\/table>.*$/<\/table>/p replace everything from the first "</table>" to the end of the line---with "</table>", then print

Thus, the lines where no substitution is done will not print!!

Try this:
Code:

sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/; s/<\/table>.*$/<\/table>/; p}' filename
Now it should print every line in the range, regardless of whether it was modified.
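
To see the effect, here is a quick sketch that runs this command against the sample from the first post. The scratch file from mktemp is only there for illustration:

```shell
# Recreate the sample from the first post in a scratch file (the path is arbitrary).
sample=$(mktemp)
cat > "$sample" <<'EOF'
blah blah blah ablahs fergerg
rgedrgjdfhgblah blah blah <table>unique text blah blah
blah blah blah unique text</table> blah blah
blah blah
EOF

# Trim text before <table> and after </table>, then print every line in the range.
result=$(sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/; s/<\/table>.*$/<\/table>/; p;}' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

This should print the two lines of the table with the surrounding blah text trimmed off. The ";" before the closing "}" is optional in GNU sed but keeps the command portable to other seds.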

pepsi_max2k 03-26-2009 04:06 PM

yay, works. I think. hopefully. :) thanks very much. and sorry, should have made the original sample more obvious that there was extra text (and lines) in between the stuff I wanted to rip.

And I've got a basic understanding of some basic commands, just didn't really know what to start changing (started with N, quickly figured that was wrong). obviously it was just a case of moving a p around. would have taken me days to figure that out though, so thanks for the help :)



edit:


Quote:

sed -n '/<\/caption>/,/<\/table><br>Go see/{s/^.*<\/caption>/<table>/; s/<\/table><br>Go see.*$/<\/table>/; p}' log

Is perfect for the main code I put above, just needs a stray Â sed'ing out next to the prices and I'm good to go :) Thanks again all.

edit 2: awk it is then.

Quote:

cat log | awk '{ gsub(/Â/, ""); print }' > log2

ghostdog74 03-26-2009 07:34 PM

Quote:

Originally Posted by pepsi_max2k (Post 3488684)

fwiw, if I were to use a purely php method, what sort of things should I be looking at?

with PHP
Code:

preg_match("/<table>(.*?)<\/table>/smi",$string,$s);

pepsi_max2k 03-27-2009 04:00 AM

Cheers Ghost, finally ended up with:

PHP Code:

<?php

// Get HTML from url.
$url = "http://members.ebay.co.uk/ws/eBayISAPI.dll?ViewUserPage&userid=MY_USER_ID";
// Double quotes so that $url is interpolated into the error message.
$input = @file_get_contents($url) or die("Could not access file: $url");

// Match section of text in HTML and store in array.
preg_match("/bidders.<br><br><table(.*?)<\/table><br>Go see/smi", $input, $s);

// Remove un-needed text and images.
$preoutput = $s[0];
$remove = array('/Â/', '/bidders.<br><br>/', '/<br>Go see/', '/<img(.*?)>/');
$output = preg_replace($remove, '', $preoutput);

// Print text.
echo $output;

?>


Which works a treat. Now I just have to decide if I want this running on every page load, or use the bash way to run via cron once a day :|


All times are GMT -5.