LinuxQuestions.org
Old 03-26-2009, 11:31 AM   #1
pepsi_max2k
LQ Newbie
 
Registered: Apr 2007
Posts: 6

Rep: Reputation: 0
How to strip a specific part of text from a larger file?


Hey all, I'm trying to wget a file, strip out a part of it, and paste it into another file. I can do the wget bit easily, but the stripping is kinda harder.

File will look a little like:

Code:
blah blah blah ablahs fergerg
rgedrgjdfhgblah blah blah <table>unique text blah blah
blah blah blah unique text</table> blah blah
blah blah
And I want to strip that down to just "<table>unique text ... unique text</table>" and I'm not sure where to start.

Using bash, I've tried grep, which is no good as it just gets me a single line and I can't say "keep going until you get to this line". head and tail are no good either, as the line counts in the rest of the source file will change, as will the line count of everything between the unique text. awk and sed don't seem like they'll be much help, nor does split, or anything else I've seen so far.

I have access to php if that helps. Actually, I eventually want to paste all the text into a PHP file, but that shouldn't be too hard, I hope. Anyway, thanks for any pointers.
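
For what it's worth, sed's range addresses are one way to say "start at this line and keep going until that line". A minimal sketch, assuming the sample text above is saved to a scratch file (the temp file is only for illustration). It prints the whole lines from the one containing <table> through the one containing </table>, so the junk before and after the tags on those lines still needs trimming:

```shell
# Scratch copy of the sample text from the post (illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
blah blah blah ablahs fergerg
rgedrgjdfhgblah blah blah <table>unique text blah blah
blah blah blah unique text</table> blah blah
blah blah
EOF

# -n suppresses the default output; p prints only lines inside the range.
range=$(sed -n '/<table>/,/<\/table>/p' "$sample")
printf '%s\n' "$range"

rm -f "$sample"
```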
 
Old 03-26-2009, 12:14 PM   #2
Hko
Senior Member
 
Registered: Aug 2002
Location: Groningen, The Netherlands
Distribution: ubuntu
Posts: 2,530

Rep: Reputation: 108
This very similar thread may help.
 
Old 03-26-2009, 12:52 PM   #3
pepsi_max2k
LQ Newbie
 
Registered: Apr 2007
Posts: 6

Original Poster
Rep: Reputation: 0
very close, but it doesn't deal with line breaks. ie.

bleh1 start bleh2 end bleh3

gives "bleh2". but

bleh1
start bleh2
end bleh3

doesn't give anything. Is there an easy way to solve that? Thanks for the tip though.


EDIT: getting closer...


Quote:
cat text | sed -n -e ":a" -e "$ s/\n//gp;N;b a" | sed -n 's/^.*<table>unique\(.*\)text<\/table>.*$/\1/p'

sed'ing multiple lines looked hard, so I got rid of the line breaks instead. Not exactly sure what it does, and it could mess up in some cases (if there's no whitespace then words become joined), but in my case it's pretty much OK as anything before and after a line break is HTML.

EDIT 2: meh, that first sed command is stripping a lot more than line breaks, half the file text is missing. Help, please?

EDIT 3:

Quote:
while read line; do echo -n "$line "; done <infile >outfile
Helpful, but when I cat the resulting text file there are still lots of bits missing, and sed doesn't see any more than cat does.
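
A rough sketch of the "get rid of the line breaks first" idea, using tr instead of a sed loop (tr is a swap-in here, not what was tried above): turn every newline into a space, then pull out the text between the two markers with an ordinary single-line sed. The start/end markers and the sample text are the toy ones from this post.

```shell
# Toy input from the post: the markers sit on different lines.
joined=$(printf 'bleh1\nstart bleh2\nend bleh3\n' | tr '\n' ' ')

# With everything on one line, a plain single-line match is enough.
result=$(printf '%s\n' "$joined" | sed -n 's/^.*start \(.*\) end.*$/\1/p')
printf '%s\n' "$result"
```

Words on either side of a line break get a space between them this way, so nothing is glued together. The downside is that the whole file becomes one long line.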

Last edited by pepsi_max2k; 03-26-2009 at 02:00 PM.
 
Old 03-26-2009, 02:02 PM   #4
David1357
Senior Member
 
Registered: Aug 2007
Location: South Carolina, U.S.A.
Distribution: Ubuntu, Fedora Core, Red Hat, SUSE, Gentoo, DSL, coLinux, uClinux
Posts: 1,302
Blog Entries: 1

Rep: Reputation: 107
Quote:
Originally Posted by pepsi_max2k
I have access to php if that helps.
If you have PHP, it should be easy to do the whole thing in there. If you really want to use sed, here is a script
Code:
/<table>/,/<\/table>/{
    N
    s/.*\(<table>\)/\1/
    s/\(<\/table>\).*/\1/
    s/\n/ /
    P
}
This says match across lines starting with "<table>" and ending with "</table>". Suck up the next line. Remove any text before the "<table>" and after the "</table>". Remove any newlines in the match. Print the result to stdout.

Name this pepsi.sed or something equally meaningful. Assuming your aforementioned text is in pepsi.txt, you can use
Code:
[machine:~]:sed -n -f pepsi.sed < pepsi.txt
to produce
Code:
<table>unique text blah blah blah blah blah unique text</table>
The "-n" option to sed tells it to suppress the automatic output of unmatched lines.
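
For comparison, a rough sketch of another way to end up with everything on one line: collect the lines of the range in sed's hold space and do the trimming once, when the closing tag turns up. This is an alternative to the script above, not a copy of it, and it assumes a single block whose <table> and </table> are on different lines. The scratch file stands in for pepsi.txt.

```shell
# Scratch copy of the sample text (stands in for pepsi.txt).
sample=$(mktemp)
cat > "$sample" <<'EOF'
blah blah blah ablahs fergerg
rgedrgjdfhgblah blah blah <table>unique text blah blah
blah blah blah unique text</table> blah blah
blah blah
EOF

# H appends each line in the range to the hold space. On the closing
# line, x swaps the collected text into the pattern space, where the
# newlines become spaces and the text outside the tags is trimmed.
result=$(sed -n '/<table>/,/<\/table>/{
    H
    /<\/table>/{
        x
        s/\n/ /g
        s/^.*<table>/<table>/
        s/<\/table>.*$/<\/table>/
        p
    }
}' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Because the joining happens only once, at the end of the range, it doesn't matter whether the range has an odd or even number of lines.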

Last edited by David1357; 03-26-2009 at 02:03 PM. Reason: Added explanation of sed code for removing newlines.
 
Old 03-26-2009, 02:42 PM   #5
pepsi_max2k
LQ Newbie
 
Registered: Apr 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks, I'll give it a go in a bit. Looks good though.

fwiw, if I were to use a purely php method, what sort of things should I be looking at? I just tried something with an array and preg_match_all but I wasn't really getting anywhere, and wasn't sure if it was the right way to go about it anyway, so I gave up after an hour of looking for clues.

It's actually to grab some HTML from one web page and stick it on a different web page, so it'd be simple to do it all in PHP in that one page if I knew how.

Last edited by pepsi_max2k; 03-26-2009 at 02:43 PM.
 
Old 03-26-2009, 03:07 PM   #6
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 729
Code:
sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p}' filename
Note: the "\/" sequences are not capital Vs. The "/" in "</table>" has to be escaped because "/" is also the delimiter of the sed s command.
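
On the escaping: sed's s command accepts other delimiters, which avoids the backslashes altogether. A small sketch on a made-up one-line sample, showing that both spellings give the same result:

```shell
# Made-up sample line (illustrative only).
line='junk <table>keep this</table> more junk'

# Default "/" delimiter: the "/" in "</table>" must be escaped.
with_slash=$(printf '%s\n' "$line" | sed 's/<\/table>.*$/<\/table>/')

# "|" as the delimiter: no escaping needed.
with_pipe=$(printf '%s\n' "$line" | sed 's|</table>.*$|</table>|')

printf '%s\n%s\n' "$with_slash" "$with_pipe"
```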

Last edited by pixellany; 03-26-2009 at 03:09 PM.
 
Old 03-26-2009, 04:17 PM   #7
pepsi_max2k
LQ Newbie
 
Registered: Apr 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks for the help guys, can't seem to get any method to work properly yet but getting closer. I guess I should put the actual code I'm working with below, so excuse the mess.

Running pixellany's code:

Quote:
sed -n '/SINGSTAR/,/singing/{s/^.*SINGSTAR/SINGSTAR/p; s/singing.*$/singing/p}' log
outputs:

Quote:
SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
SINGSTAR POPWORLD PlayStation 2 Game singing
Should just be "SINGSTAR POPWORLD PlayStation 2 Game singing". And trying to do multiple lines is even worse, missing out loads of text.

Quote:
sed -n '/SINGSTAR/,/NARUTO/{s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log
Quote:
SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
<td>NARUTO







David1357's:

Quote:
/SINGSTAR/,/singing/{
N
s/.*\(SINGSTAR\)/\1/
s/\(singing\).*/\1/
s/\n/ /
P
}

Quote:
SINGSTAR POPWORLD PlayStation 2 Game singing
<tr bgcolor="">
<td align="left">19-Mar-09</td>/cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&amp;item=220380924436">220380924436</a></td>
<td align="right">£3.99</td>
<td align="left">E NINJNo Bids Yettatio</td>ame (PS2 PS3)</td>
<tr bgcolor="">
and proceeds to output the rest of the file (inc weird character additions, as above)

:| I'm stumped.



Quote:
Custom text.....<br /><br />

<a name="sales"></a><center>
To protect bidder privacy, when the price or highest bid on an item reaches or exceeds a certain level, User IDs will be displayed as anonymous names. For auction items, a bold price means at least one bid has been received. <br><br><b>Note:</b> Anonymous names may appear more than once and may represent different bidders.<br><br><table border="1" width="
concat($Attributes/tablewidth,'%')
" cellpadding="2">

<caption></caption>
<tr>
<th bgcolor="#cccccc" align="center">
Item
</th>
<th bgcolor="#cccccc" align="center">
Start
</th>
<th bgcolor="#cccccc" align="center">
End
</th>
<th bgcolor="#cccccc" align="center">
Price
</th>

<th bgcolor="#cccccc" align="center">
Title
</th>
<th bgcolor="#cccccc" align="center">
High Bidder / Status
</th>
</tr>
</tr>

<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&amp;item=220380910274">220380910274</a></td>
<td align="left">19-Mar-09</td>
<td>29-Mar-09 19:45:34</td>
<td align="right"><b>3.20</b></td>
<td>SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
<td align="left"><a href="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidderProfile&amp;mode=1&amp;item=220380910274&amp;aid=2&amp;eu=EKfHyOJV8ns5rY4sRZ 0Lacr3gfQXfj%2Ft">Bidder 2</a><img alt="Feedback score is 500 to 999" title="Feedback score is 500 to 999" src="http://pics.ebaystatic.com/aw/pics/uk/icon/iconPurpleStar_25x25.gif" width="25" align="absmiddle" border="0" height="25"><span> <img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"></span></td>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&amp;item=220380924436">220380924436</a></td>

<td align="left">19-Mar-09</td>
<td>29-Mar-09 20:01:51</td>
<td align="right">3.99</td>
<td>NARUTO ULTIMATE NINJA 2 - PlayStation 2 Game (PS2 PS3)</td>
<td align="left"> No Bids Yet </td>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&amp;item=220380940639">220380940639</a></td>
<td align="left">19-Mar-09</td>

<td>29-Mar-09 20:20:28</td>
<td align="right"><b>1.99</b></td>
<td>MONSTER HUNTER PlayStation 2 Game capcom online RPG PS2</td>
<td align="left"><a href="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidderProfile&amp;mode=1&amp;item=220380940639&amp;aid=1&amp;eu=In1%2BJ5pznJvqvcIM IUfUlI3g%2F5YHMy%2F2">Bidder 1</a><span> <img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"><img src="http://pics.ebaystatic.com/aw/pics/uk/icon/iconNewId_16x16.gif" alt="New eBay Member (less than 30 days)" width="16" border="0" height="16"><img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"></span></td>
</tr>

</table><br>Go see all current <a href="http://cgi6.ebay.co.uk/ws/eBayISAPI.dll?ViewSellersOtherItems&amp;userid=inaudible-games&amp;include=0&amp;sort=3&amp;rows=25&amp;since=-1&amp;rd=1">items for sale</a> by this member.<br><br></center><br /><br />

<a name="rate"></a>Custom text here....


I'm trying to get the code in between the <table> after the first paragraph, up to the closing </table> tag for it. There's loads of other text and code before and after it too, in case you wondered.



EDIT:
Quote:
sed -n '/SINGSTAR/,/NARUTO/{s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log
is actually almost right. I'm just losing all the lines in between the start and end values. Any way to keep 'em?

EDIT 2: ohhh, I love it when things are this simple. Thanks David1357 and pixellany - pixel missed out an N so...

Quote:
sed -n '/SINGSTAR/,/NARUTO/{N s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log

seems to work a treat. Now to get this all echoed into a PHP file properly... tomorrow maybe.

EDIT: spoke too soon. Trying to grab longer bits just messes up, with still more lines being missed off and still weird characters appearing.
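
One guess at the "weird characters" (only an assumption, nothing in the thread confirms it): pages fetched from the web often have DOS-style CRLF line endings, and once lines are joined the leftover carriage returns make the terminal jump back to the start of the line and overwrite what was already printed. A minimal sketch for checking for carriage returns and stripping them with tr before handing the text to sed, using a made-up two-line sample:

```shell
# Made-up sample with CRLF line endings (illustrative only).
crlf_sample=$(mktemp)
printf '<td>one</td>\r\n<td>two</td>\r\n' > "$crlf_sample"

# Count carriage-return bytes; anything above 0 means CRs are present.
cr_before=$(tr -cd '\r' < "$crlf_sample" | wc -c)

# Strip the CRs first, then join the lines with spaces.
joined=$(tr -d '\r' < "$crlf_sample" | tr '\n' ' ')
cr_after=$(printf '%s' "$joined" | tr -cd '\r' | wc -c)

echo "CR bytes before: $cr_before, after: $cr_after"
echo "$joined"

rm -f "$crlf_sample"
```

If the count on the real log file comes back as 0, carriage returns are not the culprit and this can be ignored.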

Last edited by pepsi_max2k; 03-26-2009 at 04:53 PM.
 
Old 03-26-2009, 04:46 PM   #8
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 729
My code worked for the first sample you gave (blah blah blah).

To help you deal with this, have you dissected the code to see what each piece does?

Here is a quick overview:
Code:
sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p}' filename
-n Don't print unless told to

/<table>/,/<\/table>/ address range--do the following for every line in this range

{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p} two commands grouped--for every line in the range, do both of these

s/^.*<table>/<table>/p replace everything from the beginning of the line up to and including the LAST "<table>" with "<table>", then print

s/<\/table>.*$/<\/table>/p replace everything from the first "</table>" to the end of the line with "</table>", then print

Thus, the lines where no substitution is done will not print!!

Try this:
Code:
sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/; s/<\/table>.*$/<\/table>/; p}' filename
Now it should print every line in the range, regardless of whether it was modified.
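
A quick way to see the difference between the two versions is to run both on a small made-up sample where the range spans three lines (the sample text is only for illustration):

```shell
# Made-up three-line sample (illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
junk <table>first
middle row
last</table> junk
EOF

# p attached to each s: only lines where a substitution happened are printed.
first=$(sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/p; s/<\/table>.*$/<\/table>/p;}' "$sample")

# One p at the end of the group: every line in the range is printed once.
second=$(sed -n '/<table>/,/<\/table>/{s/^.*<table>/<table>/; s/<\/table>.*$/<\/table>/; p;}' "$sample")

printf 'first version:\n%s\n\nsecond version:\n%s\n' "$first" "$second"

rm -f "$sample"
```

The per-substitution p also accounts for the doubled SINGSTAR line reported earlier: when a single line contains both the start and the end text, both substitutions succeed and each one prints.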
 
Old 03-26-2009, 05:06 PM   #9
pepsi_max2k
LQ Newbie
 
Registered: Apr 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Yay, works. I think. Hopefully. Thanks very much. And sorry, I should have made it more obvious in the original sample that there was extra text (and lines) in between the stuff I wanted to rip.

And I've got a basic understanding of some basic commands, I just didn't really know what to start changing (started with N, quickly figured that was wrong). Obviously it was just a case of moving a p around. It would have taken me days to figure that out though, so thanks for the help.



edit:


Quote:
sed -n '/<\/caption>/,/<\/table><br>Go see/{s/^.*<\/caption>/<table>/; s/<\/table><br>Go see.*$/<\/table>/; p}' log

Is perfect for the main code I put above, just needs an sed'ing out next to the prices and I'm good to go. Thanks again all.

edit 2: awk it is then.

Quote:
cat log | awk '{ gsub(//, ""); print }' > log2

Last edited by pepsi_max2k; 03-26-2009 at 05:33 PM.
 
Old 03-26-2009, 08:34 PM   #10
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241
Quote:
Originally Posted by pepsi_max2k

fwiw, if I were to use a purely php method, what sort of things should I be looking at?
with PHP
Code:
preg_match("/<table>(.*?)<\/table>/smi",$string,$s);
 
Old 03-27-2009, 05:00 AM   #11
pepsi_max2k
LQ Newbie
 
Registered: Apr 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Cheers Ghost, finally ended up with:

PHP Code:
<?php

// Get HTML from url.
$url = "http://members.ebay.co.uk/ws/eBayISAPI.dll?ViewUserPage&userid=MY_USER_ID";
// Double quotes so that $url is interpolated into the error message.
$input = @file_get_contents($url) or die("Could not access file: $url");

// Match section of text in HTML and store in array.
preg_match("/bidders.<br><br><table(.*?)<\/table><br>Go see/smi", $input, $s);

// Remove un-needed text and images.
$preoutput = $s[0];
$remove = array('//', '/bidders.<br><br>/', '/<br>Go see/', '/<img(.*?)>/');
$output = preg_replace($remove, '', $preoutput);

// Print text.
echo $output;

?>

Which works a treat. Now I just have to decide if I want this running on every page load, or use the bash way to run via cron once a day :|
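
If it ends up being the cron route, a rough sketch of what the daily script might look like, built around the sed command from post #9. Everything around that command is an assumption: the URL, the output path and the schedule are placeholders, and the wget line is left commented out so the function can be tried on a local sample first.

```shell
#!/bin/sh
# extract_table: read HTML on stdin, print only the bidding table.
# The sed expression is the one from earlier in the thread.
extract_table() {
    sed -n '/<\/caption>/,/<\/table><br>Go see/{s/^.*<\/caption>/<table>/; s/<\/table><br>Go see.*$/<\/table>/; p;}'
}

# Small made-up sample shaped like the real page (illustrative only).
sample='intro text<table border="1">
<caption></caption>
<tr><td>row</td></tr>
</table><br>Go see all current items
trailing text'

table=$(printf '%s\n' "$sample" | extract_table)
printf '%s\n' "$table"

# Hypothetical real use (URL and output path are placeholders):
#   wget -q -O - "http://example.com/page" | extract_table > /path/to/table.html
# and a crontab entry to run the script once a day at 04:00:
#   0 4 * * * /path/to/fetch-table.sh
```

Running it from cron means the page is fetched once a day instead of on every page load, at the cost of the table being up to a day old.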

Last edited by pepsi_max2k; 03-27-2009 at 06:07 AM.
 
  


All times are GMT -5. The time now is 05:24 AM.
