How to strip a specific part of text from a larger file?
Hey all, I'm trying to wget a file, strip out a part of it, and paste it in to another file. I can do the wget bit easy, but the stripping is kinda harder
And I want to strip that down to just "<table>unique text ... unique text</table>" and I'm not sure where to start.
Using bash, I've tried grep, which is no good as it just gets me a single line and I can't say "keep going until you get to this line". head and tail are no good as the line counts in the rest of the source file will change, as will the line count of everything between the unique text. awk and sed don't seem like they'll be much help either, nor split, nor anything I've seen so far.
I have access to PHP if that helps. Actually, I eventually want to paste all the text into a PHP file, but that shouldn't be too hard I hope. Anyway, thanks for any pointers.
Very close, but it doesn't deal with line breaks, i.e.
bleh1 start bleh2 end bleh3
gives "bleh2". but
bleh1
start bleh2
end bleh3
doesn't give anything. Is there an easy way to solve that? Thanks for the tip though.
EDIT: getting closer...
Quote:
cat text | sed -n -e ":a" -e "$ s/\n//gp;N;b a" | sed -n 's/^.*<table>unique\(.*\)text<\/table>.*$/\1/p'
sed'ing multiple lines looked hard, so I got rid of the line breaks instead. I'm not exactly sure what it does, and it could mess up in some cases (if there's no whitespace then words become joined), but in my case it's pretty much OK as anything before and after a line break is HTML.
EDIT 2: meh, that first sed command is stripping a lot more than line breaks; half the file's text is missing. Halp! Please?
EDIT 3:
Quote:
while read line; do echo -n "$line "; done <infile >outfile
Helpful, but when I cat the resulting text file there are still lots of bits missing, and sed doesn't see any more than cat does.
Last edited by pepsi_max2k; 03-26-2009 at 01:00 PM.
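A possible alternative to the `while read` loop above, offered as a sketch rather than something anyone in the thread ran: `tr` can turn every newline into a space, after which a single-line sed pattern can see both markers. The sample text and the `start`/`end` markers come from the example in the post above; the variable names are made up for illustration, and the carriage returns in the sample are an assumption about what a downloaded file might contain.

```shell
# Sample text from the post above, with Windows-style line endings added
# as an assumption about what a downloaded page might contain.
input=$(printf 'bleh1\r\nstart bleh2\r\nend bleh3\r\n')

# Drop carriage returns, then replace each newline with a space so the
# whole text becomes one long line.
joined=$(printf '%s\n' "$input" | tr -d '\r' | tr '\n' ' ')

# With everything on one line, a single-line pattern can match both markers.
result=$(printf '%s\n' "$joined" | sed -n 's/^.*start \(.*\) end.*$/\1/p')
echo "$result"
```

If a file does have carriage returns in it, terminal output can look as though text is missing or overwritten, so removing them first may be worth a try.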
If you have PHP, it should be easy to do the whole thing in there. If you really want to use sed, here is a script
Code:
/<table>/,/<\/table>/{
N
s/.*\(<table>\)/\1/
s/\(<\/table>\).*/\1/
s/\n/ /
P
}
This says match across lines starting with "<table>" and ending with "</table>". Suck up the next line. Remove any text before the "<table>" and after the "</table>". Remove any newlines in the match. Print the result to stdout.
Name this pepsi.sed or something equally meaningful. Assuming your aforementioned text is in pepsi.txt, you can use
Code:
[machine:~]:sed -n -f pepsi.sed < pepsi.txt
to produce
Code:
<table>unique text blah blah blah blah blah unique text</table>
The "-n" option to sed tells it to suppress the automatic output of unmatched lines.
Last edited by David1357; 03-26-2009 at 01:03 PM.
Reason: Added explanation of sed code for removing newlines.
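To see what the `N` and `s/\n/ /` steps do on their own, here is a small sketch with made-up input (the four words are placeholders, not from the thread). `N` appends the next input line to the pattern space, so lines are consumed two at a time and joined in pairs.

```shell
# Made-up four-line input, just to show how N pairs lines up.
paired=$(printf 'one\ntwo\nthree\nfour\n' | sed -n '
N
s/\n/ /
P
')
echo "$paired"
```

Because each cycle swallows two lines, and a range's end pattern is only tested when a new cycle starts, an end marker that `N` pulls in as the second half of a pair can go unnoticed. That is worth checking if a range seems to run past where it should stop.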
Thanks, I'll give it a go in a bit, looks good though
FWIW, if I were to use a purely PHP method, what sort of things should I be looking at? I just tried something with an array and preg_match_all but wasn't really getting anywhere, and I wasn't sure it was the right way to go about it anyway, so I gave up after an hour of looking for clues.
It's actually to grab some HTML from one web page and stick it on a different web page, so it'd be simple to do it all in PHP in that one page if I knew how.
Last edited by pepsi_max2k; 03-26-2009 at 01:43 PM.
Thanks for the help guys, can't seem to get any method to work properly yet but getting closer. I guess I should put the actual code I'm working with below, so excuse the mess.
Running pixellany's code:
Quote:
sed -n '/SINGSTAR/,/singing/{s/^.*SINGSTAR/SINGSTAR/p; s/singing.*$/singing/p}' log
outputs:
Quote:
SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
SINGSTAR POPWORLD PlayStation 2 Game singing
Should just be "SINGSTAR POPWORLD PlayStation 2 Game singing". And trying to do multiple lines is even worse, missing out loads of text.
Quote:
sed -n '/SINGSTAR/,/NARUTO/{s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log
Quote:
SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
<td>NARUTO
David1357's:
Quote:
/SINGSTAR/,/singing/{
N
s/.*\(SINGSTAR\)/\1/
s/\(singing\).*/\1/
s/\n/ /
P
}
Quote:
SINGSTAR POPWORLD PlayStation 2 Game singing
<tr bgcolor="">
<td align="left">19-Mar-09</td>/cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=220380924436">220380924436</a></td>
<td align="right">£3.99</td>
<td align="left">E NINJNo Bids Yettatio</td>ame (PS2 PS3)</td>
<tr bgcolor="">
and proceeds to output the rest of the file (inc weird character additions, as above)
:| I'm stumped.
Quote:
Custom text.....<br /><br />
<a name="sales"></a><center>
To protect bidder privacy, when the price or highest bid on an item reaches or exceeds a certain level, User IDs will be displayed as anonymous names. For auction items, a bold price means at least one bid has been received. <br><br><b>Note:</b> Anonymous names may appear more than once and may represent different bidders.<br><br><table border="1" width="
concat($Attributes/tablewidth,'%')
" cellpadding="2">
<th bgcolor="#cccccc" align="center">
Title
</th>
<th bgcolor="#cccccc" align="center">
High Bidder / Status
</th>
</tr>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=220380910274">220380910274</a></td>
<td align="left">19-Mar-09</td>
<td>29-Mar-09 19:45:34</td>
<td align="right"><b>£3.20</b></td>
<td>SINGSTAR POPWORLD PlayStation 2 Game singing (PS2 PS3)</td>
<td align="left"><a href="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidderProfile&mode=1&item=220380910274&aid=2&eu=EKfHyOJV8ns5rY4sRZ 0Lacr3gfQXfj%2Ft">Bidder 2</a><img alt="Feedback score is 500 to 999" title="Feedback score is 500 to 999" src="http://pics.ebaystatic.com/aw/pics/uk/icon/iconPurpleStar_25x25.gif" width="25" align="absmiddle" border="0" height="25"><span> <img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"></span></td>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=220380924436">220380924436</a></td>
<td align="left">19-Mar-09</td>
<td>29-Mar-09 20:01:51</td>
<td align="right">£3.99</td>
<td>NARUTO ULTIMATE NINJA 2 - PlayStation 2 Game (PS2 PS3)</td>
<td align="left"> No Bids Yet </td>
</tr>
<tr bgcolor="">
<td align="left"><a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=220380940639">220380940639</a></td>
<td align="left">19-Mar-09</td>
<td>29-Mar-09 20:20:28</td>
<td align="right"><b>£1.99</b></td>
<td>MONSTER HUNTER PlayStation 2 Game capcom online RPG PS2</td>
<td align="left"><a href="http://offer.ebay.co.uk/ws/eBayISAPI.dll?ViewBidderProfile&mode=1&item=220380940639&aid=1&eu=In1%2BJ5pznJvqvcIM IUfUlI3g%2F5YHMy%2F2">Bidder 1</a><span> <img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"><img src="http://pics.ebaystatic.com/aw/pics/uk/icon/iconNewId_16x16.gif" alt="New eBay Member (less than 30 days)" width="16" border="0" height="16"><img src="http://pics.ebaystatic.com/aw/pics/s.gif" width="4" border="0"></span></td>
</tr>
</table><br>Go see all current <a href="http://cgi6.ebay.co.uk/ws/eBayISAPI.dll?ViewSellersOtherItems&userid=inaudible-games&include=0&sort=3&rows=25&since=-1&rd=1">items for sale</a> by this member.<br><br></center><br /><br />
<a name="rate"></a>Custom text here....
I'm trying to get the code in between the <table> after the first paragraph, up to its closing </table> tag. There's loads of other text and code before and after it too, in case you wondered.
EDIT:
Quote:
sed -n '/SINGSTAR/,/NARUTO/{s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log
is actually almost right. I'm just losing all the lines in between the start and end values. Any way to keep them?
EDIT 2: ohhh I love it when things are this simple thanks David1357 and pixellany - pixel missed out an N so...
Quote:
sed -n '/SINGSTAR/,/NARUTO/{N s/^.*SINGSTAR/SINGSTAR/p; s/NARUTO.*$/NARUTO/p}' log
seems to work a treat. Now to get this all echoed into a PHP file properly... tomorrow maybe.
EDIT: spoke too soon. trying to grab longer bits just messes up, with still more lines being missed off and still weird characters appearing
Last edited by pepsi_max2k; 03-26-2009 at 03:53 PM.
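On the lines going missing between the start and end markers: the `p` flag on an `s` command prints only when that substitution actually replaces something, so under `sed -n` the untouched middle lines are never printed. Below is a rough sketch comparing that with a single `p` command at the end of the block. The `START` and `END` markers and the sample text are placeholders, not taken from the thread.

```shell
# Placeholder sample: junk before START, junk after END, one line in between.
sample='x START a
middle
b END y'

# p as a flag on each s command: only lines where a substitution
# happened are printed, so "middle" is dropped.
with_flags=$(printf '%s\n' "$sample" | sed -n '/START/,/END/{
s/^.*START/START/p
s/END.*$/END/p
}')

# p as its own command: every line in the range is printed after trimming.
with_command=$(printf '%s\n' "$sample" | sed -n '/START/,/END/{
s/^.*START/START/
s/END.*$/END/
p
}')

printf 'flag version:\n%s\n\ncommand version:\n%s\n' "$with_flags" "$with_command"
```

The scripts are written one command per line so they should not depend on how a particular sed treats `;` before a closing brace.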
Yay, works. I think. Hopefully. Thanks very much. And sorry, I should have made it more obvious in the original sample that there was extra text (and lines) in between the stuff I wanted to rip.
And I've got a basic understanding of some basic commands, I just didn't really know what to start changing (started with N, quickly figured that was wrong). Obviously it was just a case of moving a p around. It would have taken me days to figure that out though, so thanks for the help.
edit:
Quote:
sed -n '/<\/caption>/,/<\/table><br>Go see/{s/^.*<\/caption>/<table>/; s/<\/table><br>Go see.*$/<\/table>/; p}' log
Is perfect for the main code I put above, it just needs a stray Â sed'ing out next to the prices and I'm good to go. Thanks again all.
edit 2: awk it is then.
Quote:
cat log | awk '{ gsub(/Â/, ""); print }' > log2
Last edited by pepsi_max2k; 03-26-2009 at 04:33 PM.
// Get HTML from URL.
$url = "http://members.ebay.co.uk/ws/eBayISAPI.dll?ViewUserPage&userid=MY_USER_ID";
// Double quotes so that $url is interpolated into the error message.
$input = @file_get_contents($url) or die("Could not access file: $url");
// Match section of text in HTML and store in array.
preg_match("/bidders.<br><br><table(.*?)<\/table><br>Go see/smi", $input, $s);