LinuxQuestions.org
12-06-2011, 08:59 PM   #1
ted_chou12
Member
 
Registered: Aug 2010
Location: Zhongli, Taoyuan
Distribution: slackware, windows, debian (armv4l GNU/Linux)
Posts: 425
Blog Entries: 28

Rep: Reputation: 2
sed match html content (multiple lines)


Hi, I asked a similar question this afternoon about matching within a single line, but this time I want to match across multiple lines:
Quote:
<div class="mn">
<eoaitehoait html input textarea></atae.t.awet.awe
t.>all the content I want................yes here.
random tags with unknown number of lines.s..s..s.<p><font></font>
<p>....</p>
<p>....QRWRW@$@</p>
<p>♫♪♬</p>
<h1 class="mt">random content here</h1>
The block starts with this line: <div class="mn"> (I am pretty sure this is the only content of that line)
and ends with this line: <h1 class="mt">random content here</h1> (this is the only content of that line).
I came up with:
Code:
sed -rn 's@.*<div class="mn">\(.*\)<h1 class="mt">.*</h1>.*@\1@p'
But nothing is printed. Note that the input spans multiple lines.
Thanks,
Ted

Last edited by ted_chou12; 12-06-2011 at 09:00 PM.
 
12-07-2011, 02:38 PM   #2
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
What do you want to do with the lines when you find them?

You're after the hold buffer; check out range addressing and the h, H and g commands.
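
For illustration, here is a minimal sketch of how those pieces might fit together. It assumes GNU sed (the `\n` inside a bracket expression is a GNU extension), and the sample file and its contents are made up for the example. Inside the range, each line is collected in the hold buffer; on the closing line, g copies the whole block back into the pattern space, where it can be edited as one multi-line string.

```shell
# Hypothetical sample input, loosely modelled on the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<div class="mn">
wanted line 1
<p>wanted line 2</p>
<h1 class="mt">random content here</h1>
</html>
EOF

# Inside the range: the opening line overwrites the hold buffer (h) and
# is dropped (d); every later line is appended to the hold buffer (H).
# On the closing line, g copies the collected block into the pattern
# space, the two s commands strip its first and last line, and p prints
# what is left.
block=$(sed -n '\|<div class="mn">|,\|<h1 class="mt">|{
  \|<div class="mn">|{h;d;}
  H
  \|<h1 class="mt">|{g;s/^[^\n]*\n//;s/\n[^\n]*$//;p;}
}' "$sample")

printf '%s\n' "$block"
rm -f "$sample"
```

Once the block sits in the pattern space as a single string, any further s commands can work across its line breaks, which is not possible when sed sees one line at a time.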
 
12-07-2011, 03:40 PM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1959
Please use [code][/code] tags around all your code and data, to preserve formatting and to improve readability. DO NOT USE QUOTE TAGS, as they don't preserve whitespace.

The final comment I made in your last thread talked about sed's address fields (and provided some links as well). You have to use them in order to match a multi-line segment like this. Set the first address to match the first line you want and the second address to match the last line, and tell sed to print everything in that range.

Code:
sed -n '\|<div class="mn">|,\|<h1 class="mt">|p' file.html
One weakness of this, however, is that if the block occurs more than once in the file, it will print every occurrence. The only way I know of to stop it is to add a nested command that re-matches the last line and quits once that line has been printed.

Code:
sed -n '\|<div class="mn">|,\|<h1 class="mt">| { p; \|<h1 class="mt">|q; }' file.html
And I'm not at all sure what you'd need to do if you wanted to grab anything other than the first match.
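
One possible approach for picking a later match is to switch tools and let awk count the opening lines. This is only a sketch, not a sed solution: the sample file, the variable name n and the counter names are all assumptions made up for the example.

```shell
# Hypothetical sample with two blocks; n selects which one to print.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="mn">
first block
<h1 class="mt">one</h1>
filler
<div class="mn">
second block
<h1 class="mt">two</h1>
EOF

# Count each opening line; print only while inside block number n, and
# stop reading as soon as the closing line of that block has been printed.
second=$(awk -v n=2 '
  /<div class="mn">/          { count++; inside = 1 }
  inside && count == n        { print }
  inside && /<h1 class="mt">/ { if (count == n) exit; inside = 0 }
' "$sample")

printf '%s\n' "$second"
rm -f "$sample"
```

Like the sed range above, this prints the opening and closing lines along with the content between them.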

Overall, using sed for multi-line processing is a real headache; it's just not well-designed for such things. And more generally, using any of the standard pattern matching tools on html or xml is rather tricky. You may be better off using something that has a dedicated html parser. I'm sure you could find several useful perl modules, for example.
 
12-07-2011, 06:41 PM   #4
ted_chou12
Member
 
Registered: Aug 2010
Location: Zhongli, Taoyuan
Distribution: slackware, windows, debian (armv4l GNU/Linux)
Posts: 425
Blog Entries: 28

Original Poster
Rep: Reputation: 2
Thanks,
@jthill, I know the first and the last line for sure, but I want the content in between them.
@David, thanks, but I want only the lines between the first and the last line from my first post, not including those two lines themselves.
So what I want is:
Code:
<h1 class="mt">random content</h1>
The content I want is here.
<p>random number of lines and text</p>
<p>....</p>
...
...
instead of
Code:
<div class="mn">
<h1 class="mt">FIXED content</h1>
from here:
Code:
<other html content ahead>
<div class="mn">
<h1 class="mt">random content</h1>
The content I want is here.
<p>random number of lines and text</p>
<p>....</p>
...
...
<h1 class="mt">FIXED content</h1>
<other html content after>
In the code, FIXED content can be assumed to be the actual, literal heading text:
Code:
sed -n 's@<div class="mn">\(.*\)<h1 class="mt">FIXED content<\/h1>@\1@p' /tmp/signin.html.tmp
I have attached the sample HTML code (sorry, I had to change the extension to .log).
I want the content in between those two lines (non-inclusive).
Thanks,
Ted
Attached Files
File Type: log signin.html.log (20.3 KB, 11 views)

Last edited by ted_chou12; 12-07-2011 at 07:13 PM.
 
12-08-2011, 12:51 AM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1959
Excluding the address lines themselves requires another nested layer of commands with their own address ranges, to eliminate the lines you don't want from the first match.

http://sed.sourceforge.net/sedfaq4.html#s4.24

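Code:
sed -n '\|<div class="mn">|,\|<h1 class="mt">FIXED content</h1>| { \|<div class="mn">|b; \|<h1 class="mt">FIXED content</h1>|b; p; }' file.html
The b is the branch command; it jumps to a labelled point in the script. But since I didn't give it a label, it defaults to the end of the script, and so it effectively means "ignore this line and go on to the next". Note that the end address spells out the whole closing line: your sample has another <h1 class="mt"> line right after the opening <div>, and a shorter pattern would close the range there.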

And to tell it to quit after the first match, replace the second b with the q command, as before.


Code:
sed -n '\|<div class="mn">|,\|<h1 class="mt">FIXED content</h1>| { \|<div class="mn">|b; \|<h1 class="mt">FIXED content</h1>|q; p; }' file.html
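
To see the effect, here is a small worked run, assuming GNU sed. The sample file is a shortened copy of the layout from post #4, with FIXED content standing in for the real heading text; the variable names are made up for the example.

```shell
# Sample input following the layout given in post #4.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<other html content ahead>
<div class="mn">
<h1 class="mt">random content</h1>
The content I want is here.
<p>random number of lines and text</p>
<h1 class="mt">FIXED content</h1>
<other html content after>
EOF

# The end address is the full closing line, so the earlier
# <h1 class="mt"> line inside the block does not end the range.
# b skips the opening line, q stops at the closing line without
# printing it (because of -n), and p prints everything in between.
inner=$(sed -n '\|<div class="mn">|,\|<h1 class="mt">FIXED content</h1>| {
  \|<div class="mn">|b
  \|<h1 class="mt">FIXED content</h1>|q
  p
}' "$sample")

printf '%s\n' "$inner"
rm -f "$sample"
```

This prints the three lines between the markers and nothing else.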
Here's my list of sed references again.

http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt

I suggest you take the time to work through the first one, and learn all the single-line features at least. But the multi-line commands that use the hold buffer are harder to master; I've been working at it myself for a while now and I'm still not all that good at it. I usually switch to awk when I need that level of complexity.
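
As a rough idea of what the awk route might look like for this particular job, here is a flag-based sketch. The sample data and the variable name inside are assumptions for illustration, and FIXED content again stands in for the real heading text.

```shell
# Hypothetical sample input in the same layout as the thread's example.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<head stuff>
<div class="mn">
<h1 class="mt">random content</h1>
<p>body text</p>
<h1 class="mt">FIXED content</h1>
<tail stuff>
EOF

# The order of the rules matters: the closing line is tested first so it
# is never printed, and the flag is set last so the opening line is not
# printed either.
between=$(awk '
  inside && /<h1 class="mt">FIXED content<\/h1>/ { exit }        # stop at the closing line
  inside                                         { print }       # lines between the markers
  /<div class="mn">/                             { inside = 1 }  # start after the opening line
' "$sample")

printf '%s\n' "$between"
rm -f "$sample"
```

The whole state is a single flag, which tends to be easier to extend (counting matches, trimming lines, and so on) than the equivalent nested sed addresses.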
 
12-08-2011, 01:25 AM   #6
ted_chou12
Member
 
Registered: Aug 2010
Location: Zhongli, Taoyuan
Distribution: slackware, windows, debian (armv4l GNU/Linux)
Posts: 425
Blog Entries: 28

Original Poster
Rep: Reputation: 2
Thanks David, I probably need to take some time to read through the sed introduction.
 
  


All times are GMT -5.
