Old 02-24-2012, 06:53 PM   #1
ted_chou12
Member
 
Registered: Aug 2010
Location: Zhongli, Taoyuan
Distribution: slackware, windows, debian (armv4l GNU/Linux)
Posts: 431
Blog Entries: 32

Rep: Reputation: 3
sed match and ignore new lines


Hi, I am trying to make this sed match ignore newlines:
Code:
torrentlink=($(sed -rn '/target=/s/.*href="([^"]+)".*>.*<\/a.*/\1/p' "$html"))
It matches content like this perfectly:
Code:
<a style="color: white;" href="http://www.p2pnow.net" target="_blank">p2pnow.net</a>
p2pnow.net
But it fails on HTML that has newlines inside the link:
Code:
<a href="/topics/view/168577_Dymy_10_Fairy_Tail_23_BIG5_1024X576_RMVB.html"  target="_blank" >
				【Dy】【10】【<span class="keyword">Fairy</span> <span class="keyword">Tail</span>_】【】【BIG5】【1024X576】【<span class="keyword">RMVB</span>】</a>
This does not match anything, but I would like it to return:

【Dy】【10】【<span class="keyword">Fairy</span> <span class="keyword">Tail</span>_】【】【BIG5】【1024X576】【<span class="keyword">RMVB</span>】
Thanks,
Ted
 
Old 02-24-2012, 08:36 PM   #2
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi.

Sed is not good at multiline matches. You could remove all newlines from the input first, or try the following:
Code:
sed -rn '/target=/{:a; /<\/a>/!{N;ba}; s/.*href="([^"]+)".*>.*<\/a.*/\1/p}'
The added commands, ':a; /<\/a>/!{N;ba};', should append lines to the pattern space until '</a>' is found.
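As a rough illustration of that loop, here is a minimal sketch that assumes GNU sed. The two-line sample file, its href value and its link text are invented for the example; only the sed command itself comes from the post above.

```shell
# Hypothetical two-line sample; the href and link text are invented for illustration.
sample=$(mktemp)
printf '%s\n' \
  '<a href="/topics/view/example.html" target="_blank" >' \
  '    link text</a>' > "$sample"

# Assumes GNU sed. Inside the /target=/ block:
#   :a           defines a label named "a"
#   /<\/a>/!     applies when the pattern space does not yet contain </a>
#   {N;ba}       appends the next input line, then branches back to "a"
# The final s///p then sees both lines at once and prints the href value.
href=$(sed -rn '/target=/{:a; /<\/a>/!{N;ba}; s/.*href="([^"]+)".*>.*<\/a.*/\1/p}' "$sample")
printf '%s\n' "$href"

rm -f "$sample"
```

Without the loop, the substitution only ever sees the first line, which has no '</a>', so nothing is printed.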

Hope that helps.
 
Old 02-24-2012, 11:35 PM   #3
ted_chou12
Member
 
Registered: Aug 2010
Location: Zhongli, Taoyuan
Distribution: slackware, windows, debian (armv4l GNU/Linux)
Posts: 431

Original Poster
Blog Entries: 32

Rep: Reputation: 3
Hi, thanks for your help. The command seems to return the href="value" rather than the content between <a>value</a>. The latter is the one I am looking for. I have tried using tr to remove the newlines before passing the input through the command, but it seems to slow everything down.
Thanks,
Ted
 
Old 02-25-2012, 12:22 AM   #4
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi.

Code:
$ cat infile.txt 
<a href="/topics/view/168577_Dymy_10_Fairy_Tail_23_BIG5_1024X576_RMVB.html"  target="_blank" >
【Dy】【10】【<span class="keyword">Fairy</span> <span class="keyword">Tail</span>_】【】【BIG5】【1024X576】【<span class="keyword">RMVB</span>】</a>
$ sed -rn '/target=/{:a; /<\/a>/!{N;ba}; s/\n//g; s/.*href="[^"]+"[^>]*>(.*)<\/a.*/\1/p}' infile.txt 
【Dy】【10】【<span class="keyword">Fairy</span> <span class="keyword">Tail</span>_】【】【BIG5】【1024X576】【<span class="keyword">RMVB</span>】
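The modifications are the new 's/\n//g' command, which removes the newlines, and the capture group, which now wraps the text between the opening tag and '</a>' instead of the href value.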
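The same one-liner can be spread over several lines with comments, which may make each step easier to follow. This is only a sketch under the assumption of GNU sed; the sample file and its contents are made up for the example.

```shell
# Hypothetical sample; the href and link text are invented for illustration.
sample=$(mktemp)
printf '%s\n' \
  '<a href="/topics/view/example.html"  target="_blank" >' \
  '<span class="keyword">Fairy</span> Tail</a>' > "$sample"

# Assumes GNU sed (-r for extended regex, comments inside the script).
text=$(sed -rn '
/target=/ {
    :a
    # Keep appending lines until the closing tag is in the pattern space.
    /<\/a>/! {
        N
        ba
    }
    # Drop the embedded newlines so the link becomes one logical line.
    s/\n//g
    # Skip past the opening tag and capture what sits between it and </a>.
    s/.*href="[^"]+"[^>]*>(.*)<\/a.*/\1/p
}' "$sample")
printf '%s\n' "$text"

rm -f "$sample"
```

The '[^>]*>' part is what stops the match at the end of the opening <a ...> tag, so the group captures the inner markup, including any <span> elements.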

Last edited by firstfire; 02-25-2012 at 12:24 AM.
 
Old 02-25-2012, 08:37 AM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Rep: Reputation: 2081
I would suggest not using regex to parse html. Here is a possible solution with xmlstarlet:
Code:
xml fo --html "$html" | xml sel -t -m '//a/node()' -c . 
# -T or --text will drop the <span>s
xml fo --html "$html" | xml sel -T -t -m '//a/node()' -c .
 
Old 02-26-2012, 11:33 AM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
As ntubski's link makes clear, the problem with html and xml is that their open, nested structures are inherently difficult to process with regular expressions and tools that are designed to work line-by-line such as awk and sed. For complex jobs you should really be using something with a dedicated parser.

That said, I believe it's still ok to use sed/awk/whatever for simple parsing on known, cleanly-structured html. But to do so you should start by using a tool like htmltidy to clean up and regularize the formatting, so that it can be more easily handled by these tools. Start by getting everything on one line before you try any extracting.
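A minimal sketch of the "one line first, then extract" idea, assuming GNU tr and GNU sed. Here tr is only a stand-in for a real cleanup tool, and the sample file, hrefs and link texts are invented for the example. It is suitable only for simple, known markup, not as a general HTML parser.

```shell
# Hypothetical sample with one multi-line link and one single-line link.
sample=$(mktemp)
printf '%s\n' \
  '<p><a href="/one.html" target="_blank" >' \
  '    First link</a>' \
  '<a href="/two.html" target="_blank">Second link</a></p>' > "$sample"

# Step 1: collapse every run of whitespace (newlines included) to one space.
# Step 2: start a new line after each </a>, so each line holds at most one link
#         (a newline via \n in the replacement is a GNU sed feature).
# Step 3: a plain single-line substitution can now pull out the link text.
links=$(tr -s '[:space:]' ' ' < "$sample" |
  sed 's|</a>|&\n|g' |
  sed -rn 's/.*<a [^>]*target=[^>]*>[[:space:]]*(.*)<\/a>.*/\1/p')
printf '%s\n' "$links"

rm -f "$sample"
```

Splitting after each '</a>' matters because a greedy '.*' would otherwise run from the first link to the last one on the flattened line.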
 