[SOLVED] How can I use sed to match this?

ted_chou12 · 12-05-2011, 02:32 AM

Hi, how can I use sed to match this text string:

Code:

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_%number%" onmouseover="showMenu(%text%)">

<a href="%variable%" target="_blank">%string%.torrent</a>

<em class="xg1">

The variable parts are represented by %number%, %text%, %variable% and %string%, and they can be any length. %number% is only a string of digits such as 21414, whereas %text% and %string% can be any string of text or numbers. The parts that I want to extract are %variable% and %string%, separately.

Code:

array=(echo "$string" | sed '<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_[0-9]" onmouseover="showMenu\((*.)\)">

<a href="$1" target="_blank">$2.torrent</a>

<em class="xg1">')

for $vars in $array ..
this part is ignored.

Would it be something like this?
Thanks,
Ted

jhwilliams · 12-05-2011, 02:47 AM

Just match the <a> tag, then.

Code:

cat foo.html | sed -r 's@<a href="(.*)" .*>(.*).torrent</a>@variable=\1, string=\2@'

Explanation: the sed command has search and replace parts, broken up by @ chars.

Search looks for
<a href="(.*)" .*>(.*).torrent</a>

Which saves the href target, and the contents of the a tag itself (by using the parentheses.)

Next, the replace statement references those matches in order as \1 and \2.
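
As a minimal sketch of that search-and-replace form, here is the same kind of command applied to a single line. The href value and file name are invented for illustration, and the dot before "torrent" is escaped so it only matches a literal dot. GNU sed is assumed for the -r option.

```shell
#!/bin/bash
# Hypothetical sample line; the href and file name are invented for illustration.
line='<a href="download.php?id=1" target="_blank">example.torrent</a>'

# s@SEARCH@REPLACE@ : "@" is the delimiter, so the "/" in "</a>" needs no escaping.
# Group 1 captures the href target, group 2 the link text before ".torrent".
result=$(printf '%s\n' "$line" |
  sed -r 's@<a href="(.*)" .*>(.*)\.torrent</a>@variable=\1, string=\2@')

echo "$result"
```

Because the pattern covers the whole input line here, the line is replaced entirely by the "variable=..., string=..." text.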

ted_chou12 · 12-05-2011, 03:05 AM

Hi, thanks, can you explain how I could extract $1 and $2?
I tried

Code:

echo $(cat foo.html | sed -r 's@<a href="(.*)" .*>(.*).torrent</a>@variable=\1, string=\2@')

It seems to output the whole page a lot of times.
Thanks,
Ted

jhwilliams · 12-05-2011, 03:15 AM

Oh right, right. Try grepping for the <a href line first.

ted_chou12 · 12-05-2011, 03:40 AM

Hi, would that be this:

Code:

echo $(cat "aa.html" | grep "^<a href=" |sed -r 's@<a href="(.*)" .*>(.*).torrent</a>@variable=\1, string=\2@')

Thanks,
Ted

jhwilliams · 12-05-2011, 03:45 AM

That's close, but your grep match isn't what you want. As is, you're looking for lines that start with <a href=. There will probably be other stuff before you hit the <a> tag. So maybe just remove the ^, or account for whatever you expect to find before the tag.
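
A small sketch of the difference, assuming a made-up two-line input where one link starts the line and the other is indented:

```shell
#!/bin/bash
# Hypothetical two-line input: one link at the start of a line, one indented.
sample=$(printf '%s\n' \
  '<a href="a.php" target="_blank">one.torrent</a>' \
  '  <a href="b.php" target="_blank">two.torrent</a>')

# Anchored: only lines that begin with the tag are counted.
anchored=$(printf '%s\n' "$sample" | grep -c '^<a href=')
# Unanchored: the tag may appear anywhere on the line.
unanchored=$(printf '%s\n' "$sample" | grep -c '<a href=')

echo "anchored=$anchored unanchored=$unanchored"
```

The anchored pattern misses the indented link, which is the case described above.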

ted_chou12 · 12-05-2011, 03:56 AM

Hi, the output is

Quote:

XXXTOP Part of HTML
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
variable
variable
variable
variable,text
XXXXXXXXXXXXXXBottom part of HTML

So it isn't matching all of them, or at least it is only partially extracting. I am guessing this is because I have multiple ones to extract within one page. How would I go about doing this?
Thanks,
Ted

ted_chou12 · 12-05-2011, 04:48 AM

The page I wish to extract is:

Code:

<div id="wp" class="wp"><script type="text/javascript">var fid = parseInt('108'), tid = parseInt('147178');</script>

<script src="static/js/forum_viewthread.js?6vP" type="text/javascript"></script>
<script type="text/javascript">zoomstatus = parseInt(1);var imagemaxwidth = '750';var aimgcount = new Array();</script>

<style id="diy_style" type="text/css"></style>
<!--[diy=diynavtop]--><div id="diynavtop" class="area"></div><!--[/diy]-->
<div id="pt" class="bm cl">
<div class="z">
<a href="./" class="nvhm" title=""></a> <em>&rsaquo;</em> <a href="forum.php"></a> <em>&rsaquo;</em> <a href="forum.php?gid=107"></a> <em>&rsaquo;</em> <a href="forum.php?mod=forumdisplay&fid=108&page=1"></a> <em>&rsaquo;</em> <a href="forum.php?mod=viewthread&amp;tid=147178">[5/12] [...</a>
</div>
</div>

<style id="diy_style" type="text/css"></style>
<div class="wp">
<!--[diy=diy1]--><div id="diy1" class="area"></div><!--[/diy]-->
</div>

<div id="ct" class="wp cl">
<div id="pgt" class="pgs mbm cl ">
<div class="pgt"><div class="pg"><strong>1</strong><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=2">2</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=3">3</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=4">4</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=5">5</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=6">6</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=7">7</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=8">8</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=9">9</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=10">10</a><a href="forum.php?mod=viewthread&tid=147178&amp;extra=page%3D1&amp;page=2" class="nxt">下一頁</a></div></div>
<span class="y pgb" id="visitedforums" onmouseover="$('visitedforums').id = 'visitedforumstmp';this.id = 'visitedforums';showMenu({'ctrlid':this.id,'pos':'34'})"><a href="forum.php?mod=forumdisplay&fid=108&page=1"></a></span>
<a id="newspecial" onmouseover="$('newspecial').id = 'newspecialtmp';this.id = 'newspecial';showMenu({'ctrlid':this.id})" onclick="showWindow('newthread', 'forum.php?mod=post&action=newthread&fid=108')" href="javascript:;" title="發新帖"><img src="static/image/common/pn_post.png" alt="" /></a><a id="post_reply" onclick="showWindow('reply', 'forum.php?mod=post&action=reply&fid=108&tid=147178')" href="javascript:;" title=""><img src="static/image/common/pn_reply.png" alt="" /></a>
</div>



<div id="postlist" class="pl bm">
<table cellspacing="0" cellpadding="0">
<tr>
<td class="pls ptm pbm">
<div class="hm">
<span class="xg1"></span> <span class="xi1">11183</span><span class="pipe">|</span><span class="xg1">:</span> <span class="xi1">188</span>
</div>
<tr>
<td class="pls" rowspan="2">
 <div class="pi">
<div class="authi"><a href="home.php?mod=space&amp;uid=1" target="_blank" class="xw1">pieayu</a>

</div>
</div>
<div class="p_pop blk bui" id="userinfo2721324" style="display: none; margin-top: -11px;">
<div class="m z">
<div id="userinfo2721324_ma"></div>
</div>
<div class="i y">
<div>
<strong><a href="home.php?mod=space&amp;uid=1" target="_blank" class="xi2">pieayu</a></strong>
</p>
<ul class="xl xl2 o cl">
<li class="buddy"><a href="home.php?mod=spacecp&amp;ac=friend&amp;op=add&amp;uid=1&amp;handlekey=addfriendhk_1" id="a_friend_li_2721324" onclick="showWindow(this.id, this.href, 'get', 1, {'ctrlid':this.id,'pos':'00'});" title=></a></li>
<li class="poke2"><a href="home.php?mod=spacecp&amp;ac=poke&amp;op=send&amp;uid=1" id="a_poke_li_2721324" onclick="showWindow(this.id, this.href, 'get', 0);" title="" class="xi2"></a></li>
<li class="pm2"><a href="home.php?mod=spacecp&amp;ac=pm&amp;op=showmsg&amp;handlekey=showmsg_1&amp;touid=1&amp;pmid=0&amp;daterange=2&amp;pid=2721324&amp;tid=147178" onclick="showWindow('sendpm', this.href);" title="" class="xi2"></a></li>
</ul>
</td>
<td class="plc">
<div class="pi">
<div id="fj" class="y">
<label class="z"></label>
<input type="text" class="px p_fre z" size="2" onkeyup="$('fj_btn').href='forum.php?mod=redirect&ptid=147178&authorid=0&postno='+this.value" onkeydown="if(event.keyCode==13) {window.location=$('fj_btn').href;return false;}" title="" id="postnum2721324" onclick="setCopy(this.href, '');return false;"><em>1</em><sup>#</sup></a>
</strong>
<div class="pti">
<div class="pdbt">
</div>
<div class="authi">
<img class="authicn vm" id="authicon2721324" src="static/image/common/ayu_icon.gif" />
<em id="authorposton2721324"> 2011-10-9 17:03:38</em>
<span class="pipe">|</span><a href="forum.php?mod=viewthread&amp;tid=147178&amp;page=1&amp;authorid=1" rel="nofollow"></a>
<span class="pipe">|</span><a href="forum.php?mod=viewthread&amp;tid=147178&amp;extra=page%3D1&amp;ordertype=1"></a>
</div>
</div>
</div><div class="pct"><style type="text/css">.pcb{margin-right:0}</style><div class="pcb">
<div class="t_fsz">
<table cellspacing="0" cellpadding="0"><tr><td class="t_f" id="postmessage_2721324">
<img src="http://img165.poco.cn/mypoco/myphoto/20111010/04/5536770720111010042551037.jpg" onload="thumbImg(this)" alt="" /><br />
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_65391" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjUzOTF8OGM1YTM3MGF8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][OAD][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(13.74 KB, 下載次數: 839)
</em>
</span>
<div class="tip tip_4" id="attach_65391_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-10-9 17:05 上傳</div>
下載次數: 839

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_65390" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjUzOTB8Njk3ZTZhYjd8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][01][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(17.22 KB, 下載次數: 1339)
</em>
</span>
<div class="tip tip_4" id="attach_65390_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-10-9 17:03 上傳</div>
下載次數: 1339

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_65689" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjU2ODl8NGUwNzEzZTN8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][02][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(19.97 KB, 下載次數: 1198)
</em>
</span>
<div class="tip tip_4" id="attach_65689_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-10-16 17:03 上傳</div>
下載次數: 1198

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_65972" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjU5NzJ8ZmNlNDY3NzV8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][03][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(17.76 KB, 下載次數: 1086)
</em>
</span>
<div class="tip tip_4" id="attach_65972_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-10-23 17:04 上傳</div>
下載次數: 1086

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_66180" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjYxODB8YTIwNzBiYjN8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][04][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(17.55 KB, 下載次數: 1103)
</em>
</span>
<div class="tip tip_4" id="attach_66180_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-10-30 20:55 上傳</div>
下載次數: 1103

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_66441" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjY0NDF8ZGI0OTIwYjV8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][05][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(20.18 KB, 下載次數: 1029)
</em>
</span>
<div class="tip tip_4" id="attach_66441_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-11-6 17:05 上傳</div>
下載次數: 1029

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_66721" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjY3MjF8YmQ1OTYzZTJ8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][06][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(17.56 KB, 下載次數: 996)
</em>
</span>
<div class="tip tip_4" id="attach_66721_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-11-13 17:08 上傳</div>
下載次數: 996

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_66986" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjY5ODZ8M2NjYzQ0OTZ8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][07][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(17.66 KB, 下載次數: 995)
</em>
</span>
<div class="tip tip_4" id="attach_66986_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-11-20 17:07 上傳</div>
下載次數: 995

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_67283" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=NjcyODN8MDQyNmI0NGR8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][08][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(19.18 KB, 下載次數: 862)
</em>
</span>
<div class="tip tip_4" id="attach_67283_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-11-27 17:05 上傳</div>
下載次數: 862

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />

<ignore_js_op>

<img src="static/image/filetype/torrent.gif" border="0" class="vm" alt="" />
<span style="white-space: nowrap" id="attach_67658" onmouseover="showMenu({'ctrlid':this.id,'pos':'12'})">

<a href="forum.php?mod=attachment&amp;aid=Njc2NTh8YTkyY2EwMjd8MTMyMzA4MTY4NHw5MDM3MnwxNDcxNzg%3D" target="_blank">[DMG][Mirai nikki][09][848x480][BIG5].rmvb.torrent</a>

<em class="xg1">(17.46 KB, 下載次數: 306)
</em>
</span>
<div class="tip tip_4" id="attach_67658_menu" style="position: absolute; display: none">
<div class="tip_c xs0">
<div class="y">2011-12-4 17:09 上傳</div>
下載次數: 306

</div>
<div class="tip_horn"></div>
</div>
</ignore_js_op>
<br />
<br />
<br />
<font size="4"><a href="forum.php?mod=viewthread&amp;tid=144252" target="_blank">http://pieayu.com/forum.php?mod=viewthread&amp;tid=144252</a></font></td></tr></table>
</div>
<div id="comment_2721324" class="cm">
</div>
<div id="post_rate_div_2721324"></div>
</div></div>

</td></tr>
<tr><td class="plc plm">
<div class="modact"><a href="forum.php?mod=misc&amp;action=viewthreadmod&amp;tid=147178" title="帖子模式" onclick="showWindow('viewthreadmod', .........................................a lot more

Sorry for the Mandarin within.
Thanks, Ted

colucix · 12-05-2011, 06:26 AM

I would keep it simple and use three different sed commands to retrieve the three different items. You might store the result into arrays, then loop over their content, e.g.

Code:

#!/bin/bash
# Split command-substitution output on newlines only, so names with spaces stay whole.
OLD_IFS=${IFS}
IFS=$'\n'
num=( $(sed -rn '/id=.*onmouseover/s/.*attach_([0-9]+).*/\1/p' file) )
text=( $(sed -rn '/onmouseover=/s/.*onmouseover="showMenu\((.*)\).*/\1/p' file) )
string=( $(sed -rn '/target=/s/.*>(.*)\.torrent.*/\1/p' file) )
for i in "${!num[@]}"
do
  # Quote the expansions so the brackets in the file names are not treated as globs.
  echo "${num[$i]}"
  echo "${text[$i]}"
  echo "${string[$i]}"
done
IFS=${OLD_IFS}

The IFS variable is set to a newline and restored afterwards because the results of the sed commands contain blank spaces (in particular, the torrent file names contain spaces). Hope this helps.
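
To see the effect of IFS in isolation, here is a minimal sketch. The two "file names" are invented stand-ins for sed output, not taken from the page above.

```shell
#!/bin/bash
# Hypothetical sed output: two file names, each containing a space.
output=$(printf '%s\n' 'first file' 'second file')

# Default IFS (space, tab, newline): every word becomes its own element.
by_word=( $output )

# Newline-only IFS: every line becomes one element.
OLD_IFS=${IFS}
IFS=$'\n'
by_line=( $output )
IFS=${OLD_IFS}

echo "default IFS: ${#by_word[@]} elements"
echo "newline IFS: ${#by_line[@]} elements"
```

With the default IFS the two names are split into four words, while the newline-only IFS keeps one array element per line.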

jschiwal · 12-05-2011, 07:19 AM

You can use -e or ";" to separate sed commands. You don't want to run sed 3 times.

This seems to work:
sed -n '/href=.*target="_blank"/s|.*<a href="\(.*\)" target="_blank">\(.*\)\.torrent<\/a>|variable=\1 string=\2|p' aa.html

The first part "/.../" matches patterns for the rest of sed to work with.
The -n option causes sed to not output lines unless you use the "p" command. This allows us to only output lines that match.

For much more complicated sed programs, create a file with the sed commands and use "sed -f sedprogram file"
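
A rough sketch of those points on a made-up three-line page (the file contents and the temporary file names are assumptions for illustration only):

```shell
#!/bin/bash
# Hypothetical three-line page; only the middle line holds a link.
page=$(mktemp)
printf '%s\n' '<div>' \
  '<a href="a.php" target="_blank">one.torrent</a>' \
  '</div>' > "$page"

# Without -n, sed prints every line, matched or not.
all=$(sed 's|one|ONE|' "$page" | wc -l)

# With -n and the p flag, only lines where the substitution succeeded are printed.
matched=$(sed -n 's|.*<a href="\(.*\)" target="_blank">\(.*\)\.torrent</a>|variable=\1 string=\2|p' "$page")

# The same command kept in a script file and run with -f.
prog=$(mktemp)
printf '%s\n' 's|.*<a href="\(.*\)" target="_blank">\(.*\)\.torrent</a>|variable=\1 string=\2|p' > "$prog"
from_file=$(sed -n -f "$prog" "$page")

echo "$matched"
rm -f "$page" "$prog"
```

The -f form gives the same result as the inline command; it just moves the program out of the shell quoting.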

ted_chou12 · 12-05-2011, 09:20 AM

Thanks,
@jschiwal that gave perfect outcome.
@colucix, thanks, it did give me the perfect %string%, %text% and %number%, but I was looking for %variable% and %string%. I tried to modify the code slightly to make it work, but I am quite a rookie XD. Here is what I tried:

Code:

OLD_IFS=${IFS}
IFS=$'\n'
num=( $(sed -rn '/id=.*onmouseover/s/.*attach_([0-9]+).*/\1/p' aa.html) )
var=( $(sed -rn '/a\shref="(.*)"/\1/p' aa.html) )
string=( $(sed -rn '/target=/s/.*>(.*).torrent.*/\1/p' aa.html) )
for i in $(seq 0 $((${#num[@]}-1)))
do
  echo ${num[$i]}
  echo ${var[$i]}
  echo ${string[$i]}
done
IFS=${OLD_IFS}

would you guide me in the correct direction for this to work too? (I wish to learn how to use sed better.) BTW, I learnt a new use of IFS from your code.
Thanks,
Ted

colucix · 12-05-2011, 10:20 AM

Slightly modified:

Code:

#!/bin/bash
OLD_IFS=${IFS}
IFS=$'\n'
variable=( $(sed -rn '/target=/s/.*href="([^"]+)".*>.*\.torrent.*/\1/p' file) )
string=( $(sed -rn '/target=/s/.*>(.*)\.torrent.*/\1/p' file) )
for i in "${!variable[@]}"
do
  # Quoted so that spaces and brackets in the values are printed unchanged.
  echo "${variable[$i]}"
  echo "${string[$i]}"
done
IFS=${OLD_IFS}

@jschiwal, I agree about limiting the number of sed commands to speed up the script. However, I cannot think of a method to assign the results to separate variables, as requested by the OP, unless we use a while read loop like this (without using shell arrays):

Code:

while read -r variable string
do
  echo "$variable"
  echo "$string"
done < <(sed -rn '/target=/s/.*href="([^"]+)".*>(.*)\.torrent.*/\1 \2/p' file)

ted_chou12 · 12-05-2011, 11:07 AM

Thanks, that was perfect!

jschiwal · 12-09-2011, 03:10 AM

My main point is that each sed command starts at the beginning of the file. You would make three reports instead of one, and if one tag is missing or a file is modified between sed commands, the lists could become misaligned. I agree that arrays are needed to hold all the values. The information extracted is incomplete because there is no meaningful field, or index (hash) associated with the lines.
You could gather statistics-type info from it, but I think one multi-field report would be more flexible than three lists.
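
One way this could look, as a sketch only: a single sed pass writes one tab-separated record per link, and a while read loop fills two arrays from the same record, so the lists cannot drift apart. The sample page, hrefs and names are invented, and the \t in the replacement assumes GNU sed.

```shell
#!/bin/bash
# Hypothetical page with two attachment links; hrefs and names are invented.
page=$(mktemp)
printf '%s\n' \
  '<a href="get.php?aid=1" target="_blank">first file.torrent</a>' \
  '<em class="xg1">(1 KB)' \
  '<a href="get.php?aid=2" target="_blank">second file.torrent</a>' > "$page"

variable=()
string=()
# One sed pass emits "href<TAB>name" per link, so both fields come from the same line.
while IFS=$'\t' read -r href name
do
  variable+=( "$href" )
  string+=( "$name" )
done < <(sed -rn '/target=/s/.*href="([^"]+)".*>(.*)\.torrent.*/\1\t\2/p' "$page")
rm -f "$page"

for i in "${!variable[@]}"
do
  printf '%s -> %s\n' "${variable[$i]}" "${string[$i]}"
done
```

The tab is used as the field separator because the file names contain spaces; a line that lacks either field simply produces no record instead of shifting one list against the other.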