sed

sliddjur · 11-01-2012, 11:16 AM

Hello. In my Linux class we're supposed to use sed to strip HTML tags from files. I'm kind of stuck; I've tried to read up on regex, but my brain is getting overloaded.

We are supposed to delete everything from the start of the file up to and including the <body*> tag, and everything from </body> through the end of the file.

This is what I've come up with:

Code:

sed '1,/<body*/d ; /<\/body/,//d' index.html

This works as long as the <body> and </body> tags are on separate lines, and as long as <body> is not on the very first line.

Can someone help me and point me in the right direction?

danielbmartin · 11-01-2012, 01:24 PM

Help us to help you. Give us a sample input file. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

sliddjur · 11-01-2012, 06:58 PM

My command:

Code:

 sed -e 's/\(<body[^\>]*.\)/\n\1\n/g ; s/\(<\/body>\)/\n\1\n/g ; 1,/<body*/d ; /<\/body/,//d' $1

I'm trying to match from "<body" to the next ">" and put a line break before and after that match, and to do the same for "</body>".
After that I delete everything up to and including the line matching "<body", then search for "</body" and delete that line and everything after it.
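
One thing that may be worth checking here: sed decides which lines an address such as 1,/<body/ covers as it reads the input, so a line break inserted by s/// does not start a new line for later commands in the same script. A minimal sketch that splits the work into two sed passes follows. It assumes GNU sed (for \n in the replacement), and the function name strip_outside_body and the sample input are made up for illustration.

```shell
# Hypothetical helper: print only what lies between <body ...> and </body>.
# Assumes GNU sed, which accepts \n in the replacement part of s///.
# Pass 1 puts the opening and closing body tags on lines of their own.
# Pass 2 then sees those tags as real input lines, so line ranges can drop them.
strip_outside_body() {
  sed 's/<body[^>]*>/\n&\n/; s/<\/body>/\n&\n/' "$1" |
    sed '1,/<body/d; /<\/body/,$d'
}

# Made-up sample input, shaped like the file in this thread.
tmp=$(mktemp)
printf '%s\n' '<html>' '<head><title>Test</title>' \
  '</head><body style="x"><div>' '<p>Test</p>' '</div></body></html>' > "$tmp"
strip_outside_body "$tmp"
rm -f "$tmp"
```

If a body tag already sits at the end or start of a line, pass 1 leaves an empty line next to it, which would need one more cleanup step.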

Sample file:

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>
</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>

The output I'm looking for:

Code:

<div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div>

linosaurusroot · 11-02-2012, 04:47 AM

An empty regex (//) reuses the most recently used regex; it does not mean "to the end of the file". As an address, $ means the last line.

Code:

sed '1,/<body*/d ; /<\/body/,$d' index.html

See also http://stackoverflow.com/questions/5...e-html-why-not and consider what would happen if your HTML was all one line.
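
To make the $ address concrete, here is a small sketch on made-up input. The first run assumes both body tags sit on lines of their own; the second shows what the one-line case looks like with the same script.

```shell
# $ as an address means "the last input line", so /re/,$ runs from the
# first line matching re through the end of the input.
printf '%s\n' '<html>' '<body>' '<p>kept</p>' '</body>' '</html>' |
  sed '1,/<body/d; /<\/body/,$d'

# If the whole document is a single line, that line falls inside the
# first range and is deleted, so nothing is printed at all.
printf '%s\n' '<html><body><p>lost</p></body></html>' |
  sed '1,/<body/d; /<\/body/,$d'
```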

danielbmartin · 11-02-2012, 10:30 AM

Here is a piece of code to get you started.

Code:

# 1) sed to replace all line breaks with tilde (~).
# 2) sed to replace all "body" with backtick (`).
# 3) cut to keep text between first and second backtick.
# 4) cut to keep everything which follows the first >
# 5) sed to drop last two characters.
# 6) sed to replace all tildes with line breaks.
sed '{:q;N;s/\n/~/g;t q}' $InFile  \
|sed -e 's/body/\`/g'              \
|cut -d\` -f2                      \
|cut -d\> -f2-                     \
|sed 's/.\{2\}$//'                 \
|sed 's/~/\n/g'

This works... but your task is to replace the two instances of cut with sed to accomplish the same thing.
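
One possible way to approach that exercise, offered as a sketch rather than the intended answer: keep the join and split steps, and let two s commands do the work of the cuts. It assumes GNU sed and, like the pipeline above, input that contains no tilde. The function name body_via_join is made up for illustration.

```shell
# Hypothetical helper; assumes GNU sed and input without any tilde.
# 1) Join all lines into one, using ~ in place of each line break.
# 2) Drop everything through the end of the opening <body ...> tag,
#    then everything from </body onward.
# 3) Turn the tildes back into line breaks.
body_via_join() {
  sed -e ':q' -e 'N' -e '$!bq' -e 's/\n/~/g' "$1" |
    sed -e 's/.*<body[^>]*>//' -e 's/<\/body.*//' |
    sed 's/~/\n/g'
}

# Made-up sample input, shaped like the file in this thread.
tmp=$(mktemp)
printf '%s\n' '<html>' '</head><body style="x"><div>' '<p>Test</p>' \
  '</div></body></html>' > "$tmp"
body_via_join "$tmp"
rm -f "$tmp"
```

Matching on the tags themselves, rather than on the bare word "body", means no leftover "</" has to be trimmed afterwards.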

Daniel B. Martin

amboxer21 · 11-02-2012, 09:23 PM

Here's my heavy solution ->

Code:

sed -n '/<div style/,/<\/div>/{s/^<\/head><[a-z].*;">//;$s/.*/<\/div>/g;p}' filename.txt

It works by selecting every line from the one with the opening div tag through the one with the closing div tag.

Code:

/<div style/,/<\/div>/

Which results in

Code:

</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>

Unfortunately, this prints the whole line on which the div regex matches, so I deleted everything from the </head> tag up to the <div tag.

Code:

s/^<\/head><[a-z].*;">//

I also replaced the entire bottom line with a single </div> tag. Then printed it.

Code:

$s/.*/<\/div>/g;p

So, this -> </div></body></html> becomes this -> </div>

I used curly braces after the address range so that I could run several commands on the pattern space in one sed call instead of piping into a second one.
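
The range-plus-braces pattern can be tried in isolation. The sketch below uses made-up BEGIN/END markers instead of HTML, purely for illustration.

```shell
# -n suppresses automatic printing; the braces group several commands
# that run only on lines inside the /BEGIN/,/END/ range.
printf '%s\n' 'header' 'BEGIN first' 'middle' 'END last' 'footer' |
  sed -n '/BEGIN/,/END/{s/^BEGIN //;s/^END //;p;}'
# prints: first, middle, last (one per line)
```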

David the H. · 11-03-2012, 02:53 PM

GNU sed offers the '0' line address (as in 0,/regex/), which lets a range end on the very first line if the second pattern matches there. Also remember that you can use other delimiters if the default '/' character appears in the expression.
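
A small sketch of both points, on made-up input. The second command needs GNU sed.

```shell
# With 1,/re/ the end regex is first tested on line 2, so a match on
# line 1 is missed and the range runs on to the next match.
printf '%s\n' '<body>' 'keep' '<body>' 'tail' | sed '1,/<body/d'

# With the GNU 0,/re/ form the regex is also tested on line 1, so the
# range ends there and only that line is deleted.
printf '%s\n' '<body>' 'keep' '<body>' 'tail' | sed '0,/<body/d'

# Any character can delimit an address regex if it is introduced with a
# backslash, which avoids escaping the slash in </body>.
printf '%s\n' 'a' '</body>' 'b' | sed '\|</body|d'
```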

Next, we have to consider this in at least two different steps. First we have to remove all lines that come before or after the ones with the body tags, and second we have to edit out the unwanted parts of the lines that do contain them. This is probably best done with multiple, nested expressions.

My attempt:

Code:

sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html

If you aren't using GNU sed, you'll probably have to include another expression to process the first line separately, in case it contains the "<body>" tag.
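
For readability, the one-liner above can also be laid out as a commented multi-line script. This is only a sketch of the same idea: the wrapper name extract_body and the sample input are made up, and the 0 address still requires GNU sed.

```shell
# Hypothetical wrapper around the command above; needs GNU sed.
extract_body() {
  sed '
    # Range 1: start of input through the first line containing <body.
    0,/<body/ {
      # Delete the lines before the tag.
      /<body/!d
      # On the tag line, strip everything through the end of the tag.
      s/.*<body[^>]*>//
    }
    # Range 2: first line containing </body through the last line.
    \|</body|,$ {
      # Delete the lines after the closing tag.
      \|</body|!d
      # On the tag line, strip the closing tag and whatever follows it.
      s|</body.*||
    }
  ' "$1"
}

# Made-up sample input, shaped like the file in this thread.
tmp=$(mktemp)
printf '%s\n' '<html>' '</head><body style="x"><div>' '<p>Test</p>' \
  '</div></body></html>' > "$tmp"
extract_body "$tmp"
rm -f "$tmp"
```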

Here are a few useful sed references:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt

I've found the sedfaq to be especially informative in difficult cases like this.

sliddjur · 11-06-2012, 03:17 PM

Quote:

Originally Posted by amboxer21

Here's my heavy solution ->

Code:

sed -n '/<div style/,/<\/div>/{s/^<\/head><[a-z].*;">//;$s/.*/<\/div>/g;p}' filename.txt

Thanks, I was thinking of that solution. But there is one problem: the tag after <body*> might not always be <div>, and the tag after </div> might not always be </body>.

amboxer21 · 11-06-2012, 11:39 PM

Quote:

Originally Posted by sliddjur

Thanks, I was thinking of that solution. But there is one problem: the tag after <body*> might not always be <div>, and the tag after </div> might not always be </body>.

Sorry sliddjur. I was focusing on the problem posted. Have you since figured your problem out or are you still in need of a solution that will work to fit the criteria you stated above in what I have quoted?

If so, how about David's solution?

Code:


sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html

or does the same problem still arise?

sliddjur · 11-07-2012, 09:45 AM

Yes, I finally got this working. I will post the solution here later when I get home, if anyone is interested.

amboxer21 · 11-07-2012, 10:48 AM

Post what you came up with.

ctsgnb · 11-10-2012, 07:30 PM

A lazy one :

Code:

sed '/<div/,/<\/div/!d;s:.*<div:<div:;s:/div>.*:/div>:' yourfile

Using a colon instead of the usual / as the s delimiter saves a few backslashes, since the / in /div> no longer needs escaping.

Code:

# cat t1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>
</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>
# sed '/<div/,/<\/div/!d;s:.*<div:<div:;s:/div>.*:/div>:' t1
<div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div>
#
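
Since sliddjur pointed out earlier in the thread that the tags next to the body tags will not always be div tags, the same lazy approach could be keyed on the body tags instead. A minimal sketch, with a made-up function name keep_body and made-up sample input:

```shell
# Hypothetical variant of the one-liner above, keyed on body, not div.
# Lines outside the <body ... </body range are deleted; the two s
# commands then trim the first and last lines of the range.
keep_body() {
  sed '/<body/,/<\/body/!d; s:.*<body[^>]*>::; s:</body.*::' "$1"
}

# Made-up sample input, shaped like the file in this thread.
tmp=$(mktemp)
printf '%s\n' '<html>' '</head><body style="x"><div>' '<p>Test</p>' \
  '</div></body></html>' > "$tmp"
keep_body "$tmp"
rm -f "$tmp"
```

One limitation to keep in mind: if <body> and </body> share a line and more lines follow, the range never sees its end pattern, so those trailing lines are printed as well.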