LinuxQuestions.org

sliddjur 11-01-2012 11:16 AM

sed - delete everything before pattern?
 
Hello. In my Linux class we're supposed to use sed to strip HTML tags from files. I'm kind of stuck: I've tried to read up on regex, but my brain is getting overloaded.

We are supposed to delete everything from the start of the file up to and including <body*>, and everything from </body> through the end of the file.

This is what I've come up with:
Code:

sed '1,/<body*/d ; /<\/body/,//d' index.html
This works as long as the <body> and </body> tags are on separate lines, and as long as <body> is not on the very first line.

Can someone help me and point me in the right direction?

danielbmartin 11-01-2012 01:24 PM

Help us to help you. Give us a sample input file. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

sliddjur 11-01-2012 06:58 PM

My command:
Code:

sed -e 's/\(<body[^\>]*.\)/\n\1\n/g ; s/\(<\/body>\)/\n\1\n/g ; 1,/<body*/d ; /<\/body/,//d' $1
I'm trying to catch everything from "<body" up to the next ">" and put a new line before and after that pattern. I do the same with "</body>": new line before and after.
After that I delete everything before and including the line matching "<body". Then I search for "</body" and delete that line and everything after it.
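
One likely reason the single command falls short: the newlines inserted by s/// stay inside the same pattern space, so the later 1,/<body*/d still sees one input line and deletes it whole. A minimal sketch of the same idea split into two sed passes follows. The function name strip_outside_body and the demo input are made up for illustration, and the newline in the replacement is written as backslash-newline so that it should not depend on GNU sed.

```shell
#!/bin/sh
# Hypothetical helper reading HTML on stdin.
# Pass 1 puts the <body ...> and </body> tags on lines of their own;
# pass 2 can then delete by whole lines.
strip_outside_body() {
    sed -e 's/<body[^>]*>/\
&\
/' -e 's/<\/body>/\
&\
/' |
    sed -e '1,/<body/d' -e '/<\/body>/,$d'
}

# Made-up demo input where both tags share a line with other markup.
printf '%s\n' '<html><head></head><body class="x"><p>one</p>' \
              '<p>two</p></body></html>' |
    strip_outside_body
# prints only the two <p> lines
```

Because pass 1 always puts a newline in front of <body, the tag can never be on line 1 of what pass 2 reads, which sidesteps the first-line problem as well.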

Sample file:
Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>
</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>


The output I'm looking for:
Code:

<div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div>


linosaurusroot 11-02-2012 04:47 AM

An empty regex (//) doesn't mean "end of file": sed treats it as a repeat of the last regex used. As an address, $ means the last line, so use that instead:
Code:

sed '1,/<body*/d ; /<\/body/,$d' index.html

See also http://stackoverflow.com/questions/5...e-html-why-not and consider what would happen if your HTML were all on one line.
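
To make both points concrete, here is a small sketch with made-up input. The first command shows /regex/,$ deleting from the first match through the last line; the second shows the all-on-one-line case, where line-based deletion leaves nothing behind.

```shell
#!/bin/sh
# Made-up three-line input: delete from the first line matching "b" to the end.
printf '%s\n' a b c | sed '/b/,$d'
# prints: a

# Hypothetical page squeezed onto one line: the only line falls inside
# the 1,/<body*/ range, so sed deletes it and prints nothing at all.
printf '%s\n' '<html><body>hi</body></html>' |
    sed '1,/<body*/d ; /<\/body/,$d'
```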

danielbmartin 11-02-2012 10:30 AM

Here is a piece of code to get you started.
Code:

# 1) sed to replace all line breaks with tilde (~).
# 2) sed to replace all "body" with backtick (`).
# 3) cut to keep text between first and second backtick.
# 4) cut to keep everything which follows the first >
# 5) sed to drop last two characters.
# 6) sed to replace all tildes with line breaks.
sed '{:q;N;s/\n/~/g;t q}' "$InFile"  \
|sed -e 's/body/\`/g'              \
|cut -d\` -f2                      \
|cut -d\> -f2-                    \
|sed 's/.\{2\}$//'                \
|sed 's/~/\n/g'

This works... but your task is to replace the two instances of cut with sed to accomplish the same thing.
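
One possible way to do that, sketched under assumptions: once the file is a single line, two substitutions can stand in for both cut steps and the two-character trim. In this sketch tr is swapped in for the join and split steps purely to keep it portable; body_only is a made-up name, and like the pipeline above it breaks if the input already contains a tilde.

```shell
#!/bin/sh
# Hypothetical helper reading HTML on stdin.
# 1) tr joins all lines into one, using ~ as a stand-in for line breaks.
# 2) sed drops everything through the <body ...> tag, then everything
#    from </body onward.
# 3) tr turns the remaining tildes back into line breaks.
body_only() {
    tr '\n' '~' |
    sed -e 's/.*<body[^>]*>//' -e 's/<\/body.*//' |
    tr '~' '\n'
}

# Made-up demo input.
printf '%s\n' '<head></head><body a="b"><p>x</p>' '<p>y</p></body></html>' |
    body_only
```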

Daniel B. Martin

amboxer21 11-02-2012 09:23 PM

Here's my heavy solution ->

Code:

sed -n '/<div style/,/<\/div>/{s/^<\/head><[a-z].*;">//;$s/.*/<\/div>/g;p}' filename.txt
It works by printing everything from the line with the opening div tag through the line with the closing </div> tag.
Code:

/<div style/,/<\/div>/
Which results in
Code:

</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>

Unfortunately, that prints the entire line on which the div regex matches, so I deleted everything from the </head> tag up to the <div tag.
Code:

s/^<\/head><[a-z].*;">//
I also replaced the entire bottom line with a single </div> tag. Then printed it.
Code:

$s/.*/<\/div>/g;p
So, this -> </div></body></html> becomes this -> </div>

I used curly braces after the from/to address range so I could edit the pattern space in the same command instead of making another call to sed.

David the H. 11-03-2012 02:53 PM

GNU sed offers the '0' address, which lets a range such as 0,/regex/ end on the very first line if the regex matches there. Also remember that you can use other delimiters if the default '/' character can appear in the expression.

Next, we have to consider this in at least two different steps. First we have to remove all lines that come before or after the ones with the body tags, and second we have to edit out the unwanted parts of the lines that do contain them. This is probably best done with multiple, nested expressions.

My attempt:
Code:

sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html
If you aren't using GNU sed, you'll probably have to include another expression to process the first line separately, in case it contains the "<body>" tag.
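
For sed implementations without the 0 address, one alternative sketch (a different technique from the command above, not a port of it) is to print only the range that runs from the <body line to the </body line and trim the two boundary lines inside it. The function name between_body and the demo input are invented for illustration, and the sketch assumes <body ...> and </body> sit on different lines.

```shell
#!/bin/sh
# Hypothetical helper reading HTML on stdin. The range opens on whichever
# line holds <body (line 1 included), so no 0 address is needed.
between_body() {
    sed -n '/<body/,/<\/body/ {
        s/.*<body[^>]*>//
        s/<\/body.*//
        p
    }'
}

# Made-up input with <body ...> on the very first line.
printf '%s\n' '<body class="x"><p>one</p>' '<p>two</p></body></html>' |
    between_body
# prints only the two <p> lines
```

Because sed only starts looking for the end of a range on the line after the one that opened it, a file with both tags on a single line would keep printing past </body>.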

Here are a few useful sed references:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt

I've found the sedfaq to be especially informative in difficult cases like this.

sliddjur 11-06-2012 03:17 PM

Quote:

Originally Posted by amboxer21 (Post 4821158)
Here's my heavy solution ->

Thanks, I was thinking of that solution. But there is one problem: the tag after <body*> might not always be <div>, and the tag after </div> might not always be </body>.

amboxer21 11-06-2012 11:39 PM

Quote:

Originally Posted by sliddjur (Post 4823659)
Thanks, I was thinking of that solution. But there is one problem: the tag after <body*> might not always be <div>, and the tag after </div> might not always be </body>.

Sorry, sliddjur. I was focusing on the problem as posted. Have you figured it out since, or do you still need a solution that fits the criteria in the post I quoted above?

If so, how about David's solution?
Code:


sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html

or does the same problem still arise?

sliddjur 11-07-2012 09:45 AM

Yes, I finally got this working. I'll post the solution here later when I get home, if anyone is interested.

amboxer21 11-07-2012 10:48 AM

Post what you came up with.

ctsgnb 11-10-2012 07:30 PM

A lazy one :

Code:

sed '/<div/,/<\/div/!d;s:.*<div:<div:;s:/div>.*:/div>:' yourfile
Using a colon as the delimiter instead of the usual s/ / / syntax saves some backslashes, since the / in /div> no longer needs escaping.
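
As a small illustration of the delimiter point, the two commands below perform the same substitution on a made-up line; only the delimiter and the escaping differ.

```shell
#!/bin/sh
# Default delimiter: every literal / in the pattern must be escaped.
printf '%s\n' '</div></body></html>' | sed 's/\/div>.*/\/div>/'
# Colon delimiter: the / characters can be written as-is.
printf '%s\n' '</div></body></html>' | sed 's:/div>.*:/div>:'
# each prints: </div>
```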

Code:

# cat t1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>
</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>
# sed '/<div/,/<\/div/!d;s:.*<div:<div:;s:/div>.*:/div>:' t1
<div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div>
#



All times are GMT -5.