LinuxQuestions.org
Old 11-01-2012, 11:16 AM   #1
sliddjur
LQ Newbie
 
Registered: Nov 2012
Posts: 7

Rep: Reputation: Disabled
sed - delete everything before pattern?


Hello. In my Linux class we're supposed to use sed to strip HTML tags from files. I'm kind of stuck; I've tried to read up on regex, but my brain is getting overloaded.

We are supposed to delete everything from the start of the file up to and including <body*>, and everything from </body> through the end of the file.

This is what I've come up with:
Code:
sed '1,/<body*/d ; /<\/body/,//d' index.html
This works as long as the <body> and </body> tags are on separate lines, and as long as <body> is not on the very first line.

Can someone help me and point me in the right direction?
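
A note on the pattern itself: in a regular expression, `*` is not a shell-style wildcard; it repeats the preceding character, so `<body*` means `<bod` followed by zero or more `y`. A minimal sketch of the difference, using made-up input lines and `<body[^>]*>` as one possible stricter pattern (the same bracket expression appears in later posts):

```shell
# Hypothetical test lines; only the 2nd and 3rd are real opening body tags.
input=$(printf '%s\n' '<bod>' '<body>' '<body class="x">' '<bodyyy')

# '<body*' = '<bod' plus zero or more 'y', so every line matches.
loose=$(printf '%s\n' "$input" | sed -n '/<body*/p')

# '<body[^>]*>' = '<body', any run of non-'>' characters, then '>'.
strict=$(printf '%s\n' "$input" | sed -n '/<body[^>]*>/p')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The stricter pattern is still only a heuristic; it would also accept something like `<bodyfoo>`.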
 
Old 11-01-2012, 01:24 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Help us to help you. Give us a sample input file. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin
 
Old 11-01-2012, 06:58 PM   #3
sliddjur
LQ Newbie
 
Registered: Nov 2012
Posts: 7

Original Poster
Rep: Reputation: Disabled
My command:
Code:
 sed -e 's/\(<body[^\>]*.\)/\n\1\n/g ; s/\(<\/body>\)/\n\1\n/g ; 1,/<body*/d ; /<\/body/,//d' $1
I'm trying to match from "<body" up to the next ">" and insert a newline before and after that match, and to do the same for "</body>".
After that I delete everything up to and including the line matching "<body", then search for "</body" and delete that line and everything after it.

Sample file:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>
</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>

The output I'm looking for:
Code:
<div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div>
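
One likely reason the command above misbehaves: addresses and `d` act on the whole pattern space, so a newline inserted by `s` does not give `d` separate lines to work on within the same run. A minimal sketch of one workaround, assuming GNU sed (for `\n` in the replacement and for the `0,/re/` address) and a made-up two-line input: split the tags onto their own lines in a first pass, then delete by line in a second pass.

```shell
# Hypothetical input: both body tags share lines with other markup.
html='<html><head></head><body class="x"><p>Hi</p>
<p>More</p></body></html>'

# Pass 1 breaks the line after the opening tag and before the closing tag.
# Pass 2 then deletes whole lines; 0,/re/ is GNU-only and lets the range
# end on line 1 when the opening tag is already there.
body=$(printf '%s\n' "$html" |
  sed 's/<body[^>]*>/&\n/; s/<\/body>/\n&/' |
  sed '0,/<body/d; /<\/body/,$d')

printf '%s\n' "$body"
```

Running two sed processes costs little here and keeps each step easy to reason about.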
 
Old 11-02-2012, 04:47 AM   #4
linosaurusroot
Member
 
Registered: Oct 2012
Distribution: OpenSuSE,RHEL,Fedora,OpenBSD
Posts: 982
Blog Entries: 2

Rep: Reputation: 244
// is an empty regex, which sed treats as "reuse the last regex used", not as "end of file"; in an address, $ means the last line.
sed '1,/<body*/d ; /<\/body/,$d' index.html

See also http://stackoverflow.com/questions/5...e-html-why-not and consider what would happen if your HTML was all one line.
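
To make the two points above concrete, here is a small sketch with made-up input. The first command shows `$` as the "last line" address; the second shows what the line-based approach does when the whole document is a single line.

```shell
# '$' as the end of a range: delete from the first line matching /c/ to the end.
kept=$(printf '%s\n' a b c d e | sed '/c/,$d')
printf '%s\n' "$kept"

# Hypothetical one-line document: line 1 falls inside 1,/<body/ and is
# deleted whole, so nothing is left to print.
one_line='<html><body><p>Hi</p></body></html>'
result=$(printf '%s\n' "$one_line" | sed '1,/<body/d; /<\/body/,$d')
printf 'result is %s characters long\n' "${#result}"
```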
 
Old 11-02-2012, 10:30 AM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Here is a piece of code to get you started.
Code:
# 1) sed to replace all line breaks with tilde (~).
# 2) sed to replace all "body" with backtick (`).
# 3) cut to keep text between first and second backtick.
# 4) cut to keep everything which follows the first >
# 5) sed to drop last two characters.
# 6) sed to replace all tildes with line breaks.
sed '{:q;N;s/\n/~/g;t q}' "$InFile" \
|sed -e 's/body/\`/g'              \
|cut -d\` -f2                      \
|cut -d\> -f2-                     \
|sed 's/.\{2\}$//'                 \
|sed 's/~/\n/g'
This works... but your task is to replace the two instances of cut with sed to accomplish the same thing.
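
For comparison, a sed-only variation on the same idea: rather than swapping newlines for a placeholder character, the whole file can be collected into the pattern space and trimmed with two substitutions. This is a sketch rather than a drop-in answer to the exercise above; the sample input is made up, and it assumes the document has one `<body ...>` tag and one `</body>` tag.

```shell
# Hypothetical input with the body tags sharing lines with other markup.
html='<html><head><title>T</title>
</head><body class="x"><p>Hi</p>
<p>More</p></body></html>'

# :a ... ba loops, appending the next input line with N until the last
# line ($) has been read. The two s commands then drop everything through
# the opening tag and everything from the closing tag onward.
body=$(printf '%s\n' "$html" | sed \
  -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
  -e 's/.*<body[^>]*>//' \
  -e 's/<\/body>.*//')

printf '%s\n' "$body"
```

Because the whole file is held in memory at once, this suits ordinary web pages better than very large inputs.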

Daniel B. Martin
 
Old 11-02-2012, 09:23 PM   #6
amboxer21
Member
 
Registered: Mar 2012
Location: New Jersey
Distribution: Gentoo
Posts: 291

Rep: Reputation: Disabled
Here's my heavy solution ->

Code:
sed -n '/<div style/,/<\/div>/{s/^<\/head><[a-z].*;">//;$s/.*/<\/div>/g;p}' filename.txt
It works by printing only the lines from the one containing the opening div tag through the one containing </div>.
Code:
/<div style/,/<\/div>/
Which results in
Code:
</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>
Unfortunately, it prints the entire line the div regex resides on. So, I deleted everything from the </head tag until the <div tag.
Code:
s/^<\/head><[a-z].*;">//
I also replaced the entire bottom line with a single </div> tag. Then printed it.
Code:
$s/.*/<\/div>/g;p
So, this -> </div></body></html> becomes this -> </div>

I used curly braces after the address range so that several commands could run on the same pattern space, instead of making another call to sed.
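
The grouping idea in isolation, as a minimal sketch on made-up input: the address range selects the lines, and the commands inside the braces run, in order, only on those lines. `START` and `END` are placeholder markers, not anything from the sample file in this thread.

```shell
# -n suppresses automatic printing; inside the range, strip the markers and
# print explicitly. Lines outside the range are never printed.
out=$(printf '%s\n' 'x' 'START a' 'b' 'END c' 'y' |
  sed -n '/START/,/END/{s/^START //;s/^END //;p;}')
printf '%s\n' "$out"
```

The `;` before the closing brace is optional in GNU sed but keeps the script acceptable to stricter implementations.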

Last edited by amboxer21; 11-02-2012 at 11:55 PM.
 
Old 11-03-2012, 02:53 PM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
GNU sed offers the '0' address, which lets a range end on the very first line if that line matches the second pattern. Also remember that you can use other delimiters if the default '/' character appears in the expression.

Next, we have to consider this in at least two different steps. First we have to remove all lines that come before or after the ones with the body tags, and second we have to edit out the unwanted parts of the lines that do contain them. This is probably best done with multiple, nested expressions.

My attempt:
Code:
sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html
If you aren't using GNU sed, you'll probably have to add another expression to process the first line separately, in case it contains the "<body>" tag.
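
A small sketch of the difference the `0` address makes, on made-up input where the pattern already appears on the first line. It assumes GNU sed, since the `0,/re/` form is a GNU extension.

```shell
input=$(printf '%s\n' '<body>' 'keep me' '<body>' 'tail')

# With 1,/re/ the end pattern is only checked from line 2 onward, so the
# range runs on to the second match and deletes three lines.
from_one=$(printf '%s\n' "$input" | sed '1,/<body>/d')

# With 0,/re/ the end pattern may match on line 1, so only that line goes.
from_zero=$(printf '%s\n' "$input" | sed '0,/<body>/d')

printf '1,/re/ leaves:\n%s\n\n0,/re/ leaves:\n%s\n' "$from_one" "$from_zero"
```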

Here are a few useful sed references:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt

I've found the sedfaq to be especially informative in difficult cases like this.
 
Old 11-06-2012, 03:17 PM   #8
sliddjur
LQ Newbie
 
Registered: Nov 2012
Posts: 7

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by amboxer21
Here's my heavy solution ->

Code:
sed -n '/<div style/,/<\/div>/{s/^<\/head><[a-z].*;">//;$s/.*/<\/div>/g;p}' filename.txt

Thanks, I was thinking of that solution. But there is one problem: the content after <body*> might not always start with <div>, and the tag after </div> might not always be </body>.
 
Old 11-06-2012, 11:39 PM   #9
amboxer21
Member
 
Registered: Mar 2012
Location: New Jersey
Distribution: Gentoo
Posts: 291

Rep: Reputation: Disabled
Quote:
Originally Posted by sliddjur
Thanks, I was thinking of that solution. But there is one problem: the content after <body*> might not always start with <div>, and the tag after </div> might not always be </body>.

Sorry, sliddjur. I was focusing on the sample you posted. Have you figured it out since then, or do you still need a solution that fits the criteria in the part I quoted?

If you still need one, how about David's solution?
Code:
sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html
Or does the same problem still arise?

Last edited by amboxer21; 11-06-2012 at 11:53 PM.
 
Old 11-07-2012, 09:45 AM   #10
sliddjur
LQ Newbie
 
Registered: Nov 2012
Posts: 7

Original Poster
Rep: Reputation: Disabled
Yes, I finally got this working. I will post the solution here later when I get home, if anyone is interested.
 
Old 11-07-2012, 10:48 AM   #11
amboxer21
Member
 
Registered: Mar 2012
Location: New Jersey
Distribution: Gentoo
Posts: 291

Rep: Reputation: Disabled
Post what you came up with.
 
Old 11-10-2012, 07:30 PM   #12
ctsgnb
LQ Newbie
 
Registered: Nov 2012
Posts: 3

Rep: Reputation: Disabled
A lazy one :

Code:
sed '/<div/,/<\/div/!d;s:.*<div:<div:;s:/div>.*:/div>:' yourfile
Using a colon as the delimiter instead of the usual s/ / / syntax saves some backslashes, since the / in /div> no longer needs escaping.

Code:
# cat t1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
<html xmlns="http://www.w3.org/1999/xhtml">;
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>
</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div></body></html>
# sed '/<div/,/<\/div/!d;s:.*<div:<div:;s:/div>.*:/div>:' t1
<div style="width:100%; height:123px; margin-top:150px; text-align:center">
<p>Test</p>
<br />
<p>Hmm</p>
<img src="images/soon.png" width="428" height="123" alt="Check back soon" />
</div>
#
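
The delimiter point can be checked on its own. A minimal sketch with a made-up input line: the first two commands perform the same substitution, and `\|regex|` is shown as the equivalent trick for an address (a custom address delimiter has to be introduced with a backslash).

```shell
line='<p>x</p></div></body>'

# Default delimiter: the '/' in '</div>' must be escaped.
with_slash=$(printf '%s\n' "$line" | sed 's/<\/div>.*/<\/div>/')

# Colon delimiter: no escaping needed.
with_colon=$(printf '%s\n' "$line" | sed 's:</div>.*:</div>:')

# Address with a custom delimiter: print only lines matching '</body'.
matched=$(printf '%s\n' "$line" | sed -n '\|</body|p')

printf '%s\n%s\n%s\n' "$with_slash" "$with_colon" "$matched"
```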

Last edited by ctsgnb; 11-10-2012 at 07:39 PM.
 
  


All times are GMT -5.
