sed - delete everything before pattern?
Hello. In my Linux class we're supposed to use sed to strip HTML tags from files. I'm kind of stuck. I've tried to read up on regex, but my brain is getting overloaded.
We are supposed to delete everything from the start of the file up to and including <body*>, AND everything from </body> through the end of the file. This is what I've come up with:
Code:
sed '1,/<body*/d ; /<\/body/,//d' index.html
Can someone help me and point me in the right direction? |
Help us to help you. Give us a sample input file. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.
Daniel B. Martin |
My command:
Code:
sed -e 's/\(<body[^\>]*.\)/\n\1\n/g ; s/\(<\/body>\)/\n\1\n/g ; 1,/<body*/d ; /<\/body/,//d' $1
The two s commands first put the <body ...> and </body> tags on lines of their own. After that I delete everything up to and including the line matching "<body", then search for "</body" and delete that line and everything after it. Sample file:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">;
The output I'm looking for:
Code:
<div style="width:100%; height:123px; margin-top:150px; text-align:center">
|
An empty regex (//) does not mean "to the end of the file": sed treats it as a repeat of the last regex it used. In an address, $ means the last line, so end the second range with $ instead.
Code:
sed '1,/<body*/d ; /<\/body/,$d' index.html
See also http://stackoverflow.com/questions/5...e-html-why-not and consider what would happen if your HTML was all one line. |
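To make the `$` address and the one-line caveat concrete, here is a small sketch. The sample files are made up for illustration (they are not the poster's index.html), and the pattern is written as `<body` rather than `<body*`.

```shell
# Hypothetical sample: the body tags sit on lines of their own.
tmp=$(mktemp)
printf '%s\n' '<html>' '<head><title>t</title></head>' '<body style="x">' \
  '<p>hello</p>' '<p>world</p>' '</body>' '</html>' > "$tmp"

# 1,/<body/d    deletes line 1 through the first later line containing "<body".
# /<\/body/,$d  deletes from the first "</body" line through the last line.
result=$(sed '1,/<body/d; /<\/body/,$d' "$tmp")
printf '%s\n' "$result"

# Hypothetical one-line sample: the first range starts on line 1 and
# deletes it, so nothing at all is left.
printf '%s\n' '<html><body><p>hi</p></body></html>' > "$tmp"
oneline=$(sed '1,/<body/d; /<\/body/,$d' "$tmp")
rm -f "$tmp"
```

The first run prints only the two `<p>` lines. The second run prints nothing, which is why line-based deletion alone is not enough when tags share a line with content.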
Here is a piece of code to get you started.
Code:
# 1) sed to replace all line breaks with tilde (~).
Daniel B. Martin |
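The rest of that code did not survive in the thread, so the following is not Daniel's script. It is only a sketch of the idea named in the surviving comment: join all lines with a tilde, cut on the single long line, then restore the line breaks. It assumes the input contains no tilde of its own, and the sample file is made up.

```shell
# Hypothetical sample where the body tags share lines with other content.
tmp=$(mktemp)
printf '%s\n' '<html><head></head><body style="x"><p>hello</p>' \
  '<p>world</p></body>' '</html>' > "$tmp"

# Step 1: read the whole file into the pattern space (N in a loop) and
#         replace every embedded newline with ~.
# Step 2: delete everything through the opening body tag, and everything
#         from the closing body tag onward.
# Step 3: turn each ~ back into a newline.
result=$(
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/~/g' "$tmp" |
  sed -e 's/.*<body[^>]*>//' -e 's/<\/body>.*//' |
  tr '~' '\n'
)
printf '%s\n' "$result"
rm -f "$tmp"
```

Because the cuts happen on one joined line, it no longer matters whether the tags were on their own lines in the original file.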
Here's my heavy solution ->
Code:
sed -n '/<div style/,/<\/div>/{s/^<\/head><[a-z].*;">//;$s/.*/<\/div>/g;p}' filename.txt
Code:
/<div style/,/<\/div>/
Code:
</head><body style="background-color:black;"><div style="width:100%; height:123px; margin-top:150px; text-align:center">
Code:
s/^<\/head><[a-z].*;">//
Code:
$s/.*/<\/div>/g;p
I used the curly braces after the from/to regex range to group the commands, which lets me edit the buffer in the same pass instead of making another call to sed. |
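The brace grouping described above can be tried in isolation. This is a minimal sketch on a made-up five-line file, not the poster's data; the substitution is a placeholder chosen only to show that several commands run on each line of the range.

```shell
tmp=$(mktemp)
printf '%s\n' 'header' '<div style="a">' 'some text' '</div>' 'footer' > "$tmp"

# -n turns off automatic printing, so only the explicit p prints.
# The braces apply both commands to every line of the range in one sed call.
grouped=$(sed -n '/<div style/,/<\/div>/{s/some text/SOME TEXT/;p;}' "$tmp")
printf '%s\n' "$grouped"
rm -f "$tmp"
```

Lines outside the range ('header' and 'footer') are never printed, and the edit is applied inside the range without a second sed process.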
GNU sed offers the '0' address, as in 0,/regex/, which lets a range end on the very first line when the second pattern appears there. Also remember that you can use other delimiters if the default '/' character can appear in the expression.
Next, we have to consider this in at least two different steps. First we have to remove all lines that come before or after the ones with the body tags, and second we have to edit out the unwanted parts of the lines that do contain them. This is probably best done with multiple, nested expressions. My attempt:
Code:
sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html
Here are a few useful sed references:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt
I've found the sedfaq to be especially informative in difficult cases like this. |
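The two points above, the '0' address and alternate delimiters, can be checked on a tiny file. This sketch assumes GNU sed (the 0,/regex/ form is a GNU extension) and uses a made-up sample where the body tag is on the first line.

```shell
tmp=$(mktemp)
printf '%s\n' '<html><body>' '<p>hello</p>' '</body></html>' > "$tmp"

# Starting at 1, the end regex is only tested from line 2 onward. No later
# line contains "<body", so the range runs to the end and all lines go.
with_one=$(sed '1,/<body/d' "$tmp")

# Starting at 0, the end regex may match on line 1, so only line 1 goes.
with_zero=$(sed '0,/<body/d' "$tmp")

# \| ... | as the address delimiter avoids escaping the slash in "</body".
alt=$(sed '\|</body|,$d' "$tmp")

printf '%s\n' "$with_zero"
rm -f "$tmp"
```

Note that "</body" does not contain the text "<body", which is why the first range never finds its end in this sample.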
If so, how about David's solution? |
Yes, I finally got this working. I will post the solution here later when I get home, if anyone is interested.
|
Post what you came up with.
|
A lazy one:
Code:
sed '/<div/,/<\/div/!d;s:.*<div:<div:;s:/div>.*:/div>:' yourfile
Code:
# cat t1
|
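The transcript for that one-liner is cut off in the thread, so here is a sketch of how it behaves on a made-up file (not the poster's t1). It assumes a single, non-nested div; with nested divs the range would end at the first closing tag.

```shell
tmp=$(mktemp)
printf '%s\n' \
  '<html><head></head><body style="background-color:black;"><div style="width:100%">' \
  '<p>hello</p>' '</div></body>' '</html>' > "$tmp"

# /<div/,/<\/div/!d  deletes every line outside the div range.
# s:.*<div:<div:     trims whatever precedes the opening div tag.
# s:/div>.*:/div>:   trims whatever follows the closing div tag.
lazy=$(sed '/<div/,/<\/div/!d;s:.*<div:<div:;s:/div>.*:/div>:' "$tmp")
printf '%s\n' "$lazy"
rm -f "$tmp"
```

Using ':' as the s delimiter keeps the slashes in the tags readable without backslashes.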