Hello. In my Linux class we're supposed to use sed to strip HTML tags from files. I'm kind of stuck, and I've tried to read up on regex, but my brain is getting overloaded.
We are supposed to delete everything from the start of the file up to and including <body*>, and everything from </body> through the end of the file.
This is what I've come up with:
Code:
sed '1,/<body*/d ; /<\/body/,//d' index.html
This works as long as the <body> and </body> tags are on separate lines, and as long as <body> is not on the very first line.
Can someone help me and point me in the right direction?
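One thing that may help with the regex side: as far as I can tell, "*" in a sed pattern repeats the preceding character, it is not a shell-style wildcard. So '<body*' means "<bod" followed by zero or more "y". A pattern like '<body[^>]*>' is probably closer to what was intended. A small sketch (the function names are made up for illustration):

```shell
# '<body*' is "<bod" followed by zero or more "y", so it also matches "<bod>".
matches_star() { printf '%s\n' "$1" | sed -n '/<body*/p'; }

# '<body[^>]*>' is "<body", then any run of non-">" characters, then ">".
matches_tag() { printf '%s\n' "$1" | sed -n '/<body[^>]*>/p'; }

matches_star '<bod>'                  # the line is printed: the pattern matched
matches_tag  '<bod>'                  # nothing is printed
matches_tag  '<body class="main">'    # the line is printed
```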
Help us to help you. Give us a sample input file. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.
Code:
sed -e 's/\(<body[^\>]*.\)/\n\1\n/g ; s/\(<\/body>\)/\n\1\n/g ; 1,/<body*/d ; /<\/body/,//d' "$1"
I'm trying to match from "<body" up to the next ">" and insert a newline before and after that match. I do the same with "</body>": a newline before and after.
After that, I delete everything up to and including the line matching "<body". Then I search for "</body" and delete that line and everything after it.
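A possible catch with doing this in one sed call: line addresses such as 1,/re/ refer to input lines, and a newline inserted by s stays inside the same pattern space, so the delete commands never see the tags on lines of their own. A minimal sketch of the same idea split into two passes, assuming GNU sed (\n in the replacement and the 0 address are GNU extensions; the function name is hypothetical):

```shell
# GNU sed assumed.
# Pass 1 puts each body tag on a line of its own.
# Pass 2 reads those as real input lines, so the ranges can find them.
split_then_trim() {
  sed -e 's/<body[^>]*>/\n&\n/g' -e 's/<\/body>/\n&\n/g' |
    sed -e '0,/<body/d' -e '/<\/body>/,$d'
}

# Everything on one line, which defeats the single-pass version.
printf '<html><body class="x">Hello</body></html>\n' | split_then_trim
```

If the tags already sit at the start or end of a line, the first pass leaves empty lines behind, so the output may contain blank lines that were not in the original.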
Code:
# 1) sed to replace all line breaks with tilde (~).
# 2) sed to replace all "body" with backtick (`).
# 3) cut to keep text between first and second backtick.
# 4) cut to keep everything which follows the first >
# 5) sed to drop last two characters.
# 6) sed to replace all tildes with line breaks.
sed '{:q;N;s/\n/~/g;t q}' "$InFile" \
|sed -e 's/body/\`/g' \
|cut -d\` -f2 \
|cut -d\> -f2- \
|sed 's/.\{2\}$//' \
|sed 's/~/\n/g'
This works... but your task is to replace the two instances of cut with sed to accomplish the same thing.
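For reference, one way the two cut steps might be expressed in sed. This is only a sketch under the same assumptions as the pipeline above: GNU sed, no tilde or backtick in the file, and the word "body" appearing only in the two tags. The function name is made up, and the line-joining step is written with a separate label and branch rather than the original loop.

```shell
# GNU sed assumed. Reads HTML on stdin.
body_via_sed() {
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/~/g' |  # 1) join lines with ~
    sed 's/body/`/g' |                           # 2) every "body" becomes `
    sed 's/^[^`]*`\([^`]*\).*/\1/' |             # 3) replaces: cut -d\` -f2
    sed 's/^[^>]*>//' |                          # 4) replaces: cut -d\> -f2-
    sed 's/..$//' |                              # 5) drop the trailing "</"
    sed 's/~/\n/g'                               # 6) ~ back to line breaks
}

printf '<html><body class="a"><p>Hi</p></body></html>\n' | body_via_sed
```

Step 3 keeps the text between the first and second backtick by capturing it; step 4 deletes everything up to and including the first ">".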
GNU sed offers the '0' start address (as in 0,/regex/), which lets a range end on the very first line if the end pattern matches there. Also remember that you can use other delimiters if the default '/' character appears in the expression.
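A quick illustration of both points, assuming GNU sed for the 0 address (the sample data and function names are made up):

```shell
# Sample input with <body> on the very first line (hypothetical data).
sample() { printf '<body>\nfirst\n<body>\nsecond\n'; }

# 1,/re/ : the end pattern is only looked for from line 2 onward,
# so the range runs on to the second <body> line.
from_one() { sample | sed '1,/<body>/d'; }

# 0,/re/ (GNU sed only): the end pattern may already match on line 1.
from_zero() { sample | sed '0,/<body>/d'; }

# \|...| makes | the regex delimiter, so the / in </body> needs no escaping.
alt_delim() { printf 'a\n</body>\nb\n' | sed '\|</body>|,$d'; }

from_one
from_zero
alt_delim
```

With this input, from_one deletes the first three lines, while from_zero deletes only the first.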
Next, we have to consider this in at least two different steps. First we have to remove all lines that come before or after the ones with the body tags, and second we have to edit out the unwanted parts of the lines that do contain them. This is probably best done with multiple, nested expressions.
My attempt:
Code:
sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html
If you aren't using GNU sed, you'll probably have to include another expression to process the first line separately, in case it happens to contain the "<body>" tag.
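One possible workaround without the 0 address, offered only as a sketch: prepend an empty dummy line so the real first line becomes line 2, then use an ordinary 1,/regex/ range. The function name is hypothetical, and the script is the same logic as the one-liner above spread over several lines.

```shell
# Reads HTML on stdin; avoids the GNU-only 0 address.
body_posix() {
  # The extra empty line shifts the input down by one, so a <body> tag
  # on the original first line can still end the 1,/<body/ range.
  { echo; cat; } | sed '
    1,/<body/{
      /<body/!d
      s/.*<body[^>]*>//
    }
    /<\/body/,${
      /<\/body/!d
      s/<\/body.*//
    }'
}

printf '<body>Hello\nworld\n</body>\n' | body_posix
```

A tag that sits alone on its line is reduced to an empty line rather than removed, so the output can end with a blank line.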
Unfortunately, it prints the entire line the div regex resides on. So, I deleted everything from the </head tag until the <div tag.
Code:
s/^<\/head><[a-z].*;">//
I also replaced the entire bottom line with a single </div> tag. Then printed it.
Code:
$s/.*/<\/div>/g;p
So, this -> </div></body></html> becomes this -> </div>
I used curly braces after the from/to regex range so that I could edit the pattern space within the same command instead of making another call to sed.
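The full command is not shown here, so the following is only a guess at the general shape: a range address followed by a braced group that edits the line and then prints it. The patterns and the function name are assumptions for illustration.

```shell
# Reads HTML on stdin; -n suppresses automatic printing.
div_only() {
  sed -n '/<div/,/<\/div>/{
    s/<\/div>.*/<\/div>/
    p
  }'
}

printf '<head></head>\n<div>\ntext\n</div></body></html>\n' | div_only
```

Both commands inside the braces run only for lines in the range, so the trailing "</body></html>" is trimmed and the result printed in a single sed call.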
Thanks, I was thinking of that solution. But there is one problem, the tags after <body*> might not always start with <div> and also the tag after </div> might not always be </body>.
Thanks, I was thinking of that solution. But there is one problem, the tags after <body*> might not always start with <div> and also the tag after </div> might not always be </body>.
Sorry sliddjur. I was focusing on the problem posted. Have you since figured your problem out or are you still in need of a solution that will work to fit the criteria you stated above in what I have quoted?
If so, how about David's solution?
Code:
sed '0,/<body/ { /<body/! d ; /<body/ s/.*<body[^>]*>// } ; \|</body|,$ { \|</body|! d ; \|</body| s|</body.*|| }' file.html