LinuxQuestions.org
Old 01-16-2012, 07:18 AM   #1
olleolle
LQ Newbie
 
Registered: Jan 2012
Posts: 4

Rep: Reputation: Disabled
Strip html-document


Hi,

I have an HTML document which I want to edit with a script. I want to remove everything outside the body tag, and the body tags themselves, keeping only the tags inside.

I have a document like this:

Text
<body>
<div>
<img src="hi.jpeg" alt="">
</div>
</body>
Text text

I want the document like this:

<div>
<img src="hi.jpeg" alt="">
</div>

So far I've done this. I know this would only remove the text before the body tag, but I'm stuck now...

sed 's/[^<body>]*//'
 
Old 01-16-2012, 08:00 AM   #2
angel115
Member
 
Registered: Jul 2005
Location: France / Ireland
Distribution: Debian mainly, and Ubuntu
Posts: 535

Rep: Reputation: 79
Hi Olleolle,

I could achieve what you want to do with this:
Code:
awk '/^<body/, /<\/body>$/' index.html |sed -e "s/<body.*>//g" | sed  -e "s/<\/body.*>//g"
PS: Although it works, I guess you will have to clean up the sed code that I've put at the end, but I don't have time right now to spend more on this post.

Best regards,
Angel.
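As one possible cleanup of the pipeline above, here is a minimal sketch that does the whole job in awk with a flag variable. It assumes the body tags sit on their own lines, as in the sample from post #1; the temporary file and the variable names (index, inner, inside) are made up for the illustration.

```shell
# Hypothetical sample file mirroring the document in post #1.
index=$(mktemp)
cat > "$index" <<'EOF'
Text
<body>
<div>
<img src="hi.jpeg" alt="">
</div>
</body>
Text text
EOF

# Order of the rules matters: the closing tag clears the flag before the print
# rule runs, and the opening tag sets it only after the print rule has run,
# so neither tag line is printed.
inner=$(awk '/<\/body>/ { inside = 0 }
             inside     { print }
             /<body/    { inside = 1 }' "$index")
printf '%s\n' "$inner"
rm -f "$index"
```

Because the tag lines are skipped rather than blanked, no empty lines are left behind, which is what the two trailing sed calls would otherwise produce.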
 
Old 01-16-2012, 02:02 PM   #3
devUnix
Member
 
Registered: Oct 2010
Location: Bengaluru, India
Distribution: RHEL 5.1 on My PC, & SunOS / Sun Solaris, RHEL, SuSe, Debian, FreeBSD and other Linux flavors @ Work
Posts: 584

Rep: Reputation: 59
Sample HTML File:


cat file.html
Code:
<html>
<head>
<title></title>
</head>
<body>
<form>
<input type=text />
<input type=submit value="Submit" />
</form>
</body>
</html>
Parsing the HTML File:

Code:
 awk '/^<body/ {inBody=1} inBody {print} /<\/body>/ {exit}' file.html
<body>
<form>
<input type=text />
<input type=submit value="Submit" />
</form>
</body>
Enjoy!

Last edited by devUnix; 01-16-2012 at 02:03 PM.
 
Old 01-16-2012, 02:21 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian + kde 4 / 5
Posts: 6,834

Rep: Reputation: 1976
Please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.

What you're really saying is that you want to remove everything from the start of the file to the <body> tag, and also everything from the </body> tag to the end of the file, correct?

In that case, it's time for you to learn about sed's addressing ability, and commands other than "s" (substitution).

Code:
sed -e '1,\|<body>| d' -e '\|</body>|,$ d'
sed's full expression syntax is '<address1>,<address2> <commands>'. Each address is either a line number or a regular expression that matches something on a line. The commands are applied to every line from the first matched line through the second, inclusive. Commands are single letters, sometimes followed by other parameters; e.g. "s" is the substitution command, followed by the pattern and the replacement, delimited as "s/pattern/replacement/".

In this case I've given sed two separate expressions (using -e). The first expression is from line 1 to "<body>", with the command being "d" for delete. The second expression is similarly from </body> to the last line, "$".

Regex matches are delimited by /../ by default, but since that character appears in the input text, you can use a different character in an address as long as you prefix the first one with a backslash. In this case I decided to use \|..| instead.
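To see the two address ranges at work, here is a small sketch that runs the command on the sample document from post #1. The temporary file and the variable names (sample, body) are only there for the demonstration.

```shell
# Recreate the sample document from post #1 in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Text
<body>
<div>
<img src="hi.jpeg" alt="">
</div>
</body>
Text text
EOF

# Delete line 1 through the <body> line, then the </body> line through the end.
body=$(sed -e '1,\|<body>| d' -e '\|</body>|,$ d' "$sample")
printf '%s\n' "$body"
rm -f "$sample"
```

Only the three lines between the body tags should be printed. Note that this relies on each body tag being on a line of its own.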

Here are a few useful sed references.
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt


BTW, the command you posted would not work, for two reasons. First, by default sed operates on one line at a time, and without addresses it applies the same pattern to every line. Second, that regular expression does not do what you think it does. The '[..]' brackets in a regex specify a list of individual characters to match, not a string, and when the first character inside them is "^", the list is negated. So what that command really does is remove, from every line, the leading run of characters other than b, d, o, y, < and >. If a line starts with one of those characters, the pattern matches the empty string there and nothing is removed.
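A quick way to check this is to feed the expression a few test lines. The helper name strip_demo and the test strings below are made up for the illustration.

```shell
# [^<body>] matches any ONE character that is not <, b, o, d, y or >.
# With *, the first match starts at the beginning of the line and may be empty.
strip_demo() {
    printf '%s\n' "$1" | sed 's/[^<body>]*//'
}

strip_demo 'Text'             # every character is outside the set, so the line becomes empty
strip_demo 'Text <body>'      # the leading "Text " is removed, leaving <body>
strip_demo '<div>text</div>'  # unchanged: the line starts with "<", so the match is empty
strip_demo 'body text'        # unchanged for the same reason ("b" is in the set)
```

Without the g flag only that first (possibly empty) match is replaced, which is why lines starting with a listed character come through untouched.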

Here are a couple of regular expressions tutorials:
http://mywiki.wooledge.org/RegularExpression
http://www.grymoire.com/Unix/Regular.html

Edit: This all assumes that you have a cleanly-formatted html file with the body tags on their own separate lines. If the tags run together, you'd have to craft a more careful sed command, that eliminates only the part of the line you don't want.

Code:
sed -e '1,\|<body>| {s|.*<body>||p; d}' -e '\|</body>|,$ {s|</body>.*||p;d}'
I'll leave it to you to figure out how it works.
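For anyone who wants to check their reading of it, here is a sketch that runs the command on a made-up input where the tags share lines with other markup. Two things are my own adjustments rather than part of the original: a ";" before each "}" (POSIX sed wants one there, GNU sed accepts either form), and an input where <body> is deliberately not on line 1, since as far as I know sed only starts looking for the end of a range on the line after the one that opened it.

```shell
# Hypothetical input with tags sharing lines; <body> is deliberately not on line 1.
page=$(mktemp)
cat > "$page" <<'EOF'
<html><head><title>t</title></head>
<body><div>
<img src="hi.jpeg" alt="">
</div></body></html>
EOF

# Inside each range: cut the unwanted part of the tag line and print what is
# left (the p flag fires only if the substitution matched), then delete the
# pattern space so the untrimmed line is never auto-printed.
trimmed=$(sed -e '1,\|<body>| {s|.*<body>||p; d;}' \
              -e '\|</body>|,$ {s|</body>.*||p; d;}' "$page")
printf '%s\n' "$trimmed"
rm -f "$page"
```

If a body tag does stand alone on its line, the substitution leaves an empty pattern space and p prints it, so expect a blank line in that spot.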

Last edited by David the H.; 01-16-2012 at 02:49 PM. Reason: as stated
 
Old 01-17-2012, 05:26 AM   #5
olleolle
LQ Newbie
 
Registered: Jan 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hi!

Thank you so much for your quick and well explained answers.
It's nice having you here for us Linux beginners!
Finally, thanks again and especially to you David the H!
 
  


All times are GMT -5. The time now is 08:11 AM.
