LinuxQuestions.org
03-30-2013, 06:39 AM   #1
olleolle
LQ Newbie
 
Registered: Jan 2012
Posts: 4

Rep: Reputation: Disabled
Format HTML-documents with SED


Hi!

I need help with the finishing touch on my script. I want to strip an .html/.htm document down to the contents of its BODY element: remove everything up to and including the opening BODY tag and everything from the closing BODY tag onward, then save the result to a new file, e.g. example.html_nobody.

Code:
<html><body><div><p>Hello World</p></div></body></html>
Should result in:

Code:
<div><p>Hello World</p></div>
__________________________________

This is what I have right now. The problem is that my script can't handle multiple tags on a single line; it only works when every tag is on its own line. Any ideas? Thanks!

Code:
#!/bin/bash

sok=$(find / -type f \( -name '*.html' -o -name '*.htm' \))

for f in $sok
do
echo `cat $f$1 | sed -e '1,\|<body| d' -e '\|</body>|,$ d'` > $f"_nobody" 
done
 
03-30-2013, 06:56 AM   #2
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,797
Blog Entries: 4

Rep: Reputation: 285
Just try:
Code:
sed -e 's/<html><body>//;s/<\/body><\/html>//'
The script will then look like:
Code:
#!/bin/bash

sok=$(find / -type f \( -name '*.html' -o -name '*.htm' \))

for f in $sok
do
echo $(cat $f$1 | sed -e 's/<html><body>//;s/<\/body><\/html>//') > $f"_nobody" 
done
Or, better, use while + read instead of for:
Code:
#!/bin/bash
# Read the file list line by line so that paths containing spaces stay intact.
# find runs only once, in the process substitution at the end of the loop.
while IFS= read -r f
do
echo $(cat "$f$1" | sed -e 's/<html><body>//;s/<\/body><\/html>//') > "${f}_nobody"
done < <(find / -type f \( -name '*.html' -o -name '*.htm' \))
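
A variant of the loop above, as a sketch only: letting find hand the paths to a small sh script avoids parsing find's output at all, so file names with spaces or even newlines should not be a problem. The function name strip_tree and the idea of passing a start directory instead of scanning / are assumptions for illustration, not part of the original script. The sed expression is the one from this post.

```shell
# Hypothetical helper: write FILE_nobody next to every .html/.htm file below $1.
strip_tree() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) -exec sh -c '
        # Each path arrives as one positional argument, however odd its name.
        for f do
            sed -e "s/<html><body>//;s/<\/body><\/html>//" "$f" > "${f}_nobody"
        done
    ' sh {} +
}

# Example call on a scratch directory rather than /:
# strip_tree "$HOME/public_html"
```

Unlike the echo $(...) form, this keeps the line breaks of the input in the output file.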
 
03-30-2013, 07:05 AM   #3
olleolle
LQ Newbie
 
Registered: Jan 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thank you shivaa for the quick reply!

Sorry for not being clear enough. The script should be able to delete everything before the BODY tag. In other words, it must handle every possible syntax before the BODY tag.

For example:

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<body style="background-color:red;"><div><p>"Hello world!</p><div><body>
Should result in:

Code:
<div><p>"Hello world!</p><div>
 
03-30-2013, 07:30 AM   #4
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,797
Blog Entries: 4

Rep: Reputation: 285
Can you try awk:
Code:
~$ echo "<html><body><div><p>Hello World</p></div></body></html>" | gawk 'BEGIN{FS="<div>";OFS=""}; {print FS,$2,FS}'| gawk 'BEGIN{FS="</div>";OFS=""}; {print $1,FS}'
Script will look like:
Code:
#!/bin/bash
sok=$(find / -type f \( -name '*.html' -o -name '*.htm' \))
for f in $sok
do
echo $(cat $f$1 | gawk 'BEGIN{FS="<div>";OFS=""}; {print FS,$2,FS}'| gawk 'BEGIN{FS="</div>";OFS=""}; {print $1,FS}') > $f"_nobody" 
done

Last edited by shivaa; 03-30-2013 at 07:39 AM. Reason: Modification in cmd
 
03-30-2013, 07:33 AM   #5
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655
You might want to first process the file so that <body> starts on a new line. Imagine an HTML page where all the linefeeds are removed.
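
One way to do that preprocessing step, as a rough sketch: put a line break before and after the opening and closing BODY tags. This assumes GNU sed, where \n in the replacement inserts a newline (other seds may not support it), and split_body_tags is just a made-up helper name.

```shell
# Hypothetical helper; GNU sed assumed for \n in the replacement text.
split_body_tags() {
    sed -e 's/<body[^>]*>/\n&\n/' -e 's/<\/body>/\n&\n/'
}

# With the tags on lines of their own, the range deletes from the first post
# have something to work with (page.html is a placeholder file name):
# split_body_tags < page.html | sed -e '1,\|<body| d' -e '\|</body>|,$ d'
```

Because a newline is also inserted before the opening tag, that tag can never end up on line 1, which matters for a range that starts at line 1: sed only looks for the end of such a range from line 2 onward.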

You can use a range to just process the lines between the <body> & </body> tags.

sed -n '/<body.*>/,/<\/body>/{
...
}'

The curly brackets form a block, which can contain sub-blocks.

/<body[ >]/{s/.*<body[^>]*>//}

Covering "every possible syntax" may entail testing for different patterns. Watch out for false positives. For example, consider a tag like <bodycolor ... To exclude it, include the > or the whitespace after "body" in your match.
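
Putting the range, the block and the stricter match together might look like the sketch below. It is only an illustration of the idea, not a complete solution: body_range is a made-up name, and if </body> sits on the same line as the opening tag while more lines follow, the range never sees its end and runs to the end of the file, which is one reason to put the tags on their own lines first.

```shell
# Hypothetical helper: print only what lies between <body ...> and </body>.
body_range() {
    sed -n '
        /<body[ >]/,/<\/body>/{
            # On the line with the opening tag, drop everything up to and including it.
            /<body[ >]/s/.*<body[^>]*>//
            # On the line with the closing tag, drop it and everything after it.
            s/<\/body>.*//
            # Print what is left, skipping lines that are now empty.
            /./p
        }
    '
}

# Example (page.html is a placeholder file name):
# body_range < page.html > page.html_nobody
```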

Using an XSLT-based tool is usually a better way to handle XML-based files.
 
  

