Old 01-20-2017, 11:12 AM   #1
A bash script with sed to extract everything after the <body> tag and before the </body> tag from html or htm files

I am totally stuck writing this script.
First, the script should recursively scan all html and htm pages under the location specified as an argument. Second, for each file it should remove everything before the <body> tag and everything after the </body> tag, including the <body> and </body> tags themselves.
Third, the result should be saved in another file: if the original file is called index.html, for example, the result becomes index.html_nobody.

I wrote this code, but it is not giving the desired result.

for file in $( ls $1 -r );
    if [ -d $file ];
        find -type f -name "*.html" -o -name "*.htm" -exec sed -e '1,/<body/ s/.*/ /' -e '/<\/body>/,$ s/.*/ /' "{}" > "{}_nobody" \;
        echo "Success!"
        exit 0

echo "Unvalid path, please try again."
exit 0
Is there anyone who can help me out with this or give some valuable tips?
Old 01-20-2017, 11:35 AM   #2
Generally, there are so many variations in HTML that you really need to use a proper HTML parser. Perl has several. You could start with HTML::TreeBuilder, as one example, like this:


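use warnings;
use strict;
use HTML::TreeBuilder;

# Read the file named on the command line, or standard input if none is given.
my $file = shift || '/dev/stdin';

my $root = HTML::TreeBuilder->new_from_file( $file )
    or die( "Could not parse '$file' : $! \n");

# Find the first <body> element in the parsed tree.
my $zap = $root->look_down( _tag => q(body) )
    or die( "No <body> element found in '$file'\n" );

# Print the body element, tags included, indented with two spaces.
print $zap->as_HTML(undef, "  ");

exit ( 0 );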
Old 01-20-2017, 11:57 AM   #3
bash sucks sometimes, this is one of them.</opinion>
Old 01-20-2017, 12:08 PM   #4
Is this a question for a course / homework? I do agree with Turbocapitalist that bash is not really the solution to something that can be as complicated as this. However, if it is for a course / homework, then the instructor may have already devised simple web pages to allow the script to work.

If my assumption is correct, you have more pressing issues than how to solve this problem:

1. Your question asks for a location to be provided as an argument to the script. I interpret this to mean a directory, so your testing is the wrong way around, i.e. you should be checking that the user has passed you a directory before continuing the script.

2. The previous point then leads into a decent error message explaining what the user has done wrong, and perhaps a 'usage' message showing what the correct invocation would be.

3. Do not use ls to provide data to a for loop, as white space in file / directory names will cause issues.

4. As your question is looking at recursion, the for loop is still the wrong construct for the same reason as above. You should look into a while loop.

5. If you are going to use find to perform the entire task, what is the point of using a loop at all?

6. At the moment, with fresh eyes and having only just written this code, your find / sed combination makes perfect sense ... how about in 6 months or a year? Whilst it is 'cool' to be able to write a one-liner to do everything, it becomes very hard to debug, extend or add to. If the instructor now says to use the same script and perform other tasks on the data prior to creating the new files, the current solution will become unwieldy very quickly.

7. Now, this one is often met with "that's your opinion", but I was often told that, where possible, a script should have a single point of successful exit, with every other exit point reporting an error. Your current script may be given a directory / file that does not fit the purpose; you then print an error message and exit the script successfully (exit 0), even though the last thing that happened was an error.

So I look forward to hearing back what type of problem this is. Please take any and all comments above as you wish; you do not have to follow any of them, it is only advice.
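
For what it is worth, here is a rough sketch of how the points above might fit together. It is not a definitive solution: the function names extract_bodies and strip_body are made up for illustration, and the sed expression assumes that the <body ...> and </body> tags each sit on a line of their own, which real-world HTML does not guarantee. It also needs bash rather than plain sh, because of read -d.

```bash
# Sketch only: extract_bodies and strip_body are hypothetical names.

# Print the lines strictly between the <body ...> line and the </body> line.
# Assumption: each of the two tags is on its own line.
strip_body() {
    sed -n '/<body/,/<\/body>/{/<body/d;/<\/body>/d;p;}' "$1"
}

extract_bodies() {
    # Points 1, 2 and 7: check the argument first and fail with a non-zero status.
    if [ "$#" -ne 1 ] || [ ! -d "$1" ]; then
        echo "Invalid path: '${1:-}' is not a directory." >&2
        echo "Usage: extract_bodies DIRECTORY" >&2
        return 1
    fi

    # Points 3 to 5: find does the recursion, and NUL-delimited names survive
    # white space. The parentheses group the two -name tests so that -type f
    # applies to both of them.
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) -print0 |
        while IFS= read -r -d '' file; do
            # The redirection is done here, once per file, by the loop.
            strip_body "$file" > "${file}_nobody"
        done
}
```

Calling extract_bodies /some/directory would then write an index.html_nobody next to every index.html it finds. Saved as a script, the last line could simply be extract_bodies "$@", so that the exit status of the script is whatever the function returned.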
Old 01-21-2017, 04:11 AM   #5
Perl has been mentioned, and Python.
I would like to add xmllint to the pool of dedicated HTML parsing tools.
Once you understand how XPath expressions are built, it is a handy command line tool.
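
As a rough illustration, assuming xmllint (part of libxml2) is installed and supports the --html and --xpath options, something like the following prints what is inside <body>. The function name body_of is made up for this example.

```bash
# Sketch only: body_of is a hypothetical helper name.
# --html selects libxml2's forgiving HTML parser; the XPath expression
# //body/node() selects every child node of <body>, not the tag itself.
# Parser warnings go to stderr and are discarded here, which also hides
# genuine error messages, so drop the redirection while debugging.
body_of() {
    xmllint --html --xpath '//body/node()' "$1" 2>/dev/null
}
```

For a single file this would be used as body_of index.html > index.html_nobody. Note that xmllint re-serialises the markup, so the output may not match the original text byte for byte.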


All times are GMT -5.
