A bash script with sed to extract everything after <body> and before </body> tags from html or htm files

PetterS · 01-20-2017, 11:12 AM

I am totally stuck writing this script.
Description:
The script should recursively scan all html and htm pages under the location given as an argument. Secondly, for each file, it should remove everything before <body> and everything after </body>, including the <body> and </body> tags themselves.
Thirdly, the result should be saved in another file: if the original file is called index.html, for example, the new file is called index.html_nobody.

I wrote this code, but it does not give the desired result.

Code:

#!/bin/bash

for file in $( ls $1 -r );
do
    if [ -d $file ];
    then
        find -type f -name "*.html" -o -name "*.htm" -exec sed -e '1,/<body/ s/.*/ /' -e '/<\/body>/,$ s/.*/ /' "{}" > "{}_nobody" \;
        
        echo "Success!"
        
        exit 0
    fi
done

echo "Unvalid path, please try again."
        
exit 0

Is there anyone who can help me out with this or give some valuable tips?

Turbocapitalist · 01-20-2017, 11:35 AM

Generally, there are so many variations in HTML that you really need to use a proper HTML parser. Perl has several. You could start with HTML::TreeBuilder, as one example, like this:

Code:

#!/usr/bin/perl

use warnings;
use strict;
use HTML::TreeBuilder;

my $file = shift || '/dev/stdin';

my $root = HTML::TreeBuilder->new_from_file( $file )
    or die( "Could not parse '$file' : $! \n");

my $zap = $root->look_down( _tag => q(body) );

print $zap->as_HTML(undef, "  ");

exit ( 0 );

Habitual · 01-20-2017, 11:57 AM

https://pypi.python.org/pypi/BeautifulSoup
Bash sucks sometimes, and this is one of those times.</opinion>

grail · 01-20-2017, 12:08 PM

Is this a question for a course / homework? I do agree with Turbocapitalist that bash is not really the right tool for something that can get as complicated as this. However, if it is for a course / homework, then the instructor may have already devised simple web pages that allow the script to work.

If my assumption is correct, you have more pressing issues than how to solve this problem:

1. Your question asks for a location to be provided as an argument to the script. I interpret this to mean a directory, so your testing is the wrong way around, i.e. you should check that the user has passed you a directory before continuing with the script.

2. The previous point then leads into a decent error message to explain what the user has done wrong, and perhaps to a 'usage' message showing what the correct invocation would be.

3. Do not use ls to provide data to a for loop, as white space in file / directory names will cause issues.

4. As your question is looking at recursion, the for loop is still the wrong construct for the same reason as above. You should look into a while loop.

5. If you are going to use find to perform the entire task, what is the point of using a loop at all?

6. At the moment, with fresh eyes and having only just written this code, your find / sed combination makes perfect sense ... but how about in 6 months or a year? Whilst it is 'cool' to be able to write a one-liner to do everything, it becomes very hard to debug and extend. If the instructor now asks you to use the same script and perform other tasks on the data prior to creating the new files, the current solution will become unwieldy very quickly.

7. Now, this one is often met with "That's your opinion", but I was often told that, where possible, a script should have a single point of successful exit, with every other exit point being an error exit. Your current script may be given a directory / file that does not fit the purpose; you then give an error message, yet exit the script successfully (exit 0), even though the last thing that happened was an error.

So I look forward to hearing back what type of problem this is.

Please take any and all comments above as you wish; you do not have to follow any of them, it is only advice.
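
The points above could be sketched roughly as follows. This is only an illustration, not the original poster's script: the names usage and process_tree are made up, the sample tree is throwaway data, and the sed range assumes that the <body ...> and </body> tags each sit on a line of their own, which real-world HTML often does not guarantee. It passes the file names from find straight to a small inline shell instead of using a while loop; in bash, find -print0 piped into while IFS= read -r -d '' is the other common way to stay safe with white space.

```shell
#!/bin/bash
# Sketch only. Assumes <body ...> and </body> each sit on a line of their own.
# usage and process_tree are hypothetical names chosen for this example.

usage() {
    echo "Usage: strip_body.sh DIRECTORY" >&2
}

# process_tree DIR: write FILE_nobody next to every *.html / *.htm under DIR.
process_tree() {
    # Points 1 and 2: check the argument first and say what went wrong.
    if [ "$#" -ne 1 ] || [ ! -d "$1" ]; then
        echo "Error: '${1:-}' is not a directory." >&2
        usage
        return 1
    fi

    # Points 3 to 5: no ls. find hands each name to the inline shell as a
    # separate argument, so white space in names is safe. The parentheses
    # make -exec apply to both -name tests.
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) -exec sh -c '
        for f do
            # Print the lines between the <body> line and the </body> line,
            # dropping the two tag lines themselves.
            sed -n -e "/<body/,/<\/body>/{" -e "/<body/d" -e "/<\/body>/d" \
                -e "p" -e "}" "$f" > "${f}_nobody"
        done
    ' sh {} +
}

# Demo on a throwaway tree (hypothetical sample data).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub dir"
printf '%s\n' '<html>' '<head><title>t</title></head>' '<body class="x">' \
    '<p>Hello</p>' '</body>' '</html>' > "$demo_dir/sub dir/index.html"

process_tree "$demo_dir" && cat "$demo_dir/sub dir/index.html_nobody"
```

On the sample tree this prints <p>Hello</p>. Calling process_tree with something that is not a directory prints the error and usage text and returns a non-zero status, in line with point 7.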

ondoho · 01-21-2017, 04:11 AM

Perl has been mentioned, and Python.
I would like to add xmllint to the pool of dedicated HTML parsing tools.
Once you understand how XPath expressions are built, it's a handy command-line tool.
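
As a rough illustration of the xmllint idea, assuming xmllint (part of libxml2) is installed: the XPath expression //body/node() selects everything inside <body> without the tags themselves. The helper name body_of and the sample file are made up for this sketch, and the exact white space in the output may differ between libxml2 versions.

```shell
#!/bin/bash
# Sketch only: needs xmllint (libxml2). body_of is a hypothetical helper name.

# body_of FILE: print the child nodes of <body>, without the <body> tags.
body_of() {
    # --html selects the lenient HTML parser; its warnings go to stderr.
    xmllint --html --xpath '//body/node()' "$1" 2>/dev/null
}

# Demo on a throwaway file (hypothetical sample data).
sample=$(mktemp)
printf '%s\n' '<html><head><title>t</title></head>' \
    '<body><p>Hello</p></body></html>' > "$sample"

if command -v xmllint >/dev/null 2>&1; then
    body_of "$sample"
else
    echo "xmllint not found" >&2
fi
```

Because the parser builds a real document tree, this does not depend on where the <body> and </body> tags fall within a line.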