LinuxQuestions.org
01-20-2017, 11:12 AM   #1
PetterS
LQ Newbie
 
Registered: Jan 2017
Posts: 1

Rep: Reputation: Disabled
A bash script with sed to extract everything after <body> and before </body> tags from html or htm files


I am totally stuck writing this script.
Description:
The script should recursively scan all html and htm pages from the location specified as an argument. Secondly, for each file, it should remove everything before <body> and everything after </body>, including the <body> and </body> tags themselves.
Thirdly, the result should be saved in another file: if the original file is called index.html, for example, the result becomes index.html_nobody.

I wrote this code, but it is not giving the desired result.
Code:
#!/bin/bash

for file in $( ls $1 -r );
do
    if [ -d $file ];
    then
        find -type f -name "*.html" -o -name "*.htm" -exec sed -e '1,/<body/ s/.*/ /' -e '/<\/body>/,$ s/.*/ /' "{}" > "{}_nobody" \;
        
        echo "Success!"
        
        exit 0
    fi
done

echo "Unvalid path, please try again."
        
exit 0
Is there anyone who can help me out with this or give some valuable tips?
 
01-20-2017, 11:35 AM   #2
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,294
Blog Entries: 3

Rep: Reputation: 3719
Generally there are so many variations with HTML that you really need to use a proper HTML parser. Perl has several. You could start with HTML::TreeBuilder, as one example, like this:

Code:
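#!/usr/bin/perl

use warnings;
use strict;
use HTML::TreeBuilder;

# Read from the file named on the command line, or from standard input.
my $file = shift || '/dev/stdin';

my $root = HTML::TreeBuilder->new_from_file( $file )
    or die( "Could not parse '$file' : $! \n");

# Find the first <body> element in the parsed tree.
my $zap = $root->look_down( _tag => q(body) );

# Print it as indented HTML; the output still includes the <body> tags.
print $zap->as_HTML(undef, "  ");

exit ( 0 );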
 
01-20-2017, 11:57 AM   #3
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
https://pypi.python.org/pypi/BeautifulSoup
bash sucks sometimes, and this is one of those times.</opinion>
 
01-20-2017, 12:08 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
Is this a question for a course / homework? I do agree with Turbocapitalist that bash is not really the solution to something that can be as complicated as this. However, if it is for a course / homework, then the instructor may have already devised simple web pages to allow the script to work.

If my assumption is correct, you have more pressing issues than how to solve this problem:

1. Your question asks for a location to be provided as an argument to the script. I interpret this to mean a directory, so your testing is the wrong way around, i.e. you should be checking that the user has passed you a directory before continuing the script.

2. The previous point then leads into a decent error message to explain what the user has done wrong, and perhaps a 'usage' message to show what the correct invocation would be.

3. Do not use ls to provide data to a for loop, as white space in file / directory names will cause issues.

4. As your question is looking at recursion, the for loop is still the wrong construct for the same reason as above. You should look into a while loop.

5. If you are going to use find to perform the entire task, what is the point of using a loop at all??

6. At the moment, with fresh eyes and having only just written this code, your find / sed combination makes perfect sense ... but how about in 6 months or a year?? Whilst it is 'cool' to be able to write a one-liner to do everything, it becomes very hard to debug, extend or add to. If the instructor says to now use the same script and perform other tasks on the data prior to creating the new files, the current solution will become unwieldy very quickly.

7. Now this one is often met with "That's your opinion", but I was often told that, where possible, a script should have a single point of successful exit, with every other exit point signalling an error. Your current script may be given a directory / file that does not fit the purpose, yet after printing an error message it exits successfully, even though the last thing that happened was an error???
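
Points 1 to 4 and 7 could be sketched in bash roughly as follows. This is only a sketch under assumptions, not the instructor's or anyone's definitive solution: the function names extract_body and process_tree are made up for illustration, and the sed expression assumes the simple case where the lowercase <body ...> and </body> tags each sit on a line of their own. Real-world HTML will break it, which is why a proper parser was recommended earlier in the thread.

```bash
#!/bin/bash
# Sketch only. extract_body and process_tree are hypothetical helper names.

# Print the lines strictly between the <body ...> line and the </body> line.
# Assumption: both tags are lowercase and each sits on its own line.
extract_body() {
    sed -n '/<body/,/<\/body>/{ /<body/d; /<\/body>/d; p; }' "$1"
}

# Write FILE_nobody next to every *.html / *.htm file below the directory $1.
process_tree() {
    local dir=$1 file

    # Points 1 and 2: check the argument first and explain what went wrong.
    if [[ ! -d $dir ]]; then
        echo "Usage: process_tree DIRECTORY" >&2
        return 1        # Point 7: an error is reported as a failure
    fi

    # Points 3 and 4: no ls; a while loop reads NUL-separated names, so
    # white space in file or directory names is harmless.
    # The \( ... \) grouping makes -print0 apply to both -name tests.
    while IFS= read -r -d '' file; do
        # The redirection is inside the loop, so each file gets its own output.
        extract_body "$file" > "${file}_nobody"
    done < <(find "$dir" -type f \( -name '*.html' -o -name '*.htm' \) -print0)
}
```

A main script could then end with something like process_tree "$1" || exit 1, so that success is the only path that reaches the final exit 0. Keeping the sed call in its own small function also addresses point 6: it can be tested or replaced without touching the find loop.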


So I look forward to hearing back what type of problem this is. Please take any and all comments above as you wish; you do not have to follow any of them, it is only advice.
 
01-21-2017, 04:11 AM   #5
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Perl has been mentioned, and Python.
I would like to add xmllint to the pool of dedicated HTML parsing tools.
Once you understand how XPath expressions are built, it's a handy command line tool.
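
As a rough illustration of what that could look like for this thread's problem, here is a sketch that assumes xmllint (part of libxml2) is installed. The function name body_with_xmllint is made up for the example. The XPath expression //body/node() selects the child nodes of the body element, so the <body> tags themselves are left out.

```bash
#!/bin/bash
# Sketch only. body_with_xmllint is a hypothetical helper name.
# Assumes xmllint (part of libxml2) is installed.

# Print the contents of the <body> element of the HTML file $1,
# without the <body> tags themselves.
body_with_xmllint() {
    # --html       parse the input with the HTML parser instead of the XML one
    # --xpath      evaluate the XPath expression and print the matching nodes
    # 2>/dev/null  hides the parser's warnings and error messages
    xmllint --html --xpath '//body/node()' "$1" 2>/dev/null
}
```

For example, body_with_xmllint index.html > index.html_nobody would produce the file asked for in the first post. The exact white space in the output can differ between libxml2 versions.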
 
  


All times are GMT -5.
