LinuxQuestions.org
Old 12-29-2006, 07:12 PM   #1
Benanzo
Member
 
Registered: Sep 2005
Location: Seattle
Distribution: Ubuntu, Debian flavors
Posts: 119

Rep: Reputation: 15
help with sed to remove all text except for some


Hi. I am trying to write a script that will pull down people's blog postings and turn them into mp3 files for listening on my ipod. I've been using wget to get the files, sed to remove all the formatting, text2wave to make them audio and lame to make them into mp3s for my ipod. I've hit a stumbling block where I want sed to go through an html file and delete everything except for the text between the

<div class="articleBody"> and the very next </div> tags. This is where the body of the article is. I don't want text2wave to read out the formatting. Any help would be great!

Thanks
 
Old 12-30-2006, 03:06 AM   #2
blackhole54
Senior Member
 
Registered: Mar 2006
Posts: 1,896

Rep: Reputation: 61
Try this.

Code:
#!/bin/sh

#  delete all content of $1 except for that between $begin and $end, inclusive

begin="<[[:space:]]*div class=\\\"articleBody\\\"[[:space:]]*>"
end="<[[:space:]]*\/div[[:space:]]*>"

# "$1" is quoted so file names containing spaces survive
sed -i -e "/$begin/,/$end/!d" -e "s/^.*\($begin\)/\1/" \
   -e "s/\($end\).*$/\1/" -e "/$end/q" "$1"
Allowing for [[:space:]] characters in the tags might be overkill. Remove them if you want.

Edit: Uh oh. I think you didn't actually want the tags included. If so, just remove the "\1" from the substitute commands.
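
For anyone who wants the tag-free variant spelled out, here is a rough sketch, not a tested production script. It assumes GNU sed and that the opening tag and the closing </div> are on different lines. The sample file, the `sample` and `result` names, and the choice to print to stdout instead of using -i are all just for illustration. One wrinkle: once the substitution strips the closing tag, a separate "/$end/q" can no longer match, so the sketch groups the substitution and the q in one block.

```shell
#!/bin/sh
# Sketch only: keep the text between the tags, without the tags themselves.
# Assumes GNU sed; the sample file and its contents are invented.

begin="<[[:space:]]*div class=\\\"articleBody\\\"[[:space:]]*>"
end="<[[:space:]]*\/div[[:space:]]*>"

sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><body>
<p>intro</p><div class="articleBody">First line
Second line
Last line</div><p>footer</p>
</body></html>
EOF

# Print to stdout instead of using -i so the sample file is left untouched.
# On the closing line, strip the tag and everything after it, then quit.
result=$(sed -e "/$begin/,/$end/!d" -e "s/^.*$begin//" \
   -e "/$end/{s/$end.*//;q;}" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With the made-up sample this should leave only the three lines of article text.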

Last edited by blackhole54; 12-30-2006 at 03:13 AM.
 
Old 01-02-2007, 12:31 PM   #3
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234
OP, Is there only 1 set of "<div class="articleBody"> ... </div>" tags in each of your files?


blackhole54,

Nice script.

I don't think allowing for [[:space:]] characters in the tags is overkill; I see that as anticipatory/defensive programming, especially since it makes a much cleaner file if the \1's are eliminated. BTW, cleaning up your pattern space w/ that initial '!d' is a trick I'll remember, thanks. It's also a pleasure to see a sed 1-liner as a solution, instead of 20 (or so) lines of Perl.

May I offer a couple of suggestions:
  • Use the "-r" option & avoid having to escape meta-characters.
  • Use ',' (or any other convenient character) to delimit regexes.
  • Use ';' instead of ' " -e " ' to link script pieces.
  • Capitalize variables in bash. (My stylistic preference, although I have a belief that it is the norm in many places.)
  • (Definitely a personal style preference.) Use the shortest possible variable names & whenever possible make similar ones the same length.

Code:
#!/bin/sh

#  delete all content of $1 except for that between 
#   $B (begin) and $E (end), inclusive

B="<[[:space:]]*div class=\\\"articleBody\\\"[[:space:]]*>"
E="<[[:space:]]*\/div[[:space:]]*>"

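# -r and -i are given separately ("-ir" would mean -i with backup suffix "r"),
# and every script piece gets its own -e so none is mistaken for a file name.
sed -r -i -e "/$B/,/$E/!d;s,^.*($B),\1,;s,($E).*$,\1," -e "/$E/q" "$1"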
It's possible that putting the variable definitions between single quotes would eliminate some or all of the '\' characters, I don't have any appropriate files to test this on.
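
A minimal sketch of the single-quote idea, for what it's worth. It assumes GNU sed (for -r), and the sample input lines and the `out` variable are invented for the demonstration. Inside single quotes the shell passes everything through untouched, so the backslashes in front of the double quotes can go; the backslash before the slash stays only because "/" delimits the sed addresses.

```shell
#!/bin/sh
# Sketch only: same patterns, defined inside single quotes.
# Assumes GNU sed (-r); the sample input is invented.

# Inside single quotes the shell leaves every character alone, so the
# double quotes around articleBody need no backslashes at all.
B='<[[:space:]]*div class="articleBody"[[:space:]]*>'
# The slash is still escaped because "/" delimits the sed addresses below.
E='<[[:space:]]*\/div[[:space:]]*>'

# Feed three sample lines through the same sed program, reading from stdin
# so nothing is edited in place.
out=$(printf '%s\n' \
   '<p>intro</p>< div class="articleBody" >Body text' \
   'more text</div ><p>footer</p>' \
   '<p>ignored</p>' |
   sed -r -e "/$B/,/$E/!d;s,^.*($B),\1,;s,($E).*$,\1," -e "/$E/q")

printf '%s\n' "$out"
```

The sed script itself stays in double quotes because it still needs $B and $E expanded.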
 
Old 01-02-2007, 07:04 PM   #4
blackhole54
Senior Member
 
Registered: Mar 2006
Posts: 1,896

Rep: Reputation: 61
Quote:
Originally Posted by archtoad6
It's also a pleasure to see a sed 1-liner as a solution, instead of 20 (or so) lines of Perl.
Yah. I keep telling myself that one of these days I should learn Perl. My impression (perhaps wrong) was that this would be even easier to do in Perl than sed.

Quote:
  • Use the "-r" option & avoid having to escape meta-characters.
  • Use ',' (or any other convenient character) to delimit regexes.
Yeah. I was testing this on an older version of sed (with simplistic, very artificial files) and wanted to be sure what I published worked.
Quote:
  • Use ';' instead of ' " -e " ' to link script pieces.
That's a new one to me -- thanks.



Quote:
  • (Definitely a personal style preference.) Use the shortest possible variable names & whenever possible make similar ones the same length.
This we have a strong disagreement about. I am a big fan of readability. It was such a relief when I encountered languages that didn't have the 5 (or 6?) character limit of FORTRAN (and some assemblers). Of course, in a trivial script like this it doesn't matter. (And since scripts are their own executable so to speak, the considerations might be different in a script versus a compiled program -- i.e. shortening variables to cut down the size of the "executable".)

Quote:
It's possible that putting the variable definitions between single quotes would eliminate some or all of the '\' characters, I don't have any appropriate files to test this on.
Maybe. I came up with something that worked and didn't work on making it pretty. I showed the principles and figured the OP could massage it as he wanted. (Quoting in complex circumstances is something that still frequently gives me fits. But I think I am slowly getting better.)

Thanks for the kind words.

Last edited by blackhole54; 01-02-2007 at 07:08 PM.
 
Old 01-02-2007, 07:29 PM   #5
haertig
Senior Member
 
Registered: Nov 2004
Distribution: Debian, Ubuntu, LinuxMint, Slackware, SysrescueCD, Raspbian, Arch
Posts: 2,331

Rep: Reputation: 357
If this "html file" is really an "xhtml file" (i.e., XML), then you could use XALAN or some other XSLT processor to extract and process what you need. Probably overkill for your application, but if you like learning new stuff...
 
Old 01-03-2007, 10:53 AM   #6
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234
Quote:
Originally Posted by blackhole54
Yah. I keep telling myself that one of these days I should learn Perl. My impression (perhaps wrong) was that this would be even easier to do in Perl than sed.
My opinion: this one is much simpler in sed.


Quote:
Originally Posted by blackhole54
Yeah. I was testing this on an older version of sed (with simplistic, very artificial files) and wanted to be sure what I published worked.
Always a good idea. Although I can't lay my hands on my ancient copy of Sed & Awk, I'm pretty sure both "-r" & "," have worked for a long time. (BTW, "," works in Perl too.)


Quote:
Originally Posted by blackhole54
That's a new one to me -- thanks.
You're welcome.


Quote:
Originally Posted by blackhole54
This we have a strong disagreement about. I am a big fan of readability. It was such a relief when I encountered languages that didn't have the 5 (or 6?) character limit of FORTRAN (and some assemblers). Of course, in a trivial script like this it doesn't matter. (And since scripts are their own executable so to speak, the considerations might be different in a script versus a compiled program -- i.e. shortening variables to cut down the size of the "executable".)
I don't think we disagree as much as you might think -- the key is the context: As you point out, this script is so short that 1-letter variables are fine. In a longer script or where there are more variables, I would make the names longer.

I think that if there were a formula for the minimum length of a given variable name, it would be strongly correlated to the number of times the variable is repeated & especially to the length of the gaps between repetitions. Frequently used or locally isolated variables can be short, infrequently used or widely dispersed variables need to be longer. How short & how long is a balancing act involving art & judgement.

For instance, many writers of firewall scripts feel compelled to assume that iptables may not be in the default PATH, so they explicitly define IPTABLES=/path/to/iptables. They then begin every line of their scripts with $IPTABLES & almost every line with $IPTABLES -A. IMO, this is a lot of unnecessary typing w/ its attendant opportunity for typos. Why not IPT="/path/to/iptables -A" or at least IPT=/path/to/iptables? I suspect that by the 3rd to 10th line everybody will get the idea. In fact, I think this would increase the readability.

Anytime there is repetition, especially repetition of long sequences, particularly repetition of long sequences that are a single concept, there is a golden opportunity to use a variable. I have studied a number of iptables scripts, & its syntax is complicated enough that it cries out for this approach. The problem is that many of the lines (rules) get long & complicated and therefore hard to understand. The differences from line to line get obscured by the repeated material required by the syntax. If the writer of such a script were to make a set of gestalt-like variables for these repeated "concepts", the readers/maintainers of said firewall script would have a much easier time.
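
To make the idea concrete, here is a small sketch of what such variables might look like. Everything in it is an assumption for illustration: the /sbin/iptables path, the eth0 interface, the three rules, and the names RUN, IPT, IN_TCP and rules. RUN=echo turns it into a dry run that only prints the commands, so it is harmless to try without root.

```shell
#!/bin/sh
# Sketch only: short variables for the repeated parts of iptables rules.
# The path, interface and rules are made up; RUN=echo makes this a dry run
# that prints the commands instead of changing any firewall state.

RUN=echo                      # set RUN= (empty) to really run the commands
IPT="/sbin/iptables -A"       # assumed location of iptables, plus "append"
IN_TCP="INPUT -i eth0 -p tcp" # one repeated "concept": inbound TCP on eth0

# The variables are deliberately left unquoted so the shell splits them
# back into separate words.
rules=$(
   $RUN $IPT $IN_TCP --dport 22 -j ACCEPT
   $RUN $IPT $IN_TCP --dport 80 -j ACCEPT
   $RUN $IPT INPUT -j DROP
)

printf '%s\n' "$rules"
```

With the repeated material folded into IPT and IN_TCP, the only things left to read on each line are the parts that differ from rule to rule.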

When I tackle my own firewall script using these principles, I'll be sure to post it here.

<rant /> -- 1st draft of something.


Quote:
Originally Posted by blackhole54
Maybe. I came up with something that worked and didn't work on making it pretty. I showed the principles and figured the OP could massage it as he wanted. (Quoting in complex circumstances is something that still frequently gives me fits. But I think I am slowly getting better.)
Hey, I didn't bother to test the quoting possibilities either. And complex quoting still gives me fits after 40 years -- started w/ SNOBOL II in college. . . .

Quote:
Originally Posted by blackhole54
Thanks for the kind words.
Again, you're welcome.
 
Old 01-03-2007, 08:23 PM   #7
blackhole54
Senior Member
 
Registered: Mar 2006
Posts: 1,896

Rep: Reputation: 61
Quote:
Originally Posted by archtoad6
I don't think we disagree as much as you might think -- the key is the context: As you point out, this script is so short that 1-letter variables are fine. In a longer script or where there are more variables, I would make the names longer.
It sounds like we have similar views on the considerations involved but probably frequently come to different conclusions about what is best in a particular situation. Then again, I am probably not terribly consistent myself in exactly how I'll tackle a given situation. And I am certainly "guilty" of the IPTABLES situation you described. Variables, macros, subroutines, etc., are certainly all useful in increasing readability and reducing errors.

Been nice chatting with somebody else who probably started out with punch cards. Maybe we should let the thread get back to the OP's concerns.
 
Old 01-04-2007, 06:21 AM   #8
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234
Yes, yes, yes, & yes; especially the punch cards at 3am. I do think we answered OP's Q/concerns; in case we haven't, let me quote myself:

Quote:
Originally Posted by archtoad6
OP, Is there only 1 set of "<div class="articleBody"> ... </div>" tags in each of your files?
 
  

