Hi. I am trying to write a script that will pull down people's blog postings and turn them into mp3 files for listening on my iPod. I've been using wget to get the files, sed to remove all the formatting, text2wave to make them audio and lame to make them into mp3s. I've hit a stumbling block: I want sed to go through an HTML file and delete everything except the text between the <div class="articleBody"> tag and the very next </div> tag. This is where the body of the article is. I don't want text2wave to read out the formatting. Any help would be great!
Code:
#!/bin/sh
# delete all content of $1 except for that between $begin and $end, inclusive
begin="<[[:space:]]*div class=\\\"articleBody\\\"[[:space:]]*>"
end="<[[:space:]]*\/div[[:space:]]*>"
# "$1" is quoted so that file names containing spaces still work
sed -i -e "/$begin/,/$end/!d" -e "s/^.*\($begin\)/\1/" \
-e "s/\($end\).*$/\1/" -e "/$end/q" "$1"
Allowing for [[:space:]] characters in the tags might be overkill. Remove them if you want.
Edit: Uh oh. I think you didn't actually want the tags included. If so, just remove the "\1" from the substitute commands.
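For the tag-free variant, here is a minimal sketch. It assumes GNU sed, extract_body is a made-up helper name, and it prints to stdout rather than editing the file with -i. One wrinkle: once a substitution has deleted the closing tag, a later /$end/q can no longer match on that line. That matters when the whole article sits on one line or the file has more than one articleBody div, so the sketch tests for the tag first and does the trim and the quit together.

```shell
#!/bin/sh
# Sketch only: extract_body is a hypothetical helper name; GNU sed is assumed.
# Prints the text between <div class="articleBody"> and the next </div>,
# without the tags themselves. The input file is left untouched (no -i).
begin='<[[:space:]]*div class="articleBody"[[:space:]]*>'
end='<[[:space:]]*\/div[[:space:]]*>'

extract_body() {
    # 1. delete every line outside the begin..end range
    # 2. on the begin line, drop the opening tag and everything before it
    # 3. on the first line that has the closing tag, drop the tag and
    #    everything after it, then quit
    sed -e "/$begin/,/$end/!d" \
        -e "s/^.*$begin//" \
        -e "/$end/{s/$end.*\$//;q;}" "$1"
}

# When run as a script with a file argument, print the article body.
if [ "$#" -gt 0 ]; then
    extract_body "$1"
fi
```

Something like `extract_body article.html > article.txt` would then give text2wave a file with no div tags in it. Any other markup inside the article body is still there and would need its own clean-up pass.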
Last edited by blackhole54; 12-30-2006 at 03:13 AM.
OP, is there only one set of <div class="articleBody"> ... </div> tags in each of your files?
blackhole54,
Nice script.
I don't think allowing for [[:space:]] characters in the tags is overkill; I see it as anticipatory/defensive programming, especially since it makes a much cleaner file if the \1's are eliminated. BTW, cleaning up your pattern space w/ that initial '!d' is a trick I'll remember, thanks. It's also a pleasure to see a sed 1-liner as a solution, instead of 20 (or so) lines of Perl.
May I offer a couple of suggestions:
Use the "-r" option & avoid having to escape meta-characters.
Use ',' (or any other convenient character) to delimit regexes.
Use ';' instead of ' " -e " ' to link script pieces.
Capitalize variables in bash. (My stylistic preference, although I have a belief that it is the norm in many places.)
(Definitely a personal style preference.) Use the shortest possible variable names & whenever possible make similar ones the same length.
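Code:
#!/bin/sh
# delete all content of $1 except for that between
# $B (begin) and $E (end), inclusive
B="<[[:space:]]*div class=\\\"articleBody\\\"[[:space:]]*>"
E="<[[:space:]]*\/div[[:space:]]*>"
# -r and -i are given separately: GNU sed reads "-ir" as -i with backup suffix "r".
# Everything is one script joined by ';' because a bare script mixed with -e
# would be taken as an input file name.
sed -r -i "/$B/,/$E/!d;s,^.*($B),\1,;s,($E).*$,\1,;/$E/q" "$1"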
It's possible that putting the variable definitions between single quotes would eliminate some or all of the '\' characters; I don't have any appropriate files to test this on.
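A sketch of what the single-quoted definitions might look like. It has not been tried on real blog pages and assumes GNU sed. keep_article is a hypothetical helper name. Single quotes keep the double quotes in the tag literal, and using ',' as the delimiter everywhere (the addresses need the \,regex, form for that) means the '/' in the closing tag needs no escape either.

```shell
#!/bin/sh
# Sketch only, assuming GNU sed (-r enables extended regular expressions).
# Single quotes keep the embedded double quotes literal, so the pattern
# definitions need no backslashes at all.
B='<[[:space:]]*div class="articleBody"[[:space:]]*>'
E='<[[:space:]]*/div[[:space:]]*>'

# keep_article is a hypothetical helper name; it edits the file in place
# and keeps the begin and end tags, like the script it is based on.
keep_article() {
    # \,REGEX, makes ',' the address delimiter, so the range reads
    # "\,begin," then "," then "\,end,".
    sed -r -i "\,$B,,\,$E,!d;s,^.*($B),\1,;s,($E).*\$,\1,;\,$E,q" "$1"
}

# When run as a script with a file argument, trim that file.
if [ "$#" -gt 0 ]; then
    keep_article "$1"
fi
```

The sed script itself still sits in double quotes so that $B and $E get expanded. Only the definitions move to single quotes.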
Quote:
It's also a pleasure to see a sed 1-liner as a solution, instead of 20 (or so) lines of Perl.
Yah. I keep telling myself that one of these days I should learn Perl. My impression (perhaps wrong) was that this would be even easier to do in Perl than sed.
Quote:
Use the "-r" option & avoid having to escape meta-characters.
Use ',' (or any other convenient character) to delimit regexes.
Yeah. I was testing this on an older version of sed (with simplistic, very artificial files) and wanted to be sure what I published worked.
Quote:
Use ';' instead of ' " -e " ' to link script pieces.
That's a new one to me -- thanks.
Quote:
(Definitely a personal style preference.) Use the shortest possible variable names & whenever possible make similar ones the same length.
This we have a strong disagreement about. I am a big fan of readability. It was such a relief when I encountered languages that didn't have the 5 (or 6?) character limit of FORTRAN (and some assemblers). Of course, in a trivial script like this it doesn't matter. (And since scripts are their own executable, so to speak, the considerations might be different in a script versus a compiled program -- i.e. shortening variables to cut down the size of the "executable".)
Quote:
It's possible that putting the variable definitions between single quotes would eliminate some or all of the '\' characters, I don't have any appropriate files to test this on.
Maybe. I came up with something that worked and didn't work on making it pretty. I showed the principles and figured the OP could massage it as he wanted. (Quoting in complex circumstances is something that still frequently gives me fits. But I think I am slowly getting better.)
Thanks for the kind words.
Last edited by blackhole54; 01-02-2007 at 07:08 PM.
If this "html file" is really an "xhtml file" (i.e., XML), then you could use Xalan or some other XSLT processor to extract and process what you need. Probably overkill for your application, but if you like learning new stuff...
Quote:
Originally Posted by blackhole54
Yah. I keep telling myself that one of these days I should learn Perl. My impression (perhaps wrong) was that this would be even easier to do in Perl than sed.
My opinion: this one is much simpler in sed.
Quote:
Originally Posted by blackhole54
Yeah. I was testing this on an older version of sed (with simplistic, very artificial files) and wanted to be sure what I published worked.
Always a good idea. Although I can't lay my hands on my ancient copy of Sed & Awk, I'm pretty sure both "-r" & "," have worked for a long time. (BTW, "," works in Perl too.)
Quote:
Originally Posted by blackhole54
That's a new one to me -- thanks.
You're welcome.
Quote:
Originally Posted by blackhole54
This we have a strong disagreement about. I am a big fan of readability. It was such a relief when I encountered languages that didn't have the 5 (or 6?) character limit of FORTRAN (and some assemblers). Of course, in a trivial script like this it doesn't matter. (And since scripts are their own executable, so to speak, the considerations might be different in a script versus a compiled program -- i.e. shortening variables to cut down the size of the "executable".)
I don't think we disagree as much as you might think -- the key is the context: As you point out, this script is so short that 1-letter variables are fine. In a longer script or where there are more variables, I would make the names longer.
I think that if there were a formula for the minimum length of a given variable name, it would be strongly correlated to the number of times the variable is repeated & especially to the length of the gaps between repetitions. Frequently used or locally isolated variables can be short, infrequently used or widely dispersed variables need to be longer. How short & how long is a balancing act involving art & judgement.
For instance, many writers of firewall scripts feel compelled to assume that iptables may not be in the default PATH, so they explicitly define IPTABLES=/path/to/iptables. They then begin every line of their scripts with $IPTABLES & almost every line with $IPTABLES -A. IMO, this is a lot of unnecessary typing w/ its attendant opportunity for typos. Why not IPT="/path/to/iptables -A" or at least IPT=/path/to/iptables? I suspect that by the 3rd to 10th line everybody will get the idea. In fact, I think this would increase the readability.
Anytime there is repetition, especially repetition of long sequences, particularly repetition of long sequences that are a single concept, there is a golden opportunity to use a variable. I have studied a number of iptables scripts, & its syntax is complicated enough that it cries out for this approach. The problem is that many of the lines (rules) get long & complicated and therefore hard to understand. The differences from line to line get obscured by the repeated material required by the syntax. If the writer of such a script were to make a set of gestalt-like variables for these repeated "concepts", the readers/maintainers of said firewall script would have a much easier time.
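A rough sketch of the idea, not anybody's real firewall. The rules, the TCP_NEW name and the dry-run default are all illustrative assumptions. By default the commands are only echoed, so it can be run without root to see what each line expands to.

```shell
#!/bin/sh
# Sketch only: the rules and the variable names are illustrative assumptions.
# By default every command is echoed instead of executed; setting
# IPT=/path/to/iptables in the environment would run the rules for real.
IPT="${IPT:-echo iptables}"
A="$IPT -A"                              # prefix repeated on almost every rule

# A repeated "concept" gets a short name of its own.
TCP_NEW="-p tcp -m state --state NEW"    # new incoming TCP connections

load_rules() {
    # The variables are left unquoted on purpose so that the shell splits
    # them into separate words.
    $A INPUT -i lo -j ACCEPT
    $A INPUT $TCP_NEW --dport 22 -j ACCEPT
    $A INPUT $TCP_NEW --dport 80 -j ACCEPT
}

load_rules
```

With the repeated material folded into $A and $TCP_NEW, the only things that differ from line to line are the parts that actually matter, here the port numbers.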
When I tackle my own firewall script using these principles, I'll be sure to post it here.
<rant /> -- 1st draft of something.
Quote:
Originally Posted by blackhole54
Maybe. I came up with something that worked and didn't work on making it pretty. I showed the principles and figured the OP could massage it as he wanted. (Quoting in complex circumstances is something that still frequently gives me fits. But I think I am slowly getting better.)
Hey, I didn't bother to test the quoting possibilities either. And complex quoting still gives me fits after 40 years -- started w/ SNOBOL II in college. . . .
Quote:
I don't think we disagree as much as you might think -- the key is the context: As you point out, this script is so short that 1-letter variables are fine. In a longer script or where there are more variables, I would make the names longer.
It sounds like we have similar views on the considerations involved but probably frequently come to different conclusions about what is best in a particular situation. Then again, I am probably not terribly consistent myself in exactly how I'll tackle a given situation. And I am certainly "guilty" of the IPTABLES situation you described. Variables, macros, subroutines, etc., are certainly all useful in increasing readability and reducing errors.
Been nice chatting with somebody else who probably started out with punch cards. Maybe we should let the thread get back to the OP's concerns.