sed: one or more occurrences of a pattern
Hi all,
I'm getting the hang of sed but have run into a problem that I don't know how to tackle. I'm using sed (in a Bash script) to extract information from a webpage. Sometimes this information doesn't exist, sometimes it appears exactly once, and sometimes it appears several times. I'll explain with an example... Let's say the webpage contains a line that always begins with <div id="results" ... at most once per page... (0..1). After using wget to fetch the page and storing it in a file called temp.html, I extract that line using: Code:
cat temp.html | grep '<div id="results"' | head -n 1
I then want to use sed (or another program) to extract the substring or substrings in that line which always begin with <div class="entry" and always end with </div><div class="content">. The line can contain anywhere from zero to many (0..*) occurrences of this pattern, and I want to extract all occurrences, if any. That's what I don't know how to do; I don't even know whether sed is the right tool for it. Any takers? Thanks in advance, rm |
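For what it's worth, one way to pull every such occurrence out of a single line can be sketched with GNU grep's PCRE mode. This is only a sketch, assuming GNU grep built with PCRE support (-oP); the sample line below is made up to match the shape described above:

```shell
#!/bin/sh
# Hypothetical one-line input with two "entry" substrings (0..* case).
line='<div class="entry">cat</div><div class="content">x</div><div class="entry">dog</div><div class="content">y</div>'

# -o prints each match on its own line; \K discards the opening tag from
# the match, and the lazy .*? plus lookahead stops at the first
# </div><div class="content"> pair instead of the last one.
printf '%s\n' "$line" \
| grep -oP '<div class="entry">\K.*?(?=</div><div class="content">)'
# prints:
# cat
# dog
```

If grep lacks -P on your system, the newline-splitting sed approach discussed later in the thread achieves the same thing with standard tools.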
With regard to your issue, sed can certainly do that.
Can you perhaps post a small sample of the HTML page? |
Code:
<html> |
Try this ...
Code:
# 1) Prefix all "<div" with line breaks. |
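A minimal, self-contained sketch of that line-break idea, assuming GNU sed (which understands \n in the replacement) and a simplified, made-up one-line input:

```shell
#!/bin/sh
# 1) Prefix every "<div" with a newline so each tag starts its own line,
# 2) keep only the "entry" lines,
# 3) strip the opening tag and everything from the first </div> on.
printf '%s\n' '<div id="results"><div class="entry">elephant</div><div class="content">stuff</div>' \
| sed 's/<div/\n<div/g' \
| grep '^<div class="entry">' \
| sed 's#^<div class="entry">##; s#</div>.*$##'
# prints: elephant
```

Because every match ends up on its own line, this handles zero, one, or many occurrences without any special casing.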
Many thanks danielbmartin, I think we're on the right track!
What you gave me produces the following results on the actual .html source files I'm using (this is perhaps my fault, since I gave you a simplified version): Code:
elephant </div
Code:
sed 's/<div/\n<div/g' temp.html \
Code:
elephant
No need to bother with cat then... and cut is cool. How would I alter the above if I wanted to redirect wget to stdout instead of creating and removing a temp.html file? Would that be faster? Any further suggestions? Many thanks, rm |
Code:
wget ((something or other)) \ |
To be honest, line- and regex-based tools like sed are not well suited to html/xml, due to its flexible, nested, tag-based nature. While you can use them when the formatting is regular and well structured, there's always a chance that they will fail.
In the long run it's better to use a tool with a dedicated parser for the syntax. I've been playing with xmlstarlet recently, and you can use it fairly easily to extract data from xhtml-formatted files. First, let's use a file that's actually formatted in proper html: Code:
<html>
Code:
tidy -n -asxml file.html 2>/dev/null >file.xhtml
Code:
xmlstarlet sel --html -T -t -v '//*[@id="results"]/*[@class="entry"]' -n file.xhtml
The output, natch... Code:
elephant...
To match a specific kind of html tag, for instance, you apparently need to use the name function: Code:
-v '//*[name()="div"][@id="results"]/*[@class="entry"]'
Both tidy and xmlstarlet can read from stdin, BTW, so you could also pipe the commands together instead of using external files. |
To use it with wget, just do:
Code:
wget -O - | sed ... |
Code:
xmlstarlet fo --quiet --html file.html >file.xhtml
|
Thank you very much! That's great to know. I had wondered whether xmlstarlet could convert or work on html directly, but I couldn't locate how to do it in the documentation.
And yeah, I tried several things like '//div[@id="results"]' but couldn't get it to work, and the only example the documentation has for html uses the name function (actually local-name, but I deduced that you could replace it). Using the fo command to convert the file appears to solve the namespace issue, so we should now be able to pipe it all together like this: Code:
wget -O- source.com | xmlstarlet fo --quiet --html | xmlstarlet sel --html -T -t -v '//div[@id="results"]/div[@class="entry"]' -n
Edit: Hmm, I just ran a test, and at least with my example html file above, it looks like it works even without the formatting step. I know I've tried it before on other sources without success, though. Perhaps it's rather finicky about the formatting? |
Hi all,
Wow, lots of responses. Cool tools! I can't seem to get it working on my end, though. Could someone give me a working, tested example (a simple html document and the xmlstarlet command)? Another problem I've been having: with sed, I am able to insert new lines (\r or \n... or \\r \\n, depending on whether an escape character is needed), but I can't seem to match a change of line. Several problems actually:
Thx, rm |
I've already shown you a very simple example of how it works. For a more detailed version, it would probably be better if you gave us an example or two of the html files you're likely to work with and what you'd want extracted from them.
ntubski obviously knows much more about it than I do, but I'll try to cover a few things I've learned so far about xmlstarlet and xpath.

To extract data, use the sel subcommand. --html/-H is needed for xhtml input, obviously, -T means output as plain text, and -t marks the beginning of the "template" options. The -m template option can be used to match entries (acting as a "foreach" expression), and -v is used to print values on a successful match. But as seen, we can often just use -v alone for simple global matching and printing. Both single and double quote marks can be used in expressions for nested string grouping; the expressions can be quite finicky about proper quoting.

As for the xpath expressions, here are a few of the basics as I understand them. / at the front of a path entry matches tags at only that single, specific level. // in front of an entry makes it recursively match all sub-levels from that point on. @ references tag attribute names. [] brackets are used to limit a match to certain criteria. There is a mass of functions available for doing things like printing substrings or evaluating mathematical expressions; see the reference link I gave earlier. . can be used to reference a previously matched value. Code:
xmlstarlet sel -T -t -m '//div[@id="text"]/p[not(@class="ignore")]' -v '.' -n
At this point, though, I usually still have to keep trying various combinations until I get what I want. It's all rather complex and there's a lot to learn, but it does get easier with experience. |
Okay, I guess my problem is that the html/xhtml files are not properly formatted; they contain some scripts and comments, and it doesn't work.
The idea is to retrieve data from a large number of pages, so if each page has to be cleaned and reformatted first, xmlstarlet may not be the way to go. Good tool, though. I have done some XPath and XQuery, but it's been a while. |
xmlstarlet is cool, and so is the php5-tidy program. I've nevertheless opted for more standard Bash tools, because the html pages in question are not always well-formed and I'm downloading info from a large number of pages.
I've got some specific questions regarding sed. Here's my sed command: Code:
sed 's#^<div class="entry">\(.*\)</div>.*[ ]*.*#\1#g'
Code:
sed 's/<div/\n<div/g' temp.html \
When the html code is: Code:
...<div class="entry">Hello there!</div></a></li></ul></div>
the result is: Code:
Hello there!</div></a></li></ul>
And when the html code is: Code:
<div class="entry">He <B>is</B> here </div>
the result is: Code:
He <B>is</B> here
but what I want is: Code:
He is here
rm |
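A sketch addressing both questions above, assuming GNU sed. The key point is that BRE's .* is greedy, so \(.*\)</div> runs to the last </div> on the line; deleting up to the opening tag and then from the first </div> onward sidesteps that. Stripping any remaining inner tags handles the <B>...</B> case:

```shell
#!/bin/sh
# 1) Greedy-match problem: cut away everything up to the opening tag,
#    then everything from the FIRST </div> on (s matches leftmost).
printf '%s\n' '...<div class="entry">Hello there!</div></a></li></ul></div>' \
| sed 's#.*<div class="entry">##; s#</div>.*##'
# prints: Hello there!

# 2) Inner markup: same trick, then delete any remaining <...> tags
#    and trim surrounding spaces.
printf '%s\n' '<div class="entry">He <B>is</B> here </div>' \
| sed 's#.*<div class="entry">##; s#</div>.*##; s/<[^>]*>//g; s/^ *//; s/ *$//'
# prints: He is here
```

Note that the first substitution's .* is itself greedy, which is harmless here because the opening tag occurs only once per line; if it could occur several times, the newline-splitting approach from earlier in the thread is safer.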