LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   sed: one or more occurrences of a pattern (https://www.linuxquestions.org/questions/programming-9/sed-one-or-more-occurrences-of-a-pattern-4175434709/)

rm_-rf_windows 10-30-2012 02:09 AM

sed: one or more occurrences of a pattern
 
Hi all,

I'm getting the hang of sed but have encountered a problem that I don't know how to tackle.

I'm using sed (in a Bash script) to extract information from a webpage. Sometimes this information doesn't exist; sometimes it occurs once; sometimes it occurs several times. I'll explain with an example...

Let's say the webpage contains a line that always begins with <div id="results" ... at most once per page (0..1). After using wget to download the page and store it in a file called temp.html, I extract that line using:

Code:

cat temp.html | grep '<div id="results"' | head -n 1
That much I've managed without a problem.

I then want to use sed (or another program) to extract the substring or substrings in that line which always begin with <div class="entry" and always end with </div><div class="content">. The line can contain anywhere from 0 to many (0..*) occurrences, and I want to extract all of them, if any. That's the part I don't know how to do. I don't even know whether sed is the right tool for the job.

Any takers?

Thanks in advance,

rm

devnull10 10-30-2012 03:29 AM

Quote:

After using wget to recuperate the page and storing it in a file called temp.html
You can just have wget dump to stdout and pipe it straight into the next step... that might make things a little easier and would remove the need for the temp file.

With regard to your issue, sed can certainly do that. When you say

Quote:

to recuperate the substring or substrings in that line
are you saying that everything will always be on the same line, or could the data span multiple lines?

Can you perhaps post a small sample of the HTML page?

rm_-rf_windows 10-30-2012 03:44 AM

Code:

<html>

<body>

<!-- Lots and lots of html code.. the line below is
not an example of what I want but is similar in
nature. The actual line (the real line, this is a
simplified example) is much longer and can
contain 0, 1 or more pieces of data I'd like to
recuperate (which always follow "class="entry"").
The example below contains three pieces of
data which I want to recuperate and store in a
text file, one per line: "elephant", "whale" and "snake"  -->

<div id="results" class="a">...<div class="entry">elephant... <div class="entry">whale...<div class="entry">snake</div><div class=... etc.

</body>
</html>


danielbmartin 10-30-2012 06:58 AM

Try this ...
Code:

# 1) Prefix all "<div" with line breaks.
# 2) Toss lines which do not begin with '<div class="entry"'
# 3) Keep whatever follows ">".
sed 's/<div/\n<div/g' $InFile  \
|grep '^<div class="entry">'  \
|cut -f2 -d'>'                \
> $OutFile1
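On a sample line shaped like the HTML posted earlier, the pipeline behaves like this (a quick sketch with a made-up line; the \n in the replacement relies on GNU sed; note the stray </div left on the last entry, which is the issue the next reply runs into):

```shell
# Hypothetical sample line, similar in shape to the HTML posted earlier.
line='<div id="results" class="a">...<div class="entry">elephant <div class="entry">whale <div class="entry">snake</div>'

# 1) Break the line before every "<div" (GNU sed: \n in the replacement).
# 2) Keep only the entry divs.
# 3) Keep whatever follows the first ">".
printf '%s\n' "$line" \
| sed 's/<div/\n<div/g' \
| grep '^<div class="entry">' \
| cut -f2 -d'>'
```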

Daniel B. Martin

rm_-rf_windows 10-30-2012 07:47 AM

Many thanks danielbmartin, I think we're on the right track!

What you gave me gives the following results on the actual .html source files I'm using (this is perhaps my fault, I gave you a simplified version):

Code:

elephant </div
snake </div
whale </div

I've therefore altered your little script, but given that I'll be downloading a good number of web pages, perhaps it isn't the most efficient approach (I don't know much about how things work behind the scenes):

Code:

sed 's/<div/\n<div/g' temp.html  \
| grep '^<div class="entry">'  \
| sed 's#^<div class="entry">\(.*\) </div>#\1#g' \
>> outfile.txt

That gives me:

Code:

elephant
snake
whale

... which is what I want.

No need to bother with cat then... and cut is cool.

How would I alter the above if I wanted to redirect wget to stdout instead of creating and removing a temp.html file? Would that be faster?

Any further suggestions?

Many thanks,

rm

danielbmartin 10-30-2012 10:04 AM

Quote:

Originally Posted by rm_-rf_windows (Post 4818100)
How would I alter the above if I wanted to redirect wget to stdout instead of creating and removing a temp.html file?

I don't know wget. It's outside my scope of knowledge. Therefore this is a guess. Just make wget the first command in your pipe.
Code:

wget ((something or other))  \
|sed 's/<div/\n<div/g' \
|grep '^<div class="entry">'  \
|sed 's#^<div class="entry">\(.*\) </div>#\1#g' \
>> outfile.txt

Daniel B. Martin

David the H. 10-30-2012 11:24 AM

To be honest, line- and regex-based tools like sed are not well suited to html/xml, because of the format's flexible, nested, tag-based nature. While you can use them if the formatting is regular and well-structured, there's always a chance that they will fail.

In the long run it's better to use a tool with a dedicated parser for the syntax.


I've been playing with xmlstarlet recently, and you can use it fairly easily to extract data from xhtml-formatted files.

First, let's use a file that's actually formatted in proper html:
Code:

<html>
<body>

<!-- Lots and lots of html code.. -->

<div id="results" class="a">
    <div class="entry">elephant...</div><div class="entry">whale...</div>
    <div class="entry">snake...</div><div class="notanentry">etc...</div>
</div>

<!-- Lots more html code.. -->

</body>
</html>

Next, we can use htmltidy to convert it to xhtml:
Code:

tidy -n -asxml file.html 2>/dev/null >file.xhtml
Now we can use xmlstarlet to extract the values we want:

Code:

xmlstarlet sel --html -T -t -v '//*[@id="results"]/*[@class="entry"]' -n file.xhtml
The -v expression matches any entry in the file with the id of "results", and with the next level having a class of "entry", and prints the value.


The output, natch...

Code:

elephant...
whale...
snake...

xmlstarlet uses the xpath language, which can be a bit confusing at first, but is really powerful once you know how to use it (caveat: I'm still rather a beginner myself).

To match a specific kind of html tag, for instance, you apparently need to use the name function:

Code:

-v '//*[name()="div"][@id="results"]/*[@class="entry"]'
Now the first layer is limited to being a div tag.

Both tidy and xmlstarlet can read from stdin, BTW, so you could also pipe the commands together instead of using external files.

devnull10 10-30-2012 01:28 PM

To use with wget just do :

Code:

wget -O - | sed ...

ntubski 10-30-2012 01:35 PM

Quote:

Originally Posted by David the H. (Post 4818279)
Next, we can use htmltidy to convert it to xhtml:
Code:

tidy -n -asxml file.html 2>/dev/null >file.xhtml

I think you can also use
Code:

xmlstarlet fo --quiet --html file.html >file.xhtml
I'm not sure how the html parser compares to htmltidy.

Quote:

To match a specific kind of html tag, for instance, you apparently need to use the name function:
In general to match a tag you just put the tag name (eg: //div[@id="results"]) but you probably ran into the namespace issue.

David the H. 10-30-2012 02:17 PM

Thank you very much! That's great to know. I had wondered whether xmlstarlet could convert or work on html directly, but I couldn't locate how to do it in the documentation.

And yeah, I tried several things like '//div[@id="results"]' but couldn't get it to work, and the only example the documentation has for html uses the name function (actually local-name, but I deduced that you could replace it).

Using the fo command to convert the file appears to solve the namespace issue, so we should now be able to pipe it all together like this:

Code:

wget -O- source.com | xmlstarlet fo --quiet --html | xmlstarlet sel --html -T -t -v '//div[@id="results"]/div[@class="entry"]' -n
Please let me know if there's anything else we can do to shorten it up. I don't think we can do both operations with one instance, can we?

Edit: Hmm, I just ran a test, and at least with my example html file above, it looks like it works even without the formatting step. I know I've tried it before on other sources though without success. Perhaps it's rather finicky about the formatting?

ntubski 10-31-2012 02:58 AM

Quote:

Originally Posted by David the H. (Post 4818405)
Thank you very much! That's great to know. I had wondered whether xmlstarlet could convert or work on html directly, but I couldn't locate how to do it in the documentation.

The documentation could probably use some work.

Quote:

Code:

wget -O- source.com | xmlstarlet fo --quiet --html | xmlstarlet sel --html -T -t -v '//div[@id="results"]/div[@class="entry"]' -n
Please let me know if there's anything else we can do to shorten it up. I don't think we can do both operations with one instance, can we?
The sel subcommand doesn't take an --html option, although in principle there is no reason why it should not. You can use short options to fo (-Q for --quiet and -H for --html).

Quote:

Edit: Hmm, I just ran a test, and at least with my example html file above, it looks like it works even without the formatting step. I know I've tried it before on other sources though without success. Perhaps it's rather finicky about the formatting?
Your example html file is also valid XML because every tag is closed, but it's legal and common not to close some tags (eg <br>) in HTML, which would not be valid XML.

rm_-rf_windows 11-03-2012 03:42 PM

Hi all,

Wow, lots of response. Cool tools!

I can't seem to get it working on my end. Could someone give me a working example that has been tested? A simple html document and the command? (xmlstarlet)

Another problem I've been having... With sed, I'm able to insert new lines (\r or \n... or \\r \\n depending on whether an escape character is needed), but I can't seem to match a line break. Several problems, actually:
  • How do I reliably figure out which kind of line ending (linefeed, carriage return, the old Macintosh-style carriage return that shows up as ^M, etc.) an html document (or any document, for that matter) uses?
  • Once I've figured that out, how do I get sed to reliably match such characters in my regular expression?
My script works except when the stuff matched between the "/ ... \( \) ... / \1 ... /" parts includes an end of line.
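(For the first bullet, one quick way to see which endings a file actually uses is to dump the raw bytes; a sketch with a made-up sample file, where od -c shows a carriage return as \r and a linefeed as \n:)

```shell
# Hypothetical sample mixing DOS (CRLF) and Unix (LF) line endings.
printf 'dos line\r\nunix line\n' > sample.txt

# Show the raw bytes: CR appears as \r, LF as \n.
od -c sample.txt

# Count lines ending in a carriage return (bash's $'\r' is a literal CR).
grep -c $'\r$' sample.txt
```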


Thx,


rm

David the H. 11-03-2012 05:16 PM

I've already shown you a very simple example of how it works. For a more detailed version it would probably be better if you gave us an example or two of the html files you're likely to work with, and what you'd want extracted from them.

ntubski obviously knows much more about it than me, but I will try to cover a few things I've learned so far about xmlstarlet and xpath.

To extract data use the sel subcommand. --html/-H is needed for xhtml input, obviously, -T means output as plain text, and -t indicates the beginning of the "template" options. The -m template option can be used to match entries (acting as a "foreach" expression), and -v is used to print values on a successful match. But as we've seen, we can often just use -v alone for simple global matching and printing. Both single and double quote marks can be used in expressions for nested string grouping. The expressions can be quite finicky about proper quoting.

As for the xpath expressions, here are a few of the basics as I understand them.

/ at the front of a path entry matches tags at only that single, specific level.
// in front of an entry makes it recursively match all sub-levels from that point on.
@ references tag attribute names.
[] brackets are used to limit a match to certain criteria.
There are a mass of functions available for doing things like printing substrings or evaluating mathematical expressions. See the reference link I gave earlier.
. can be used to reference a previously matched value.

Code:

xmlstarlet sel -T -t -m '//div[@id="text"]/p[not(@class="ignore")]' -v '.' -n
Assuming the input xhtml is properly formatted, then this should match all p tags that exist directly under the div with the id of "text", and that don't have a class attribute of "ignore", and print their contents, followed by newlines.


At this point though I usually still have to just keep trying various combinations until I get what I want. It's all rather complex and there's a lot to learn, but it does get easier with experience.

rm_-rf_windows 11-04-2012 02:37 AM

Okay, I guess my problem is that the html / xhtml files are not properly formatted. They contain some scripts and comments, and it doesn't work.

The idea is to pull data from a large number of pages, so if each page has to be cleaned and reformatted first, xmlstarlet may not be the way to go.

Good tool though. I have done some XPath and XQuery, but it's been a while.

rm_-rf_windows 11-04-2012 04:54 PM

xmlstarlet is cool and so is the php5-tidy program. I've opted nevertheless for more standard bash tools because the html pages in question are not always well-formed and I'm downloading info from a large number of pages.

I've got some specific questions regarding sed. Here they are:

Here's my sed statement/command:

Code:

sed 's#^<div class="entry">\(.*\)</div>.*[ ]*.*#\1#g'
The context of this statement is:
Code:

sed 's/<div/\n<div/g' temp.html  \
| grep '^<div class="entry">'  \
| sed 's#^<div class="entry">\(.*\)</div>.*[ ]*.*#\1#g' \
>> outfile.txt

Here are two problems I encounter.

When the html code is:
Code:

...<div class="entry">Hello there!</div></a></li></ul></div>
My result is:
Code:

Hello there!</div></a></li></ul>
When my html code is:
Code:

<div class="entry">He <B>is</B> here </div>
My result is:
Code:

He <B>is</B> here
Whereas what I want is:
Code:

He is here
So more generally, how can I say (to sed):
  • stop at the first </div> even if more </div>'s follow... disregard everything after the first </div> (e.g., in <div>Hello there!</div></a></li></ul></div>)
  • between <div> and </div>, keep anything except angle brackets and whatever is inside them (e.g., for "He <B>is</B> here", I want "He is here")
Thx in advance,


rm

