regexp help ...

pld · 03-15-2005, 02:52 PM

Hi all,

I'm still learning regular expressions, and I have a little project I want to use them on.

I have a raw HTML page that I want to parse a handful of components out of. The string I am searching for is something like:

<div class="myclass"...

Now, unfortunately, there is another place in the file where this occurs in the middle of a line instead of at the start of one:

<div id="something else"><div class="myclass"...

and what I want to grab has no other elements before it on its line:

<div class="myclass"

I believe there is whitespace before these lines in some cases. So what type of regexp would I use to single out the element I am looking for, with nothing else in front of it? Oh, and I'm grepping the file for these lines, then using awk later on for the rest of the data extraction...

rose_bud4201 · 03-15-2005, 03:45 PM

I would probably do something like

$ cat testfile | grep "whatever you're looking for" | grep "^<div class=\"myclass\""

Edit: I realized that I should probably add some more information.

I used cat because it's easier when chaining commands like this, and used grep again (instead of awk) mainly because I understand grep, and know next to nothing of awk. The regexp should be the same either way.

The ^ character specifies the beginning of the line. So ^stuff would find the word "stuff" in a file if and only if it occurred at the beginning of the line. It would not find "randomstuff", for example. ^random would work, however.
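Since the question mentions possible leading whitespace, one way to allow for it is to put [[:space:]]* right after the ^. This is only a rough sketch: the sample lines and the sample_html helper are made up for illustration, and the real file may need a different pattern.

```shell
# Hypothetical sample input; the real page will look different.
sample_html() {
  printf '%s\n' \
    '<div class="myclass">wanted</div>' \
    '   <div class="myclass">wanted, indented</div>' \
    '<div id="something else"><div class="myclass">not wanted</div></div>'
}

# ^ anchors at the start of the line, [[:space:]]* allows optional
# leading whitespace before the tag.
pattern='^[[:space:]]*<div class="myclass"'

matches=$(sample_html | grep "$pattern")
count=$(sample_html | grep -c "$pattern")

echo "$matches"   # the first two lines only
echo "$count"     # 2
```

The nested div on the third line is skipped because something other than whitespace comes before class="myclass" on that line.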

Conversely, the $ character specifies the end of a line. So stuff$ would find the word "stuff" in a file if and only if it occurred at the end of the line. It _would_ find "stuff" in "randomstuff", but random$ would not match that line.
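To see both anchors side by side, here is a small sketch run against three made-up lines (they are just for illustration, not from any real file):

```shell
# Made-up sample lines.
sample=$(printf '%s\n' 'stuff happens' 'randomstuff' 'random')

# ^stuff: "stuff" must be at the start of the line.
starts=$(printf '%s\n' "$sample" | grep '^stuff')

# stuff$: "stuff" must be at the end of the line.
ends=$(printf '%s\n' "$sample" | grep 'stuff$')

echo "$starts"   # stuff happens
echo "$ends"     # randomstuff
```

Single quotes around the patterns keep the shell from trying to interpret the $ itself.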

Hope that helps!