LinuxQuestions.org


Bert 10-14-2002 05:19 AM

regular expression for parsing html tags
 
I have a file of HTML for which I'd like to return only the <IMG> tag attributes.

I've tried this:

grep -h -i "<[^>.?img.*?src.*^<]/>" plainhtmlfile.txt > imagetags.txt

The idea behind this regex was "give me the whole run of text between a left and a right angle bracket (an opening and closing HTML tag) with img and src appearing somewhere in there". It's giving me all the other crap in the file too, though, as it has to be greedy (the file could contain many attributes between img and src).

I'd like to return only the attributes inside the <IMG> tag.

Gaaah!

Does anyone have any ideas?
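
For reference, here is a minimal sketch of one way to keep a match inside a single tag: use [^>]* between "<img" and ">", so the pattern cannot run past the closing bracket however many attributes there are. It is written in Python rather than grep purely for illustration; the function name and sample string are made up, and it assumes no attribute value contains a literal ">".

```python
import re

# Match one <img ...> tag at a time. [^>]* cannot cross the closing '>',
# so the match never spills into neighbouring tags on the same line.
IMG_TAG = re.compile(r"<img\b([^>]*)>", re.IGNORECASE)

def img_attribute_strings(html):
    """Return the raw attribute text of every <img> tag in html."""
    results = []
    for match in IMG_TAG.finditer(html):
        # Drop surrounding whitespace and an XHTML-style trailing slash.
        results.append(match.group(1).strip().rstrip("/").strip())
    return results

# Hypothetical sample line, not taken from the original file.
sample = '<p>Intro <IMG SRC="/img/calcutta.jpg" ALT="Calcutta"> text <a href="x.html">link</a></p>'
print(img_attribute_strings(sample))
```

This only narrows the match to one tag; it does not parse the attributes, and badly formed markup can still defeat it.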

Bert 10-14-2002 12:20 PM

Well, here's how I did it (they use another OS at work btw):

Code:

# suggested usage:
# >perl imglinkxtract.pl

# this script simply prints the link attributes (such as src) of all the
# HTML <IMG> tags in an HTML or text file.
# the output goes to the console, so you might want to redirect it like so:
# >perl imglinkxtract.pl > imagetags.txt
# but you might need to use a http://www.cygwin.com bash shell to do this.


require HTML::LinkExtor;
$p = HTML::LinkExtor->new(\&cb);

my @imgs = ();

# this callback collects the values of each image tag's link attributes (e.g. src)
sub cb {
        my($tag, %attr) = @_;
        push (@imgs, values %attr) if $tag eq 'img';
}

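#todo: prompt the user for the file, not hardcoded.
# read the HTML input file; this must not be the same file the output is
# redirected to, or the shell empties it before the script can read it.
$p->parse_file("plainhtmlfile.txt");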

# print the output to the console with newlines
print join ("\n", @imgs), "\n";
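
For comparison, a rough Python sketch of the same parser-based idea, using only the standard library's html.parser module. It is not a port of the Perl script: it collects every attribute of each <img> tag rather than only the link attributes, and the class name, function name and sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "img":
            self.images.append(dict(attrs))

def img_attributes(html):
    """Return a list of attribute dicts, one per <img> tag in html."""
    parser = ImgAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.images

# Hypothetical sample markup, including an unclosed <b> tag.
sample = '<p><IMG SRC="/img/calcutta.jpg" width=100><b>caption</p>'
for attrs in img_attributes(sample):
    print(attrs)
```

To run it against a real file, the text could be read with open("plainhtmlfile.txt").read() and passed to img_attributes.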


vladkrack 10-14-2002 01:15 PM

Hi Bert,

Here's how I did it using sed:

# sed -n 's/.*\(img.src\)\=\([^[:space:]]*\).*/\2/p' plainhtmlfile.txt > imagetags.txt

and to strip the surrounding double quotes:

# sed -n 's/.*\(img.src\)\=\"\([^[:space:]]*\)\".*/\2/p' plainhtmlfile.txt > imagetags.txt

Bert 10-14-2002 04:31 PM

Hey vladkrack, thanks. That does it pretty nicely too.

I've found, though, that doing this with a stream editor sometimes returns the path as well, and it appears to struggle with long and funky filenames. The output can look like this in places:

...
...
"/img/calcutta.jpg"
"9884_claudius.gif"
"/img/WWIKaiserWilhelmII.jpg"
"/img/charlemagneinpomp.jpg"
"/img/Chartism
"/img/Pankhurst,
"/img/castlescotland.jpg"
...
...

Of course this has nothing to do with the efficiency of your regex, but with the shoddy quality of the htmltags.txt files, which were put together by end users who <b> think <u> nothing </b> of </u> nesting tags and using narratives instead of file naming conventions.jpg!
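
A small sketch of what is probably going on with the cut-off entries, assuming those file names contain spaces: a capture built from "anything that is not whitespace" stops at the first space, while a capture that runs to the closing double quote keeps the whole name. It uses Python's re module rather than sed, and the file name is a made-up example.

```python
import re

# Hypothetical tag with spaces in the file name.
line = '<img src="/img/my holiday photo.jpg" alt="hypothetical example">'

# Same idea as [^[:space:]]* in sed: the capture ends at the first whitespace.
up_to_space = re.search(r'img.src=(\S*)', line).group(1)

# Capturing everything up to the closing double quote keeps the spaces.
up_to_quote = re.search(r'img.src="([^"]*)"', line).group(1)

print(up_to_space)   # "/img/my
print(up_to_quote)   # /img/my holiday photo.jpg
```

The second pattern relies on the value being double-quoted, so unquoted or single-quoted values would need separate handling.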

The advantage of doing it with perl is that it uses a built-in HTML parser (which is almost certainly cheating ...)

:D


All times are GMT -5.