LinuxQuestions.org
10-14-2002, 05:19 AM   #1
Bert
Senior Member
 
Registered: Jul 2001
Location: 406292E 290755N
Distribution: GNU/Linux Slackware 8.1, Redhat 8.0, LFS 4.0
Posts: 1,004

Rep: Reputation: 46
regular expression for parsing html tags


I have a file of HTML for which I'd like to return only the <IMG> tag attributes.

I've tried this:

grep -h -i "<[^>.?img.*?src.*^<]/>" plainhtmlfile.txt > imagetags.txt

but as far as I can tell, this regex says "give me the whole line of text between left and right angle brackets (opening and closing html tags) with img and src appearing somewhere in there". It's giving me all the other crap in the file too, though, presumably because it has to be greedy (there could be many other attributes between img and src).

I'd like to return only the attributes inside the <IMG> tag.

Gaaah!

Does anyone have any ideas?
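
A note on the attempt above: everything between [ and ] in that pattern is one bracket expression, so it matches a single character rather than the words img and src, and the POSIX regular expressions grep uses by default have no non-greedy .*? in any case. A minimal sketch of a line-based alternative follows, assuming a grep that supports -o (GNU grep does) and that no tag is split across lines. The function names img_tags and img_attrs are made up for illustration and are not from this thread.

```shell
# Sketch only: img_tags and img_attrs are hypothetical helper names.
# Assumes grep supports -o and that each <img> tag sits on a single line.

# Print every <img ...> tag found on stdin, one per line.
img_tags() {
    grep -o -i '<img[[:space:]][^>]*>'
}

# Reduce each tag to its attribute text by trimming "<img " and the closing ">" or "/>".
img_attrs() {
    img_tags | sed -e 's|^<[Ii][Mm][Gg][[:space:]]*||' -e 's|[[:space:]]*/\{0,1\}>$||'
}
```

Something like img_attrs < plainhtmlfile.txt > imagetags.txt would then write the attributes of one tag per line. A tag whose attribute values contain a literal > would still be cut short, which is one reason a real HTML parser is the safer route.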
 
10-14-2002, 12:20 PM   #2
Bert
Senior Member
 
Registered: Jul 2001
Location: 406292E 290755N
Distribution: GNU/Linux Slackware 8.1, Redhat 8.0, LFS 4.0
Posts: 1,004

Original Poster
Rep: Reputation: 46
Well, here's how I did it (they use another OS at work btw):

Code:
# suggested usage:
# >perl imglinkxtract.pl plainhtmlfile.txt

# this script prints the link attributes (src and the like) of all the HTML
# <IMG> tags in the HTML or text file named on the command line.
# the output goes to the console, so you might want to redirect it like so:
# >perl imglinkxtract.pl plainhtmlfile.txt > imagetags.txt
# (redirect to a different file than the one being read, otherwise the shell
# empties the input before perl gets to it)
# but you might need to use a http://www.cygwin.com bash shell to do this.

use strict;
use warnings;
use HTML::LinkExtor;

my @imgs = ();

# this subroutine collects the link attribute values of each image tag
sub cb {
	my ($tag, %attr) = @_;
	push @imgs, values %attr if $tag eq 'img';
}

# the file to parse comes from the command line instead of being hardcoded
my $file = shift @ARGV or die "usage: perl imglinkxtract.pl htmlfile\n";

my $p = HTML::LinkExtor->new(\&cb);
$p->parse_file($file) or die "can't parse $file: $!\n";

# print the output to the console, one value per line
print join("\n", @imgs), "\n";
 
10-14-2002, 01:15 PM   #3
vladkrack
Member
 
Registered: Oct 2002
Location: Curitiba - Brazil
Distribution: Conectiva
Posts: 334

Rep: Reputation: 30
Hi Bert,

Here's how I did it using sed:

# sed -n 's/.*\(img.src\)\=\([^[:space:]]*\).*/\2/p' plainhtmlfile.txt > imagetags.txt

and without the " quotes in the output:

# sed -n 's/.*\(img.src\)\=\"\([^[:space:]]*\)\".*/\2/p' plainhtmlfile.txt > imagetags.txt
 
10-14-2002, 04:31 PM   #4
Bert
Senior Member
 
Registered: Jul 2001
Location: 406292E 290755N
Distribution: GNU/Linux Slackware 8.1, Redhat 8.0, LFS 4.0
Posts: 1,004

Original Poster
Rep: Reputation: 46
Hey vladkrack, thanks. That does it pretty nicely too.

I've found, though, that doing this with a stream editor sometimes returns the path and appears to struggle with long and funky filenames. The output can do this in places:

...
...
"/img/calcutta.jpg"
"9884_claudius.gif"
"/img/WWIKaiserWilhelmII.jpg"
"/img/charlemagneinpomp.jpg"
"/img/Chartism
"/img/Pankhurst,
"/img/castlescotland.jpg"
...
...

Of course this has nothing to do with the efficiency of your regex, but with the shoddy quality of the htmltags.txt files, which were put together by end users who <b> think <u> nothing </b> of </u> nesting tags and using narratives instead of file naming conventions.jpg!

The advantage of doing it with perl is that it uses a built-in HTML parser (which is almost certainly cheating ...)
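
The truncated entries in the output above are consistent with [^[:space:]]* in the sed pattern, which stops at the first space or tab inside a value. A rough sketch that reads up to the closing quote instead is shown here, assuming double-quoted src values, a grep that supports -o, and tags that each sit on one line. The name img_src is hypothetical and not from this thread.

```shell
# Sketch only: img_src is a hypothetical helper name.
# Prints the quoted src value of each <img> tag on stdin, one per line.
# Assumes double-quoted values; unquoted or single-quoted values are skipped.
img_src() {
    grep -o -i '<img[[:space:]][^>]*>' |      # isolate each <img ...> tag
        grep -o -i '[[:space:]]src="[^"]*"' | # keep src="..." up to the closing quote
        sed 's/^[[:space:]][Ss][Rr][Cc]=//'   # drop the leading ' src='
}
```

Run as img_src < plainhtmlfile.txt, this keeps the surrounding quotes, like the first sed command does, so a value with spaces or commas in it comes back whole. It still cannot cope with tags broken across lines, so the parser-based perl script remains the more robust option.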
