LinuxQuestions.org
Old 10-14-2002, 06:19 AM   #1
Bert
Senior Member
 
Registered: Jul 2001
Location: 406292E 290755N
Distribution: GNU/Linux Slackware 8.1, Redhat 8.0, LFS 4.0
Posts: 1,004

Rep: Reputation: 46
regular expression for parsing html tags


I have an HTML file from which I'd like to extract only the <IMG> tag attributes.

I've tried this:

grep -h -i "<[^>.?img.*?src.*^<]/>" plainhtmlfile.txt > imagetags.txt

but of course, this regex is meant to say "give me the text between a left and a right angle bracket (the delimiters of an HTML tag) with img and src appearing somewhere in there". It's giving me all the other crap in the file too, though, as it has to be greedy (there could be many attributes between img and src inside the tag).

I'd like to return only the attributes inside the <IMG> tag.

Gaaah!

Does anyone have any ideas?
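
One likely reason the grep above misbehaves: everything inside the square brackets is read as a single bracket expression (one character not in that set), and grep's default syntax has no lazy `*?`. A minimal sketch of a pattern that stays inside one tag follows. It assumes a grep that supports `-o` (GNU and BSD grep do) and uses a made-up sample.html in place of plainhtmlfile.txt.

```shell
# Made-up sample standing in for plainhtmlfile.txt (contents are hypothetical).
cat > sample.html <<'EOF'
<p>Intro</p><IMG SRC="/img/a.jpg" ALT="A"><b>bold</b>
<a href="x.html"><img alt="B" src="/img/b.gif" /></a>
EOF

# -o prints only the matched text, -i ignores case.
# [^>]* cannot cross a closing '>', so each match stays inside a single tag.
grep -o -i '<img[^>]*>' sample.html > imagetags.txt
cat imagetags.txt
```

This prints each whole img tag on its own line. It will still miss tags that are split across lines or that contain a literal `>` inside an attribute value.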
 
Old 10-14-2002, 01:20 PM   #2
Bert
Original Poster
Well, here's how I did it (they use another OS at work btw):

Code:
# suggested usage:
# >perl imglinkxtract.pl plainhtmlfile.txt

# this script prints the link attributes (such as src) of all the HTML <IMG>
# tags in an HTML or text file; HTML::LinkExtor passes only link attributes
# to the callback.
# the output goes to the console, so you might want to redirect it like so:
# >perl imglinkxtract.pl plainhtmlfile.txt > imagetags.txt
# but you might need to use a http://www.cygwin.com bash shell to do this.

use strict;
use warnings;
use HTML::LinkExtor;

# file to parse: first command-line argument, defaulting to plainhtmlfile.txt
my $file = @ARGV ? $ARGV[0] : "plainhtmlfile.txt";

my @imgs = ();

# this callback collects the link attribute values of each <img> tag
sub cb {
	my($tag, %attr) = @_;
	push (@imgs, values %attr) if $tag eq 'img';
}

my $p = HTML::LinkExtor->new(\&cb);

# parse_file returns undef if the file cannot be opened
$p->parse_file($file) or die "cannot parse $file: $!\n";

# print the output to the console with newlines
print join ("\n", @imgs), "\n";
 
Old 10-14-2002, 02:15 PM   #3
vladkrack
Member
 
Registered: Oct 2002
Location: Curitiba - Brazil
Distribution: Conectiva
Posts: 334

Rep: Reputation: 30
Hi Bert,

Here's how I did it using sed:

# sed -n 's/.*\(img.src\)=\([^[:space:]]*\).*/\2/p' plainhtmlfile.txt > imagetags.txt

and without the surrounding quotes:

# sed -n 's/.*\(img.src\)="\([^[:space:]]*\)".*/\2/p' plainhtmlfile.txt > imagetags.txt
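
One thing to watch with this approach: the leading `.*` is greedy, so a line that holds several img tags yields only the last one. A small sketch of the effect and one possible workaround (putting each tag on its own line with `grep -o` before running sed) is below. The file sample.html and its contents are made up for illustration, and `grep -o` is assumed to be available.

```shell
# Made-up sample: two img tags on one line, plus a line with no image.
cat > sample.html <<'EOF'
<img src="/img/a.jpg" alt="A"><img src="/img/b.gif" alt="B">
<p>text</p>
EOF

# Run directly: the greedy leading .* keeps only the last img on the line.
sed -n 's/.*img.src="\([^[:space:]]*\)".*/\1/p' sample.html > direct.txt

# One tag per line first, then extract: both sources are reported.
grep -o -i '<img[^>]*>' sample.html |
  sed -n 's/.*img.src="\([^[:space:]]*\)".*/\1/p' > pertag.txt

cat direct.txt
cat pertag.txt
```

Here direct.txt ends up with only /img/b.gif, while pertag.txt lists both /img/a.jpg and /img/b.gif.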
 
Old 10-14-2002, 05:31 PM   #4
Bert
Original Poster
Hey vladkrack, thanks. That does it pretty nicely too.

I've found, though, that doing this with a stream editor sometimes returns the path and appears to struggle with long and funky filenames. The output can look like this in places:

...
...
"/img/calcutta.jpg"
"9884_claudius.gif"
"/img/WWIKaiserWilhelmII.jpg"
"/img/charlemagneinpomp.jpg"
"/img/Chartism
"/img/Pankhurst,
"/img/castlescotland.jpg"
...
...

Of course this has nothing to do with the efficiency of your regex and everything to do with the shoddy quality of the htmltags.txt files, which were put together by end users who <b> think <u> nothing </b> of </u> nesting tags and using narratives instead of file naming conventions.jpg!
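
The truncated entries are consistent with the capture group stopping at the first whitespace, which cuts any file name containing spaces. A hedged sketch of a quote-delimited variant follows. It assumes every src value is wrapped in double quotes on a single line, and the sample file and its file names are invented for illustration.

```shell
# Made-up sample; the first file name contains spaces.
cat > sample.html <<'EOF'
<img src="/img/Chartism in 1848.jpg" alt="x">
<img src="/img/calcutta.jpg" alt="y">
EOF

# Whitespace-delimited capture, as in the first sed command: cut at the first space.
sed -n 's/.*\(img.src\)=\([^[:space:]]*\).*/\2/p' sample.html > truncated.txt

# Quote-delimited capture: [^"]* runs up to the closing double quote.
sed -n 's/.*img.src="\([^"]*\)".*/\1/p' sample.html > full.txt

cat truncated.txt
cat full.txt
```

The first command reproduces the cut-off "/img/Chartism style of output, while the second keeps the whole name. Unquoted or single-quoted values would still need their own patterns.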

The advantage of doing it with perl is that it uses a real HTML parser, the HTML::LinkExtor module (which is almost certainly cheating ...)

 
  

