LinuxQuestions.org
Old 03-22-2011, 08:05 AM   #1
trscookie
Member
 
Registered: Apr 2004
Location: oxford
Distribution: gentoo
Posts: 463

Rep: Reputation: 30
Java regex for links:


Hello all,

I'm trying to extract the href of a <link> tag from an HTML page. However, because some link tags contain other attributes before the href, my regex fails to match them. Do you have any idea how I can write this?

Link:
PHP Code:
<link rel="stylesheet" type="text/css" media="screen,print" href="Home_files/Home.css" /> 
regex:
Code:
"(?i)<link\\s*href="
I'm trying to extract Home_files/Home.css. Thanks in advance.

trscookie
 
Old 03-22-2011, 08:18 AM   #2
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
You need to extend the regex so that it skips over the other attributes, then capture the href value in a group and extract it. This is from memory, so treat it as a rough approach:

Code:
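import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 captures the text between the quotes after href=.
// (.*?) is reluctant, so it stops at the first closing quote.
Pattern p = Pattern.compile("<link.*?href=\"(.*?)\".*?/>");
// Quotes inside a Java string literal must be escaped.
Matcher m = p.matcher("<link rel=\"stylesheet\" type=\"text/css\" media=\"screen,print\" href=\"Home_files/Home.css\" />");

if (m.find()) {
    System.out.println(m.group(1)); // prints Home_files/Home.css
}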
This should print out the first group: in your case, the bit between the two quotes after the 'href' in a 'link' tag. Again, it's untested, but you should be able to adapt it so that it works.

Hope this helps,
 
Old 03-22-2011, 12:25 PM   #3
trscookie
Member
 
Registered: Apr 2004
Location: oxford
Distribution: gentoo
Posts: 463

Original Poster
Rep: Reputation: 30
Ah, just got another quick question: for some reason my regex is skipping one image.

Regex:
Code:
"(?i)<img(.*?)src\\s*=\\s*[\"'](.*?)[\"']"
Finding images:
Code:
Image Found: Home_files/logonew.png
Image Found: Home_files/shapeimage_1.png
Image Found: Home_files/shapeimage_2.jpg
But not finding this one:
Code:
:<img src="Home_files/logonew.png" alt="" style="border: none; height: 425px; width: 230px; " />
:<img usemap="#map1" id="shapeimage_1" src="Home_files/shapeimage_1.png" 
style="border: none; height: 359px; left: -6px; position: absolute; top: -5px; width: 226px; z-index: 1;  title="" />
<map name="map1" id="map1">
<area href="" title="" onmouseover="IMmouseover('shapeimage_1', '0');" alt="" 
onmouseout="IMmouseout('shapeimage_1', '0');" shape="rect" coords="11, 207, 177, 225" /></map>
<img style="height: 18px; left: 5px; position: absolute; top: 202px; width: 166px; "
 id="shapeimage_1_link_0" alt="shapeimage_1_link_0" src="Home_files/shapeimage_1_link_0.png" />
: <img src="Home_files/shapeimage_2.jpg" alt="" style="height: 301px; left: 0px; position: absolute; top: 0px; width: 723px; " />
Do you know why it wouldn't find the one in red (shapeimage_1_link_0.png)?

trscookie.

Last edited by trscookie; 03-22-2011 at 12:28 PM.
 
Old 03-22-2011, 12:54 PM   #4
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
The only thing I can see is that the tag is spread over two lines. It's possible the regex doesn't match across the line break (by default, '.' does not match line terminators). See if moving it all onto one line fixes it.

EDIT: Got home and tested it, and yes, the newline causes problems.
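As a rough illustration of the newline problem, here is a minimal sketch. The class name `NewlineDemo`, the helper `firstSrc`, and the shortened sample tags are made up for this example; only the regex is the one from post #3. With default flags, `.` does not match line terminators, so `(.*?)` cannot get from `<img` on one line to `src` on the next.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NewlineDemo {
    // The regex from post #3, unchanged (default flags plus case-insensitive).
    static final Pattern IMG =
            Pattern.compile("(?i)<img(.*?)src\\s*=\\s*[\"'](.*?)[\"']");

    // Hypothetical helper: src value of the first <img> tag, or null if nothing matches.
    static String firstSrc(String html) {
        Matcher m = IMG.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Same tag twice: once on a single line, once broken after the style attribute.
        String oneLine = "<img style=\"height: 18px;\" id=\"x\" src=\"Home_files/a.png\" />";
        String twoLines = "<img style=\"height: 18px;\"\n id=\"x\" src=\"Home_files/a.png\" />";

        System.out.println(firstSrc(oneLine));  // prints Home_files/a.png
        System.out.println(firstSrc(twoLines)); // prints null
    }
}
```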

Last edited by Snark1994; 03-22-2011 at 01:22 PM.
 
Old 03-22-2011, 01:09 PM   #5
trscookie
Member
 
Registered: Apr 2004
Location: oxford
Distribution: gentoo
Posts: 463

Original Poster
Rep: Reputation: 30
I think you are right. I have changed it to:

Code:
"(?im)<img(.*?)src\\s*=\\s*[\"'](.*?)[\"']"
However, this doesn't seem to have fixed it. Are there any other options I can use?
 
Old 03-22-2011, 01:38 PM   #6
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
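Try using 's' instead of 'm'. The 'm' flag (MULTILINE) only changes where ^ and $ match; 's' (DOTALL) is the one that lets '.' match line terminators.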
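A small sketch of the difference between the two flags, assuming the same regex body as in the thread. The class name `FlagDemo`, the helper `firstSrc`, and the sample tag are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Regex body from the thread; the inline flags are prepended per call.
    static final String BODY = "<img(.*?)src\\s*=\\s*[\"'](.*?)[\"']";

    // Hypothetical helper: compile with the given inline flags and return the first src, or null.
    static String firstSrc(String flags, String html) {
        Matcher m = Pattern.compile(flags + BODY).matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // An <img> tag whose attributes are split across two lines.
        String html = "<img style=\"height: 18px;\"\n id=\"x\" src=\"Home_files/a.png\" />";

        System.out.println(firstSrc("(?im)", html)); // MULTILINE: prints null
        System.out.println(firstSrc("(?is)", html)); // DOTALL: prints Home_files/a.png
    }
}
```

The same flags can also be passed as constants, for example `Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)`.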
 
Old 03-22-2011, 08:06 PM   #7
trscookie
Member
 
Registered: Apr 2004
Location: oxford
Distribution: gentoo
Posts: 463

Original Poster
Rep: Reputation: 30
Just worked out that it's because I have multiple occurrences on the same line. What's the best way to split that up?

Cheers again,
trscookie.
 
Old 03-23-2011, 11:27 AM   #8
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
Hm, reading the Matcher docs, it looks like you should call the find() method repeatedly: each call moves on to the next match in the input string. It depends how your code is laid out, though (I'm guessing you're feeding the lines to the matcher one by one?)
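Something along these lines might work, as a sketch only. The class name `AllImages`, the method `allSources`, and the sample input are assumptions; the regex is the one from the thread with the 's' flag added. The key point is calling `find()` in a `while` loop rather than a single `if`, so several tags on one line are all picked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllImages {
    // Regex from the thread, with 's' (DOTALL) so '.' can cross line breaks.
    static final Pattern IMG =
            Pattern.compile("(?is)<img(.*?)src\\s*=\\s*[\"'](.*?)[\"']");

    // Hypothetical helper: collect the src value of every <img> tag in the input.
    static List<String> allSources(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {          // each call resumes after the previous match
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Two tags on one line, plus one tag split across two lines.
        String html = "<img src=\"a.png\" /><img id=\"b\" src=\"b.png\" />\n"
                + "<img\n src=\"c.jpg\" />";
        System.out.println(allSources(html)); // prints [a.png, b.png, c.jpg]
    }
}
```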
 
Old 03-23-2011, 11:53 AM   #9
trscookie
Member
 
Registered: Apr 2004
Location: oxford
Distribution: gentoo
Posts: 463

Original Poster
Rep: Reputation: 30
Hmm, I tried the .find() option, but it would only find one per line. I've cheated a little and split the string at the end of each tag like so: for(String tag : string.split(">")), and it seems to work. Cheers for your help.
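For reference, the split workaround might look roughly like this. It is a sketch under assumptions: the class name `SplitDemo`, the method `sourcesBySplitting`, and the sample input are made up, and the 's' flag is included so a tag that spans two lines still matches. Note that splitting on ">" would break on any attribute value that itself contains a ">".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitDemo {
    // Regex from the thread, with 's' (DOTALL) added.
    static final Pattern IMG =
            Pattern.compile("(?is)<img(.*?)src\\s*=\\s*[\"'](.*?)[\"']");

    // Hypothetical helper: cut the input at every '>' and test each piece once.
    static List<String> sourcesBySplitting(String html) {
        List<String> found = new ArrayList<>();
        for (String tag : html.split(">")) {
            Matcher m = IMG.matcher(tag);
            if (m.find()) {         // at most one <img ... src=...> per piece
                found.add(m.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\" /><img id=\"b\" src=\"b.png\" />";
        System.out.println(sourcesBySplitting(html)); // prints [a.png, b.png]
    }
}
```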
 
Old 03-24-2011, 10:45 AM   #10
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
No problem. It was nice to dig out Java again.
 
  


All times are GMT -5. The time now is 11:41 AM.
