LinuxQuestions.org
Old 07-19-2005, 04:39 AM   #1
vharishankar
Senior Member
 
Registered: Dec 2003
Posts: 3,142
Blog Entries: 4

Rep: Reputation: 121Reputation: 121
Regular expression to match a valid URL string


I want a regular expression to validate an input with the format for a URL (http only).

What should I check for?

I know that a valid HTTP URL should begin with http://

But what other conditions should I check for? I can build the expression myself, but I need some pointers on the rules for a valid URL that points to a web page.

Allowed chars? Format? And so on.

Regards.
 
Old 07-19-2005, 04:57 AM   #2
vharishankar
Senior Member
 
Registered: Dec 2003
Posts: 3,142
Blog Entries: 4

Original Poster
Rep: Reputation: 121Reputation: 121
Is this a sufficient regular expression?

Code:
http://[\d\w][-._\d\w]*[\d\w]
http:// followed by a word/digit character, followed by any of word/digit/dot/dash/underscore any number of times including 0 times, followed by a word or a digit character.

NOTE:

There is another question. The eregi function in PHP looks for a match anywhere within the string; that is, it returns a valid result even if only a substring matches the regexp.

How do I make sure that the whole string matches the regular expression?

I tried comparing the return value of the function with the length of the total string, but that didn't work.
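
On the whole-string question: the usual approach is to anchor the pattern, with ^ at the start and $ at the end, so that nothing may come before or after the match. A minimal sketch of the difference, written in Python rather than PHP and using the expression from this post unchanged (the names PATTERN, matches_anywhere and matches_whole are made up for illustration):

```python
import re

# The expression from the post, unchanged.
PATTERN = r"http://[\d\w][-._\d\w]*[\d\w]"

def matches_anywhere(text):
    """True if any substring matches (the behaviour described for eregi)."""
    return re.search(PATTERN, text, re.IGNORECASE) is not None

def matches_whole(text):
    """True only if the entire string matches, as an anchored pattern would."""
    return re.fullmatch(PATTERN, text, re.IGNORECASE) is not None
```

With these, "junk http://example.com junk" passes matches_anywhere but fails matches_whole, while "http://example.com" passes both. In PHP the same effect should come from writing the pattern as ^http://...$ before handing it to the matching function.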

Last edited by vharishankar; 07-19-2005 at 05:05 AM.
 
Old 07-19-2005, 07:16 AM   #3
taylor_venable
Member
 
Registered: Jun 2005
Location: Indiana, USA
Distribution: OpenBSD, Ubuntu
Posts: 892

Rep: Reputation: 40
Valid URL Suggestions

That looks good for a first stab, and I'm not able to come up with a fool-proof way to match a valid URL by myself, but I noticed a few things about your expression:

(1) URLs can contain an ending slash
(2) URLs can contain escaped characters, like %20 for space
(3) \w includes \d (\w matches [a-zA-Z_0-9])
(4) You may need to escape / and :

Also, do you want to consider GET data from CGI scripts? If so, you need to add at least ? = &

This definitely isn't a complete list of everything you need, but hopefully it will help somewhat.
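
To make the points above concrete, here is a rough sketch in Python (not PHP) of a pattern that allows a trailing slash, %XX escapes and simple GET data. The allowed character sets and every name in it (LABEL, HOST, ESCAPE, PATH, QUERY, URL_RE, looks_like_http_url) are assumptions for illustration, not a complete or standards-exact rule set:

```python
import re

# Assumed building blocks for illustration; none of these names come from the thread.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"        # one host label, no leading/trailing dash
HOST = LABEL + r"(?:\." + LABEL + r")*"            # labels joined by single dots
ESCAPE = r"%[0-9a-f]{2}"                           # point (2): escapes such as %20
PATH = r"(?:/(?:[\w.~-]|" + ESCAPE + r")*)*"       # point (1): a trailing slash is allowed
QUERY = r"(?:\?(?:[\w.~=&-]|" + ESCAPE + r")*)?"   # optional GET data using ? = &

# Point (3): \w already covers digits, so \d is never listed separately.
URL_RE = re.compile(r"http://" + HOST + PATH + QUERY, re.IGNORECASE | re.ASCII)

def looks_like_http_url(text):
    """Return True if the whole string has the assumed http URL shape."""
    return URL_RE.fullmatch(text) is not None
```

Building the pattern from named pieces keeps each rule readable and easy to loosen or tighten later. Python needs no escaping for / and : here, so point (4) only matters in languages that use / as the regex delimiter.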
 
Old 07-19-2005, 08:17 AM   #4
vharishankar
Senior Member
 
Registered: Dec 2003
Posts: 3,142
Blog Entries: 4

Original Poster
Rep: Reputation: 121Reputation: 121
Thanks for the tips. I am not a guru at regular expressions.

I used the visual KDE regular expression editor to create the regexp string.

Are you sure that the syntax has a problem there? Because an earlier regexp created using it worked fine with the eregi function.

What are the potential pitfalls of the regexp you see there?
 
Old 07-19-2005, 08:37 AM   #5
keefaz
Senior Member
 
Registered: Mar 2004
Distribution: Slackware
Posts: 4,559

Rep: Reputation: 121Reputation: 121
You could also avoid regexp
PHP Code:

// Try to open the URL for reading; @ suppresses the warning if it fails.
$test = @fopen($url, 'r');

if (!$test) {
    echo 'this url could not be reached, I consider it invalid';
} else {
    echo 'I can open this url so I consider it valid';
    fclose($test);
}


Last edited by keefaz; 07-19-2005 at 08:46 AM.
 
Old 07-19-2005, 08:37 AM   #6
vharishankar
Senior Member
 
Registered: Dec 2003
Posts: 3,142
Blog Entries: 4

Original Poster
Rep: Reputation: 121Reputation: 121
The real question here is not which characters are allowed but the sequences in which they are allowed.

For example my current expression validates even:

http://www............asdasd..asd///////////as.dsd.a????as

If you get the idea...

It's really tough building a regexp when you have only a vague idea of what you're trying to validate.

If anybody can help me with this, I'd be really grateful.

EDIT: I guess the LQ URL validator also has the same reg exp problem in recognizing valid URLs... the string I typed was parsed as a URL.
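
One way to tackle the sequence problem is to split the URL into its parts first and then check each part on its own, instead of writing one big expression. A minimal sketch in Python, assuming that empty host labels (two dots in a row) and repeated slashes in the path should be rejected; both rules are policy choices made up for this example, as are the names LABEL_RE and has_sane_structure:

```python
import re
from urllib.parse import urlsplit

# Assumed rule: a host label is letters, digits and inner dashes, never empty.
LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", re.IGNORECASE)

def has_sane_structure(url):
    """Check scheme, host labels and path separately (hypothetical rules)."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme != "http" or not parts.hostname:
        return False
    # "www....asd" splits into empty labels, which the label rule rejects.
    if not all(LABEL_RE.fullmatch(label) for label in parts.hostname.split(".")):
        return False
    # Policy choice for this sketch: no empty path segments.
    if "//" in parts.path:
        return False
    return True
```

The string http://www............asdasd..asd///////////as.dsd.a????as fails the host label check here, while http://www.example.com/index.html passes.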

Last edited by vharishankar; 07-19-2005 at 08:38 AM.
 
Old 07-19-2005, 08:41 AM   #7
vharishankar
Senior Member
 
Registered: Dec 2003
Posts: 3,142
Blog Entries: 4

Original Poster
Rep: Reputation: 121Reputation: 121
Quote:
You could also avoid regexp
I thought of that, keefaz, but I read the documentation of fopen and I believe that opening a URI depends on the php.ini settings, doesn't it? That is, if fopen is not allowed to open URI resources then it will return an error. Again, on my shared host I cannot change the php.ini settings, so if that option is disabled then I'm stuck, no?

I wanted a more "compatible" solution, but I guess your idea is quite good. I will use it as a last resort, thanks!
 
Old 07-19-2005, 08:48 AM   #8
keefaz
Senior Member
 
Registered: Mar 2004
Distribution: Slackware
Posts: 4,559

Rep: Reputation: 121Reputation: 121
If your web hosting provider does not allow fopen url's,
then it is not a good provider anyway, avoid it too
 
Old 07-19-2005, 09:19 AM   #9
taylor_venable
Member
 
Registered: Jun 2005
Location: Indiana, USA
Distribution: OpenBSD, Ubuntu
Posts: 892

Rep: Reputation: 40
An Almost-Working Perl Idea

Here's a little idea I've been cooking up in Perl - the script parsing has an issue that I can't put my finger on, though.

Code:
if($url =~ m/http:\/\/
    \w+
    \.
    ([\w\.]+)
    \/*
    (?(?<=\?)
        \w+=\w+
        (?(?<=&)
            \w+=\w+
        )+
    )/x) {
    print "VALID\t\t$url\n";
} else {
    print "NOT VALID\t$url\n";
}
It seems to work OK for basic URLs, but when it gets into the CGI parsing, the URL always passes, even for something like "http://www.foo.com/script.pl?alpha", which I'm pretty sure isn't valid. It probably has to do with those zero-width lookbehind thingies that I don't quite understand.
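
A likely explanation for the Perl attempt above: a conditional of the form (?(condition)yes-pattern) with no "no" branch matches the empty string whenever the condition is false, and the lookbehind for ? is false at that point because nothing earlier in the pattern consumes a ?. The pattern also has no anchor at the end, so anything after the matched prefix is ignored. Here is a sketch of the same idea using plain optional groups instead of lookbehind conditionals, in Python; the path character set and the names URL_WITH_QUERY and is_valid are assumptions for illustration:

```python
import re

URL_WITH_QUERY = re.compile(
    r"""
    http://
    \w+ \. [\w.]+            # host, kept as loose as in the Perl attempt
    (?: / [\w./-]* )?        # optional path (assumed character set)
    (?: \? \w+=\w+           # optional query: first key=value pair
        (?: & \w+=\w+ )*     # any further pairs, each introduced by &
    )?
    """,
    re.VERBOSE,
)

def is_valid(url):
    """True only if the whole string matches, so leftovers like ?alpha fail."""
    return URL_WITH_QUERY.fullmatch(url) is not None
```

With the whole string required to match, "http://www.foo.com/script.pl?alpha" is rejected because ?alpha is not a key=value pair, while "http://www.foo.com/script.pl?alpha=1&beta=2" is accepted. The host part is still as permissive as the original and would accept consecutive dots.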
 
Old 07-19-2005, 09:29 AM   #10
vharishankar
Senior Member
 
Registered: Dec 2003
Posts: 3,142
Blog Entries: 4

Original Poster
Rep: Reputation: 121Reputation: 121
Quote:
If your web hosting provider does not allow fopen url's,
then it is not a good provider anyway, avoid it too
It's not a case of that. Since I'm planning to release it under the GPL and allow anybody to use it, I thought it should be more generic.
 
Old 07-19-2005, 09:47 AM   #11
keefaz
Senior Member
 
Registered: Mar 2004
Distribution: Slackware
Posts: 4,559

Rep: Reputation: 121Reputation: 121
For my part, I'd put an indication that my program requires
allow_url_fopen = On in php.ini.

Some PHP packages require PEAR, some require PHP compiled
and installed as CGI, some require the GD library, etc.
 
Old 07-19-2005, 10:14 AM   #12
vharishankar
Senior Member
 
Registered: Dec 2003
Posts: 3,142
Blog Entries: 4

Original Poster
Rep: Reputation: 121Reputation: 121
Ok. In any case I need to check if the http:// is part of the URL because otherwise the <a href=""> tag in HTML treats it as a relative URL.
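
The prefix check itself needs no regexp. A minimal sketch in Python, assuming that input without any scheme should get http:// prepended and that any other scheme should be refused; both behaviours and the name ensure_http_prefix are assumptions, not something settled in this thread:

```python
def ensure_http_prefix(url):
    """Return url with an http:// prefix, or raise ValueError for other schemes."""
    url = url.strip()
    if url.lower().startswith("http://"):
        return url                      # already absolute, leave untouched
    if "://" in url:
        # Some other scheme (ftp://, https://, ...): outside the http-only scope.
        raise ValueError("only http:// URLs are accepted")
    return "http://" + url              # bare host such as www.example.com
```

For example, ensure_http_prefix("www.example.com") gives "http://www.example.com", which can then go into the href without being treated as a relative URL.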
 
Old 07-21-2005, 06:02 PM   #13
hlyrad
Member
 
Registered: Jul 2005
Location: Ab Ca
Distribution: Redhat EL Sun Mac OSX FC 3.0 & 4.0
Posts: 44

Rep: Reputation: 15
Here's a lovely little regexp already submitted 5 years ago. It takes a URL and changes it into a link.
Check it out.
http://aspn.activestate.com/ASPN/Coo...x/Recipe/59864
 
Old 07-21-2005, 09:17 PM   #14
vharishankar
Senior Member
 
Registered: Dec 2003
Posts: 3,142
Blog Entries: 4

Original Poster
Rep: Reputation: 121Reputation: 121
That is a monster of a regular expression. I really don't need that level of strictness in any case.

Moreover, my aim is not to turn URLs within a text into links, just to validate the format of a URL that was entered.

Thanks for the link, though. I'll see if I can use that.
 
  



All times are GMT -5. The time now is 11:42 AM.
