LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Regular expression to match a valid URL string (https://www.linuxquestions.org/questions/programming-9/regular-expression-to-match-a-valid-url-string-344565/)

vharishankar 07-19-2005 04:39 AM

Regular expression to match a valid URL string
 
I want a regular expression to validate an input with the format for a URL (http only).

What should I check for?

I know that a valid HTTP URL should begin with http://

But what other condition should I check for? I can build the expression myself, but I need some pointers on what are the rules for checking for a valid URL that should be a web page.

Allowed chars? Format? And so on.

Regards.

vharishankar 07-19-2005 04:57 AM

Is this a sufficient regular expression?

Code:

http://[\d\w][-._\d\w]*[\d\w]

That is: http:// followed by a word/digit character, followed by any of word/digit/dot/dash/underscore any number of times (including 0 times), followed by a word or digit character.

NOTE:

There is another question: the eregi() function in PHP looks for a match anywhere within the string; that is, even if only a substring matches the regexp, it returns a valid result.

How do I make sure that the whole string matches the regular expression?

I tried comparing the return value of the function with the length of the total string, but that didn't work.
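One standard answer to the whole-string question is to anchor the pattern with ^ (start of string) and $ (end of string), so only a complete match counts. A minimal sketch using preg_match() (the PCRE function; the same anchors work in ereg/eregi), with the pattern from the post above:

```php
<?php
// Anchor the expression with ^ (start) and $ (end) so that only a
// complete match of the whole input counts, not just a substring.
function matches_whole_string($url)
{
    // Same pattern as above, wrapped in ^...$ (using # as the PCRE
    // delimiter, so the / in http:// needs no escaping).
    return preg_match('#^http://[\d\w][-._\d\w]*[\d\w]$#', $url) === 1;
}
```

With the anchors in place, a URL embedded in surrounding text no longer passes.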

taylor_venable 07-19-2005 07:16 AM

Valid URL Suggestions
 
That looks good for a first stab, and I'm not able to come up with a fool-proof way to match a valid URL myself, but I noticed a few things about your expression:

(1) URLs can contain an ending slash
(2) URLs can contain escaped characters, like %20 for space
(3) \w already includes \d (\w matches [a-zA-Z0-9_])
(4) You may need to escape / and :

Also, do you want to consider GET data from CGI scripts? If so, you need to allow at least ?, = and &.

This definitely isn't a complete list of everything you need, but hopefully it will help somewhat.
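Putting those points together, one possible tightening of the expression might look like the sketch below. It is only illustrative, not a complete URL validator: it deliberately ignores ports, fragments, https, and the full character set allowed by the URL specification, and it assumes the host must contain at least one dot.

```php
<?php
// A tightened (but still incomplete) HTTP URL check, based on the
// points above: dotted host labels, an optional path that may end in
// a slash and contain %-escapes, and an optional ?key=value&... query.
function looks_like_http_url($url)
{
    $pattern = '#^http://'
             . '[\w-]+(\.[\w-]+)+'                      // host: at least two dotted labels
             . '(/[\w.~%-]*)*'                          // optional path segments, may end in /
             . '(\?[\w-]+=[\w%-]*(&[\w-]+=[\w%-]*)*)?'  // optional query string
             . '$#';
    return preg_match($pattern, $url) === 1;
}
```

Because the host part requires word characters between dots, strings of consecutive dots no longer slip through.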

vharishankar 07-19-2005 08:17 AM

Thanks for the tips. I am not a guru at regular expressions.

I used the visual KDE regular expression editor to create the regexp string. ;)

Are you sure that the syntax has a problem there? Because an earlier regexp created using it worked fine with the eregi function.

What are the potential pitfalls of the regexp you see there?

keefaz 07-19-2005 08:37 AM

You could also avoid regexp
PHP Code:

$test = @fopen($url, 'r');

if (!$test) {
    echo 'this url could not be reached, I consider it invalid';
} else {
    echo 'I can open this url so I consider it valid';
    fclose($test);
}

vharishankar 07-19-2005 08:37 AM

The real question here is not what characters are allowed, but the sequences in which they are allowed.

For example my current expression validates even:

http://www............asdasd..asd///////////as.dsd.a????as

If you get the idea...

It's really tough building a regexp when you have only a vague idea of what you're trying to validate.

If anybody can help me with this, I'd be really grateful.:D

EDIT: I guess the LQ URL validator also has the same regexp problem in recognizing valid URLs... ;) The string I typed above was parsed as a URL.

vharishankar 07-19-2005 08:41 AM

Quote:

You could also avoid regexp
I thought of that, keefaz, but from the fopen() documentation I believe that opening a URL depends on the php.ini settings, doesn't it? That is, if fopen() is not allowed to open URL resources, it will return an error. And on my shared host I cannot change the php.ini settings, so if that option is disabled then I'm stuck, no?
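For what it's worth, that setting can be detected at runtime with ini_get(), so a script could use the fopen() check where it is allowed and fall back to a plain format check otherwise. A rough sketch (the $allow_network parameter is a hypothetical addition here, so the fallback path can be exercised without touching php.ini):

```php
<?php
// Prefer the live fopen() check when allow_url_fopen permits it,
// otherwise fall back to a purely syntactic check.
function url_seems_valid($url, $allow_network = null)
{
    if ($allow_network === null) {
        $allow_network = (bool) ini_get('allow_url_fopen');
    }
    if ($allow_network) {
        $handle = @fopen($url, 'r');   // network check
        if ($handle) {
            fclose($handle);
            return true;
        }
        return false;
    }
    // Fallback: format check only, using the regexp from earlier.
    return preg_match('#^http://[\d\w][-._\d\w]*[\d\w]#', $url) === 1;
}
```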

I wanted a more "compatible" solution, but I guess your idea is quite good. I will use it as a last resort, thanks!

keefaz 07-19-2005 08:48 AM

If your web hosting provider does not allow fopen url's,
then it is not a good provider anyway, avoid it too ;)

taylor_venable 07-19-2005 09:19 AM

An Almost-Working Perl Idea
 
Here's a little idea I've been cooking up in Perl - the script parsing has an issue that I can't put my finger on, though.

Code:

if($url =~ m/http:\/\/
    \w+
    \.
    ([\w\.]+)
    \/*
    (?(?<=\?)
        \w+=\w+
        (?(?<=&)
            \w+=\w+
        )+
    )/x) {
    print "VALID\t\t$url\n";
} else {
    print "NOT VALID\t$url\n";
}

It seems to work OK for basic URLs, but when it gets into the CGI parsing, the URL always passes, even for something like "http://www.foo.com/script.pl?alpha", which I'm pretty sure isn't valid. It probably has to do with those zero-width lookbehind thingies that I don't quite understand.
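One way around the conditional-lookbehind trouble is to drop the conditionals entirely and make the whole query string a single optional group that, when present, must contain at least one key=value pair. Sketched here in PHP/PCRE rather than Perl (both use the same regex flavour for this), and only as a rough illustration:

```php
<?php
// Make the query string optional as a whole, but well-formed when
// present: a "?" must be followed by at least one key=value pair,
// with further pairs joined by "&".
function valid_with_optional_query($url)
{
    $pattern = '#^http://[\w.-]+(/[\w./-]*)*'
             . '(\?\w+=\w*(&\w+=\w*)*)?$#';
    return preg_match($pattern, $url) === 1;
}
```

With this structure, "http://www.foo.com/script.pl?alpha" is rejected, because once a "?" appears the key=value shape is mandatory.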

vharishankar 07-19-2005 09:29 AM

Quote:

If your web hosting provider does not allow fopen url's,
then it is not a good provider anyway, avoid it too ;)
It's not a case of that. Since I'm planning to release it under the GPL and allow anybody to use it, I thought it should be more generic. ;)

keefaz 07-19-2005 09:47 AM

For my part, I'd put an indication that my program requires
allow_url_fopen = On in php.ini.

Some PHP packages require PEAR, some require PHP compiled
and installed as CGI, some require the GD library, etc...

vharishankar 07-19-2005 10:14 AM

OK. In any case I need to check that the http:// is part of the URL, because otherwise the <a href=""> tag in HTML treats it as a relative URL.
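A common way to handle that, assuming the input should always be treated as absolute, is simply to prepend the scheme when it is missing. A minimal sketch:

```php
<?php
// Prepend "http://" when the input lacks it, so that <a href="...">
// is never interpreted as a relative URL.
function ensure_http_prefix($url)
{
    if (preg_match('#^http://#i', $url) !== 1) {
        return 'http://' . $url;
    }
    return $url;
}
```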

hlyrad 07-21-2005 06:02 PM

Here's a lovely little regexp already submitted five years ago. It takes a URL and changes it into a link.
Check it out.
http://aspn.activestate.com/ASPN/Coo...x/Recipe/59864

vharishankar 07-21-2005 09:17 PM

That is a monster of a regular expression. :eek: I really don't need that level of strictness in any case.

Moreover, my aim is not to convert URLs within text, just to validate the format of a URL that was input.

Thanks for the link, though. I'll see if I can use that.
