LinuxQuestions.org
Old 08-23-2019, 05:31 PM   #1
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,032

Rep: Reputation: 76
parsing passwords with regex in javascript


Hi,
I'm taking this linuxacademy lab test where I'm testing parsing passwords. I'm using positive lookaheads for it, as this is what I've seen being used. This is the relevant line of the JS code:
Code:
var password_val = /^(?=[^a-z]*[a-z])(?=[^\d]*\d)/;
I'm actually having a hard time understanding the whole concept. I know what lookbehinds/lookaheads are, but I don't understand at all how this regex can actually work. Under normal circumstances, I'd use them simply to match a pattern that has a certain string of characters ahead of or behind it, and that's that, but when parsing passwords, for example, this becomes much more abstract.

First of all, what I've inferred from this regex is that different groups (with brackets) work in this case in any order, which I find intriguing. So if I had written the digit group before the lowercase group, I'd have got the same result.

Of course, this: [^a-z]*[a-z] means any number of non-lowercase alphabetical characters (including 0) that would end in an alphabetical lowercase character. But what I don't understand is how you infer from this that you have to have at least one lowercase character.

What I don't understand is why we really need to use [^a-z] and not any other character for that matter, such as .*
I've actually already tested it with .* and it works just like before.
So can anyone break it down for me and explain the rationale behind it?

Thanks!
 
Old 08-23-2019, 09:24 PM   #2
Skaperen
Senior Member
 
Registered: May 2009
Location: WV, USA
Distribution: Xubuntu, Slackware, Amazon Linux
Posts: 1,903
Blog Entries: 21

Rep: Reputation: 125
i didn't know passwords needed to be parsed. what if my password has a syntax error? is this happening when someone is setting or changing their password or when they are just typing it in to login?
 
Old 08-23-2019, 10:27 PM   #3
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,498

Rep: Reputation: 1806
Quote:
Originally Posted by vincix
First of all, what I've inferred from this regex is that different groups (with brackets) work in this case in any order
It's not that groups work in any order; it's specifically lookaheads/lookbehinds that work in any order. That's because they match without consuming any characters.
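
A quick way to see this for yourself is the sketch below. The swapped pattern and the sample strings are made up for illustration; only the first regex comes from the original post. It checks that both orderings give the same verdict and that a successful match consumes nothing.

```javascript
// Pattern from the original post: at least one lowercase letter and one digit.
const original = /^(?=[^a-z]*[a-z])(?=[^\d]*\d)/;
// Hypothetical variant with the two lookaheads swapped.
const swapped = /^(?=[^\d]*\d)(?=[^a-z]*[a-z])/;

// Made-up sample inputs.
const samples = ["AxABC", "AxABC1", "1234", "9z", ""];

// Each entry is [input, verdict of original, verdict of swapped].
const verdicts = samples.map((s) => [s, original.test(s), swapped.test(s)]);
console.log(verdicts);

// Lookaheads are zero-width: the overall match is the empty string at
// index 0, even though the lookaheads inspected characters further along.
const match = original.exec("AxABC1");
console.log(JSON.stringify(match[0]), match.index);
```

Both columns should agree for every input, which is why writing the digit lookahead first makes no difference.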

Quote:
Of course, this: [^a-z]*[a-z] means any number of non-lowercase alphabetical characters (including 0) that would end in an alphabetical lowercase character. But what I don't understand is how you infer from this that you have to have at least one lowercase character.
If it ends in a lowercase character, then there must be at least that one lowercase character.

Quote:
What I don't understand is why we really need to use [^a-z] and not any other character for that matter, such as .*
It can work, but it's less efficient. Suppose you had AxABC: the match would first try .* against the whole string AxABC, then see that [a-z] doesn't match at the end of the string, so it would try again with .* matching just AxAB, but [a-z] doesn't match C, and so on until it tries .* matching A and [a-z] matching x.
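
The walk-through above can be reproduced with a small hand-written simulation. This is only a simplified model of a backtracking engine for a pattern of the shape X*Y anchored at the start of the string; traceGreedy and its helper predicates are hypothetical names, and real engines apply optimisations that this sketch ignores.

```javascript
// Simplified model: X* grabs as many characters as it can, then gives them
// back one at a time until Y matches the character that follows.
function traceGreedy(input, repeatable, next) {
  let longest = 0;
  while (longest < input.length && repeatable(input[longest])) {
    longest++;
  }
  const attempts = [];
  for (let len = longest; len >= 0; len--) {
    const ch = input[len]; // undefined when X* has swallowed the whole string
    const ok = ch !== undefined && next(ch);
    const shown = ch === undefined ? "end of string" : JSON.stringify(ch);
    attempts.push(`X* = ${JSON.stringify(input.slice(0, len))}, Y vs ${shown}: ${ok ? "match" : "fail"}`);
    if (ok) break;
  }
  return attempts;
}

const isLower = (c) => c >= "a" && c <= "z"; // stands in for [a-z]
const notLower = (c) => !isLower(c);         // stands in for [^a-z]
const anyChar = (c) => c !== "\n";           // rough stand-in for .

const dotStarTrace = traceGreedy("AxABC", anyChar, isLower);  // .*[a-z]
const negatedTrace = traceGreedy("AxABC", notLower, isLower); // [^a-z]*[a-z]

console.log(dotStarTrace.join("\n"));
console.log(negatedTrace.join("\n"));
```

In this model the .* version needs five attempts before it lands on x, while the negated class stops consuming at x and succeeds on its first attempt.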
 
Old 08-24-2019, 06:36 AM   #4
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,032

Original Poster
Rep: Reputation: 76
@Skaperen, how do you mean a syntax error in a password? I'm not sure if that's the proper way of describing whatever might be wrong with a password, but maybe I'm mistaken. The password might contain characters unidentified by the application (let's say the backend) that is going to process it, but I believe that's not a syntax error.

In my case, this is happening when someone creates a new account, but it's only HTML/CSS/JS code; there's no actual backend, and the JS errors out with a pop-up message when the password doesn't fulfil certain criteria (the regex).

If you don't use some sort of parsing, then how else would you be able to enforce password policies? This isn't a rhetorical question, I am actually curious. Whatever the answer to that is, from what I've searched on the internet, regex parsing with lookaheads seems to be one of the accepted ways of doing it.

@ntubski Thank you for your answer. It so happened that not long after I wrote the post I realised how the regex is parsed, and I find it intriguing and really interesting - the fact that the regex engine doesn't move from the ^ (the first character) and tests every lookahead. It's a completely new way for me of thinking about regex.

I've finally got your very helpful explanation regarding the way the engine would parse a string such as "AxABC", so it's parsing it from right to left, going through one character at a time, until it finds a match.
But I still don't really understand how the parsing is going to be done if using [^a-z] instead of .*
It would check for any number of non-lowercase characters and then one lowercase character at the end. So would that mean that it starts with "A", which doesn't match, then it would go on to "Ax", which matches, so that means that it would do only one iteration?
If that's correct, then I don't see how that is more efficient, because the "x" could have been placed more to the right/at the end of the string, right?

Last edited by vincix; 08-24-2019 at 06:39 AM.
 
Old 08-24-2019, 05:25 PM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,498

Rep: Reputation: 1806
Quote:
Originally Posted by vincix
It would check for any number of non-lowercase characters and then one lowercase character at the end. So would that mean that it starts with "A", which doesn't match, then it would go on to "Ax", which matches, so that means that it would do only one iteration?
If that's correct, then I don't see how that is more efficient, because the "x" could have been placed more to the right/at the end of the string, right?
If the only lowercase letter is right at the end of the string, then there is not much difference, yes. This regex is simple enough that it doesn't really make much difference either way. But if you extend it to 2 letters, [^a-z]*[a-z][^a-z]*[a-z] vs .*[a-z].*[a-z], the latter can take O(n^2) time, because it requires repeating the backtracking for the second .* for each backtrack of the first.
 
Old 08-30-2019, 08:02 AM   #6
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Slack, Debian, Mint, Puppy, Raspbian
Posts: 3,462

Rep: Reputation: 218
As a pretty seasoned user of REs:

... to be honest, it's a damn sight easier to parse in many steps than to try to be a clever boy and
do it all in one line. Maybe: check for funny chars, check length, check rude words, etc.,
unless you need to save every nanosecond.
Then you can comment it, others can understand it, and it's less likely to be wrong.

And primarily, when you look at it 6 months later you don't go "WTF was I doing here?"
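
A rough sketch of that step-by-step approach, for illustration only: the function name, the minimum length, the banned-word list, and the printable-ASCII rule are all made-up assumptions, since the thread does not specify an actual policy.

```javascript
// Hypothetical policy values; nothing in the lab or the thread defines them.
const MIN_LENGTH = 8;
const BANNED_WORDS = ["password", "letmein"];

// Returns a list of human-readable problems; an empty list means "accepted".
function checkPassword(password) {
  const problems = [];

  // Step 1: length.
  if (password.length < MIN_LENGTH) {
    problems.push("too short");
  }
  // Step 2: the two requirements from the original regex, one per line.
  if (!/[a-z]/.test(password)) {
    problems.push("needs a lowercase letter");
  }
  if (!/\d/.test(password)) {
    problems.push("needs a digit");
  }
  // Step 3: "funny chars", assumed here to mean anything outside printable ASCII.
  if (/[^\x20-\x7E]/.test(password)) {
    problems.push("contains unsupported characters");
  }
  // Step 4: banned words, compared case-insensitively.
  const lowered = password.toLowerCase();
  if (BANNED_WORDS.some((word) => lowered.includes(word))) {
    problems.push("contains a banned word");
  }
  return problems;
}

console.log(checkPassword("abc"));
console.log(checkPassword("Password123"));
console.log(checkPassword("x7Qmflrp2"));
```

Each rule is one short, commented test, and the caller can tell the user exactly which rule failed instead of a single yes/no from one big regex.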
 
Old 08-31-2019, 01:52 AM   #7
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 12,307
Blog Entries: 9

Rep: Reputation: 3309
^ Wise words.
Also see: Master Foo and the Programming Prodigy
 
Old 09-01-2019, 03:28 PM   #8
boughtonp
Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 32

Rep: Reputation: 1
The answer to parsing passwords with regex and any other attempt to implement such complexity rules: DO NOT DO THIS.

Use an existing, well-tested strength meter like zxcvbn, which achieves the goal far better.



The rest of this post is in the context of "I'm taking this linuxacademy lab test [and] having a hard time understanding..."

There are situations where this multi-lookahead construct is useful (particularly when you don't have the luxury of multiple simple patterns, like in an IDE), so it is worth understanding and keeping in mind.


Quote:
Originally Posted by vincix
I've finally got [ntubski's] very helpful explanation regarding the way the engine would parse a string such as "AxABC", so it's parsing it from right to left, going through one character at a time, until it finds a match.
It's not parsing from right to left - regex engines work left to right, but when using a greedy quantifier they consume as much as possible (going from left to right) - in the case of .* that means the entire remainder of the string/line - and then, if the next regex section fails to match, it backtracks one atom at a time.

(Where "atom" is usually a single character, but may be an atomic group or the result of a possessive quantifier; constructs not available in JS regex.)

It is almost always better to replace .* with a more specific expression (often the negation of the following character set, as already suggested).

If you know the expected match is closer to the end of the string than the current position, .* can be more efficient - when regex performance matters, it's always a case of considering best/worst/common inputs and adapting the pattern as appropriate (with a suitable comment if the reason for something is not obvious).
 
  



All times are GMT -5.
