Regular expression to match 4 or more alpha characters
I'm trying to test for 4 or more alphabetic characters. Using regex with the 'expr' command, I can get 4 or more as long as they are at the beginning of the string as follows:
Code:
expr match "abcdef123" '\([a-z|A-Z]*\)\{4,\}'
But if I add any non-alpha character at the beginning of the string the match fails. From what I read the default for 'expr match' is to start matching from the beginning of the string.
Can I change this default behavior to match anywhere in the string? If not, is there another way to extract 4 or more alpha characters using regular expressions? Thanks in advance!
First of all, your original regular expression '\([a-z|A-Z]*\)\{4,\}' will match alphabetic strings of ANY length, including zero (not just 4 or more), because of the * character. Also, inside a bracket expression the | is not alternation; it is a literal character, so [a-z|A-Z] matches a pipe as well. Use [a-zA-Z] instead.
After fixing that, all you need to do is add something to the front of the regular expression that will match the beginning part of the string: .*
The following:
Code:
expr match "1a2abcd4ef123" '.*\([a-zA-Z]\{4,\}\)'
finds abcd in the string "1a2abcd4ef123".
A match guarantees that the string contains a run of 4 or more alphabetic characters, but the expression won't return that whole run. Because .* is greedy, it swallows as much as it can, including any letters preceding the final 4, so the group captures only the last 4 letters of the last such run.
So the above expression only finds defg in the string "abcdefg" ... which is probably still not what you're looking for.
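If bash itself is available (3.0 or later), one way around the greedy .* problem is the [[ =~ ]] operator, which reports the leftmost, longest match in BASH_REMATCH. This is only a sketch, not the expr-based approach discussed above, and the function name first_alpha_run is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first run of 4 or more letters in $1.
# Returns non-zero if there is no such run.
first_alpha_run() {
    local re='[[:alpha:]]{4,}'      # POSIX ERE, kept in a variable to avoid quoting issues
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[0]}"
    else
        return 1
    fi
}

first_alpha_run "1a2abcd4ef123"     # prints abcd
first_alpha_run "abcdefg"           # prints abcdefg (the whole run, not just defg)
```

Since no .* prefix is needed, the whole run comes back instead of only its last 4 letters.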
Last edited by zhangmaike; 10-09-2006 at 01:23 AM.
I'm trying to test for 4 or more alphabetic characters.
Sometimes, there's no need to use regexp.
If you have Python:
Code:
#!/usr/bin/env python3
total = 0
s = "abcdef123"
for ch in s:
    if ch.isalpha():  # test whether the character is alphabetic
        total = total + 1
if total >= 4:
    print("There are 4 or more alphabetic chars")
I am trying to work strictly within bash here, so thanks for the Python suggestion, but it's N/A in my case (C would also work, but...).
I think the expr match does return the string:
Code:
dbrazziel@emach433:~/scripts$ expr match "1a2abcd4ef123" '.*\([a-z|A-Z]\{4,\}\)'
abcd
The string I'll be working with is only 8 characters long, so I need a way to get strings of length 4 through 8, and I think I can do that with a combination of 'expr match' and 'expr index'.
Will start by removing non-alpha chars at beginning and end
Ghostdog, I loved that movie! :-)
OK, I think I will trim the longest non-alpha strings from the front and back of the string to begin with; that should simplify things a little. Here's what I have come up with so far:
Code:
index=$(expr index $pw [[:alpha:]]) # Get index of first alpha char
pwa=$pw # copy password string to intermediate string
while [ $index -ne 1 ]
do
    pwa=${pwa#[^[:alpha:]]} # alpha string
    index=$(expr index $pwa [[:alpha:]])
done
echo "After stripping leading non-alpha characters, pwa = $pwa."
# Trim trailing non-alpha characters from pwa
pwa=${pwa%%[^[:alpha:]]*}
echo "After trimming trailing non-alpha characters, pwa = $pwa."
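As a possible alternative to the loop above, both ends can be trimmed with parameter expansion alone, without calling expr. A rough sketch; trim_non_alpha is a hypothetical name, and it assumes the goal is to keep everything between the first and the last letter:

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip leading and trailing non-alpha characters from $1.
trim_non_alpha() {
    local s=$1
    # ${s%%[[:alpha:]]*} is the leading non-alpha part; remove it as a prefix.
    s=${s#"${s%%[[:alpha:]]*}"}
    # ${s##*[[:alpha:]]} is the trailing non-alpha part; remove it as a suffix.
    s=${s%"${s##*[[:alpha:]]}"}
    printf '%s\n' "$s"
}

trim_non_alpha '12ab3cd45'    # prints ab3cd
```

Note that ${pwa%%[^[:alpha:]]*} cuts the string at the first non-alpha character it contains, so 'ab3cd45' would come out as 'ab'; that is why the sketch uses a different expansion for the trailing part.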
Thanks soggy. To clarify, the string won't be exactly 8 characters; it must be at least 8 characters (I made a mistake in my earlier post). Also, the alpha characters may appear anywhere in the string, interspersed with digits (0-9) and special characters (@, #, $, %, &, *, +, -, or =).
OK, with that taken care of, let's see what happens if there's no trailing alpha character. I expect(ed) the '32+' to be the 'longest substring at the end of the string matching the RE':
Agonizing reappraisal: expr index operates strictly on (sub)strings, not regex's
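Ugh, as soon as I entered a string without the character 'a', the script would loop endlessly. 'expr index' does not treat [[:alpha:]] as a character class; it takes its second argument as a plain list of characters and returns the position of the first one found, so it was only looking for the literal characters '[', ':', 'a', 'l', 'p', 'h' and ']'. I guess I should thoroughly read and understand the docs before starting to code (duh). A good way to really learn is to totally bollux something up lol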
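To make the 'expr index' behaviour concrete, here is a small demonstration, assuming GNU expr (the index operator is a GNU extension). The bash-only replacement first_alpha_index is a made-up helper, shown only as one possible way to get the position of the first letter:

```shell
#!/usr/bin/env bash
# GNU expr: "index STRING CHARS" gives the 1-based position of the first
# character of STRING that appears anywhere in CHARS, or 0 if none does.
# '[[:alpha:]]' is therefore just the character list [ : a l p h ]
expr index "12xyz" '[[:alpha:]]' || true     # prints 0 (expr exits non-zero when it prints 0)
expr index "12help" '[[:alpha:]]' || true    # prints 3 ('h' is in the list)

# Hypothetical helper: 1-based position of the first letter in $1, or 0.
first_alpha_index() {
    local s=$1
    local prefix=${s%%[[:alpha:]]*}          # everything before the first letter
    if [ "${#prefix}" -eq "${#s}" ]; then
        echo 0                               # no letter at all
    else
        echo $(( ${#prefix} + 1 ))
    fi
}

first_alpha_index "12xyz"                    # prints 3
```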
Well, back to the drawing board. I'm planning some combination of the 'tr' command (i.e. tr -cd '[:alpha:]' to strip away non-alpha characters) and then some substring matching with expr.
Regular expressions and substring extraction of 4 character alpha strings
SUCCESS! Here's the script fragment:
Code:
# Check 4 character (or longer) alphabetic strings against
# dictionary. First, strip non-alpha characters from
# the string
pwa=$(printf '%s' "$pw" | tr -cd '[:alpha:]') # Delete the complement of the alphabetic set
lenpwa=${#pwa}
wpwa=$pwa # Work password
lenwpwa=$lenpwa # Length of work password
if [ "$lenpwa" -ge 4 ]
then
    lenss=4 # Start substring matching
    ssindex=0
    pwaindex=0
    while [ "$lenwpwa" -ge 4 ]
    do
        while [ "$lenss" -le "$lenwpwa" ]
        do
            ss=${wpwa:$ssindex:$lenss}
            #echo $ss
            # -q: no output, -x: match whole lines only, -F: treat $ss as a fixed string
            if grep -qxF -- "$ss" "$DICTFILE"
            then
                echo "Invalid password: $pw contains word '$ss'."
                echo "Passwords must not contain words found in the dictionary."
                exit $E_DATAERR
            fi
            let 'lenss+=1'
        done
        let 'pwaindex+=1'
        wpwa=${pwa:pwaindex}
        lenwpwa=${#wpwa} # New length of alpha string
        ssindex=0 # reset index to substring
        lenss=4 # reset minimum length of substring
    done
fi
Well, yeah, it extracts 4 character or longer strings but doesn't parse substrings within those strings. The script checks strings 4 characters or longer to see if they contain dictionary words. So while:
echo "abworde11werty" | egrep -o '[a-zA-Z]{4,}'
reports:
abworde
werty
It doesn't move forward through substring 'abworde' to 'bworde' then 'worde' where it would find 'word' within the initial string. The code I have above does that.
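To illustrate the difference, here is a minimal sketch of the "move forward through the substrings" idea on its own, separate from the dictionary lookup. The function name substrings and its optional minimum-length argument are assumptions made for this example:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every substring of $1 that is at least
# $2 characters long (default 4), one per line.
substrings() {
    local s=$1 min=${2:-4} i len
    for (( i = 0; i + min <= ${#s}; i++ )); do            # start position
        for (( len = min; i + len <= ${#s}; len++ )); do  # substring length
            printf '%s\n' "${s:i:len}"
        done
    done
}

# 'abworde' yields abwo, abwor, ..., bwor, ..., word, worde, orde;
# grep -x then picks out the exact line 'word'.
substrings "abworde" | grep -x "word"    # prints word
```

Each candidate could then be checked against the dictionary file, which is what the nested while loops in the script fragment do.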
Correct me if I'm wrong (I have a hard time reading, let alone copy-&-pasting, wide <code> posts in my browser), but doesn't your code ignore the non-alpha characters that split "words"?
I mean, '12al34pha65' would have "alpha" parsed out & objected to.
If this is so & you are cool w/ this behavior, please do document it.
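The two behaviours can be compared side by side. This is just a sketch, assuming bash 3.0 or later; alpha_runs is a hypothetical helper that keeps runs of letters separate instead of joining them the way tr -cd does:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each unbroken run of 4 or more letters in $1.
alpha_runs() {
    local s=$1 re='[[:alpha:]]{4,}'
    while [[ $s =~ $re ]]; do
        printf '%s\n' "${BASH_REMATCH[0]}"
        s=${s#*"${BASH_REMATCH[0]}"}    # continue after the run just printed
    done
}

pw='12al34pha65'
printf '%s' "$pw" | tr -cd '[:alpha:]'; echo    # prints alpha (pieces joined)
alpha_runs "$pw"                                # prints nothing (no run of 4 letters)
alpha_runs 'abworde11werty'                     # prints abworde, then werty
```

Which behaviour is wanted depends on the password policy; joining the pieces is stricter, since it also rejects dictionary words that are broken up by digits or symbols.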