Regular expression to match 4 or more alpha characters

sixerjman · 10-08-2006, 11:54 PM

I'm trying to test for 4 or more alphabetic characters. Using regex with the 'expr' command, I can get 4 or more as long as they are at the beginning of the string as follows:

Code:

expr match "abcdef123" '\([a-z|A-Z]*\)\{4,\}'

But if I add any non-alpha character at the beginning of the string the match fails. From what I read the default for 'expr match' is to start matching from the beginning of the string.

Can I change this default behavior to match anywhere in the string? If not, is there another way to extract 4 or more alpha characters using regular expressions? Thanks in advance!

zhangmaike · 10-09-2006, 12:56 AM

First of all, your original regular expression '\([a-z|A-Z]*\)\{4,\}' will match alphabetic character strings of ANY length (not just greater than 4) because of the * character.

After fixing that, all you need to do is add something to the regular expression that will match the beginning part of the string: .*

The following:

Code:

expr match "1a2abcd4ef123" '.*\([a-z|A-Z]\{4,\}\)'

Finds abcd in the string "1a2abcd4ef123".

Although a match guarantees the presence of a string of 4 or more alphabetic characters, the above expression won't return that whole string - it will just return the last string of 4 letters (since any letters preceding those final 4 will be matched by .*).

So the above expression only finds defg in the string "abcdefg" ... which is probably still not what you're looking for.
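
A short sketch of the greedy behaviour described above, using the portable `expr STRING : REGEX` form (GNU expr documents `expr match STRING REGEX` as the same operation). The sample strings are the ones from this post; the variable names are only illustrative, and the bracket expression is written without the `|`, which inside `[...]` is just a literal character.

```shell
# .* is greedy, so the captured group is the LAST run of 4+ letters,
# and within a longer run only its final 4 letters.
first=$(expr "1a2abcd4ef123" : '.*\([a-zA-Z]\{4,\}\)')
second=$(expr "abcdefg" : '.*\([a-zA-Z]\{4,\}\)')
echo "$first"    # abcd
echo "$second"   # defg

# For a plain yes/no test, drop the group and check the exit status.
if expr "1a2abcd4ef123" : '.*[a-zA-Z]\{4,\}' >/dev/null; then
    has_run=yes
else
    has_run=no
fi
echo "$has_run"  # yes
```

If only the presence of such a run matters, the exit-status form avoids the question of which 4 letters get returned.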

ghostdog74 · 10-09-2006, 02:29 AM

Quote:

Originally Posted by sixerjman

I'm trying to test for 4 or more alphabetic characters.

Sometimes, there's no need to use regexp.
If you have Python:

Code:

#!/usr/bin/python
total = 0
s = "abcdef123"
for ch in s:
    if ch.isalpha():  # count each alphabetic character
        total = total + 1
if total >= 4:
    print("There are at least 4 alphabetic chars")

sixerjman · 10-09-2006, 11:55 AM

I am trying to work strictly within bash here so thanks for the python suggestion but it's N/A in my case (C would also work but...).

I think the expr match does return the string:

Code:

dbrazziel@emach433:~/scripts$ expr match "1a2abcd4ef123" '.*\([a-z|A-Z]\{4,\}\)'
abcd

The string I'll be working with is only 8 characters long,
so I need a way to get strings of length 4 through 8, and I think I can do that with a combination of 'expr match' and 'expr index'.

I'll let you know how I come out. Thanks again!

soggycornflake · 10-09-2006, 12:59 PM

Quote:

The string I'll be working with is only 8 characters long,
so I need a way to get strings of length 4 through 8,

There's no need to use regex just to match strings of a particular length, the ${#...} operator returns the length of the parameter.

Code:

str="abcdef123"

if [[ ${#str} -gt 4 ]]; then
        ....
fi

sixerjman · 10-09-2006, 04:08 PM

Ghostdog, I loved that movie! :-)

OK, I think I will trim the longest non-alpha strings from the front and back of the string to begin with, that should simplify things a little. Here's what I have come up with so far:

Code:

index=$(expr index $pw [[:alpha:]])             # Get index of first alpha char
pwa=$pw                                         # copy password string to intermediate string
while [ $index -ne 1 ]
do
    pwa=${pwa#[^[:alpha:]]}                     # alpha string
    index=$(expr index $pwa [[:alpha:]])
done
echo "After stripping leading non-alpha characters, pwa = $pwa."

# Trim trailing non-alpha characters from pwa
pwa=${pwa%%[^[:alpha:]]*}
echo "After trimming trailing non-alpha characters, pwa = $pwa."

sixerjman · 10-09-2006, 04:14 PM

Thanks soggy. To clarify, the string won't be exactly 8 characters, it must be at least 8 characters (I made a mistake in my earlier post). Also, the alpha characters may appear anywhere in the string, interspersed with digits (0-9), and special characters (@, #, $, %, &, *, +, -, or =).

sixerjman · 10-09-2006, 05:04 PM

The '*' at the end of the construct I posted earlier would delete alpha character(s) at the end of the string. Consider:

Code:

string="abcd32+a";string=${string%%[^[:alpha:]]*};echo $string;
abcd

So I changed the '*' after the RE to '\+' which would specify at least one occurrence of non-alpha characters to be removed:

Code:

string="abcd32+a";string=${string%%[^[:alpha:]]\+};echo $string;
abcd32+a

OK, that taken care of, let's see what happens if there's no trailing alpha character. I expect(ed) the '32+' to be
the 'longest substring at the end of the string matching the RE':

Code:

string="abcd32+";string=${string%%[^[:alpha:]]\+};echo $string;
abcd3

Why was the 3 not removed?
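
A likely explanation, offered as a sketch rather than a definitive answer: the pattern in `${var%%pattern}` is a shell glob, not a regular expression, so `\+` is a literal plus sign and `[^[:alpha:]]\+` means "one non-alpha character followed by `+`". In "abcd32+" the only suffix of that shape is "2+", which is why the 3 survives. One portable way to trim a whole run is to compute the run with a nested expansion and then strip it as a literal string. The variable names below are illustrative.

```shell
string="abcd32+"

# ${string##*[[:alpha:]]} removes everything up to the last letter,
# leaving only the trailing non-alpha run ("32+").
trail=${string##*[[:alpha:]]}
# Strip that run as a literal (quoted) suffix.
trimmed=${string%"$trail"}
echo "$trimmed"      # abcd

# Same idea for a leading run: find it, then strip it as a literal prefix.
string2="12+abcd32+a"
lead=${string2%%[[:alpha:]]*}
stripped=${string2#"$lead"}
echo "$stripped"     # abcd32+a
```

In bash, `shopt -s extglob` also enables the pattern `+([^[:alpha:]])`, which matches one or more non-alpha characters and can be used directly after `%%` or `##`.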

sixerjman · 10-10-2006, 01:36 PM

Ugh, as soon as I entered a string without the character 'a', the script would loop endlessly because 'expr' evaluated the character class [[:alpha:]] as the string 'a'. I guess I should thoroughly read and understand the doc before starting to code (duh). Good way to really learn is to totally bollux something up lol

Well, back to the drawing board, planning some combination of the 'tr' command (i.e. tr -cd '[:alpha:]') to strip away non-alpha characters, then some substring matching with expr.

Onward...
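
For context: with GNU `expr index STRING CHARS`, CHARS is a set of literal characters, so `[[:alpha:]]` means any of `[`, `:`, `a`, `l`, `p`, `h`, `]` rather than a character class. Below is a sketch of getting the same information without `expr`, assuming the password is held in a variable `pw` (the sample value is made up for illustration).

```shell
pw='12+xyz9=q'    # sample value, assumed for illustration

# Position (1-based) of the first letter: length of the non-alpha prefix + 1.
prefix=${pw%%[[:alpha:]]*}
if [ "$prefix" = "$pw" ]; then
    index=0                       # no letters at all
else
    index=$(( ${#prefix} + 1 ))
fi
echo "$index"     # 4

# Delete every character that is not a letter.
letters=$(printf '%s' "$pw" | tr -cd '[:alpha:]')
echo "$letters"   # xyzq
```

Quoting `'[:alpha:]'` keeps the shell from treating it as a filename pattern before `tr` sees it.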

sixerjman · 10-11-2006, 05:42 PM

SUCCESS! Here's the script fragment:

Code:

# Check 4 character alphabetic strings against
# dictionary.  First, strip non-alpha characters from
# the string
pwa=$(echo "$pw" | tr -cd '[:alpha:]')          # Delete the complement of the alphabetic set
lenpwa=${#pwa}
wpwa=$pwa                                       # Work password
lenwpwa=$lenpwa                                 # Length of work password
if [ $lenpwa -ge 4 ]
then
    lenss=4                                     # Start substring matching
    ssindex=0
    pwaindex=0
    while [ $lenwpwa -ge 4 ]
    do
        while [ $lenss -le $lenwpwa ]
        do
            ss=${wpwa:$ssindex:$lenss}
            #echo $ss
            if grep -qx -- "$ss" "$DICTFILE"
            then
                echo "Invalid password:  $pw contains word '$ss'."
                echo "Passwords must not contain words found in the dictionary."
                exit $E_DATAERR
            fi
            let 'lenss+=1'
        done
        let 'pwaindex+=1'
        wpwa=${pwa:pwaindex}
        lenwpwa=${#wpwa}                        # New length of alpha string
        ssindex=0                               # reset index to substring
        lenss=4                                 # reset minimum length of substring
    done
fi

archtoad6 · 11-20-2006, 02:04 PM

Will this do what you want?

Code:

PWD=`echo $PW |egrep -o '[a-zA-Z]{4,}'`

sixerjman · 11-20-2006, 05:50 PM

Thanks, but the command you showed never advances through the entire string, just displays the longest alphabetic substring in the string.

archtoad6 · 11-20-2006, 09:01 PM

Are you sure?:

Code:

$ echo "abcde11werty" |egrep -o '[a-zA-Z]{4,}'
abcde
werty

sixerjman · 11-20-2006, 11:07 PM

Well, yeah, it extracts 4 character or longer strings but doesn't parse substrings within those strings. The script checks strings 4 characters or longer to see if they contain dictionary words. So while:

echo "abworde11werty" | egrep -o '[a-zA-Z]{4,}'

reports:

abworde
werty

It doesn't move forward through substring 'abworde' to 'bworde' then 'worde' where it would find 'word' within the initial string. The code I have above does that.

archtoad6 · 11-21-2006, 05:30 PM

Correct me if I'm wrong, I have a hard time reading, let alone copy-&-pasting, wide <code> posts in my browser,
but doesn't your code ignore "words" that are split by non-alpha characters?

I mean '12al34pha65' would have the "alpha" parsed out & objected to.

If this is so & you are cool w/ this behavior, do please document it.

Would this extension to my ideas work for you:

Code:

echo "$PW" | egrep -o '[a-zA-Z]{4,}' | fgrep -f "$DICT"

I have used fgrep before (actually the "-f" option) & it is amazingly fast. You might want to include it regardless.
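
A minimal sketch of the `-f` idea, with the dictionary used as a list of fixed-string patterns searched for inside the password's letters. Everything here is an assumption for illustration: the temporary word list stands in for the real dictionary file, `contains_dict_word` is a hypothetical helper, and entries shorter than 4 letters are filtered out first so that short words are not flagged. Like the script earlier in the thread, it strips non-alpha characters first, so '12al34pha65' would be treated as "alpha".

```shell
# Hypothetical sample dictionary; a real script would point at its own word list.
dict=$(mktemp)
printf '%s\n' word pass the cat > "$dict"

# Keep only entries of 4 or more letters (this also drops empty lines,
# which would otherwise match every input).
patterns=$(mktemp)
grep -E -x '[[:alpha:]]{4,}' "$dict" > "$patterns"

contains_dict_word() {
    # $1: password. Succeeds if any pattern occurs as a substring of its letters.
    printf '%s\n' "$1" | tr -cd '[:alpha:]\n' | grep -F -q -f "$patterns"
}

if contains_dict_word "abworde11werty"; then r1=rejected; else r1=ok; fi
if contains_dict_word "12qz+cat9"; then r2=rejected; else r2=ok; fi
echo "$r1 $r2"    # rejected ok

rm -f "$dict" "$patterns"
```

Because grep does the substring search itself, no loop over start positions and lengths is needed. Adding `-i` to the second grep would make the check case-insensitive.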