LinuxQuestions.org
10-08-2006, 11:54 PM   #1
sixerjman
Member
Registered: Sep 2004
Distribution: Debian Testing / Unstable
Posts: 166
Blog Entries: 1
Reputation: 30
Regular expression to match 4 or more alpha characters


I'm trying to test for 4 or more alphabetic characters. Using regex with the 'expr' command, I can get 4 or more as long as they are at the beginning of the string as follows:

Code:
expr match "abcdef123" '\([a-z|A-Z]*\)\{4,\}'
But if I add any non-alpha character at the beginning of the string, the match fails. From what I read, 'expr match' always starts matching from the beginning of the string.

Can I change this default behavior to match anywhere in the string? If not, is there another way to extract 4 or more alpha characters using regular expressions? Thanks in advance!
 
10-09-2006, 12:56 AM   #2
zhangmaike
Member
Registered: Oct 2004
Distribution: Slackware
Posts: 376
Reputation: 31
First of all, your original regular expression '\([a-z|A-Z]*\)\{4,\}' will match alphabetic character strings of ANY length, including the empty string (not just 4 or more), because of the * character.

After fixing that, all you need to do is add something to the regular expression that will match the beginning part of the string: .*

The following:

Code:
expr match "1a2abcd4ef123" '.*\([a-z|A-Z]\{4,\}\)'
Finds abcd in the string "1a2abcd4ef123".

Although a match guarantees the presence of a string of 4 or more alphabetic characters, the above expression won't necessarily return that whole string: it returns only the last 4 letters of the last such run (since any letters preceding those final 4 will be matched by the greedy .*).

So the above expression only finds defg in the string "abcdefg" ... which is probably still not what you're looking for.
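
If the goal is simply to pull out the first run of 4 or more letters, one alternative that stays within bash is the [[ ... =~ ... ]] operator (bash 3.0 and later), which is not anchored to the start of the string. A minimal sketch, where the function name first_alpha_run is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first run of 4 or more letters in $1.
# Returns 1 if there is no such run.
first_alpha_run() {
    local re='[[:alpha:]]{4,}'        # keep the regex in a variable to avoid quoting surprises
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[0]}"   # element 0 holds the whole match
    else
        return 1
    fi
}

first_alpha_run "1a2abcd4ef123"   # prints abcd
first_alpha_run "abcdefg"         # prints abcdefg
```

Because this kind of matching is leftmost-longest, the whole run comes back (abcdefg) rather than only its last four letters. Note also that a | inside a bracket expression is a literal pipe character, not alternation, so [a-z|A-Z] would accept a pipe as well; [a-zA-Z] or [[:alpha:]] avoids that.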

Last edited by zhangmaike; 10-09-2006 at 01:23 AM.
 
10-09-2006, 02:29 AM   #3
ghostdog74
Senior Member
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5
Reputation: 240
Quote:
Originally Posted by sixerjman
I'm trying to test for 4 or more alphabetic characters.
Sometimes, there's no need to use regexp.
If you have Python:
Code:
#!/usr/bin/python
total = 0
s = "abcdef123"
for ch in s:
    if ch.isalpha():  # test whether the character is alphabetic
        total = total + 1
if total >= 4:  # "4 or more", as asked in the question
    print("There are 4 or more alphabetic chars")
 
10-09-2006, 11:55 AM   #4
sixerjman
Original Poster
Thanks for the replies!

I am trying to work strictly within bash here, so thanks for the Python suggestion, but it's N/A in my case (C would also work but...).

I think the expr match does return the string:

Code:
dbrazziel@emach433:~/scripts$ expr match "1a2abcd4ef123" '.*\([a-z|A-Z]\{4,\}\)'
abcd
The string I'll be working with is only 8 characters long, so I need a way to get strings of length 4 through 8, and I think I can do that with a combination of 'expr match' and 'expr index'.

I'll let you know how I come out. Thanks again!
 
10-09-2006, 12:59 PM   #5
soggycornflake
Member
Registered: May 2006
Location: England
Distribution: Slackware 10.2, Slamd64
Posts: 249
Reputation: 31
Quote:
The string I'll be working with is only 8 characters long, so I need a way to get strings of length 4 through 8,
There's no need to use a regex just to match strings of a particular length; the ${#...} operator returns the length of the parameter.

Code:
str="abcdef123"

if [[ ${#str} -gt 4 ]]; then
        ....
fi
 
10-09-2006, 04:08 PM   #6
sixerjman
Original Poster
Will start by removing non-alpha chars at beginning and end

Ghostdog, I loved that movie! :-)

OK, I think I will trim the longest non-alpha strings from the front and back of the string to begin with; that should simplify things a little. Here's what I have come up with so far:

Code:
index=$(expr index $pw [[:alpha:]])             # Get index of first alpha char
pwa=$pw                                         # copy password string to intermediate string
while [ $index -ne 1 ]
do
    pwa=${pwa#[^[:alpha:]]}                     # alpha string
    index=$(expr index $pwa [[:alpha:]])
done
echo "After stripping leading non-alpha characters, pwa = $pwa."

# Trim trailing non-alpha characters from pwa
pwa=${pwa%%[^[:alpha:]]*}
echo "After trimming trailing non-alpha characters, pwa = $pwa."
 
10-09-2006, 04:14 PM   #7
sixerjman
Original Poster
Thanks soggy. To clarify, the string won't be exactly 8 characters, it must be at least 8 characters (I made a mistake in my earlier post). Also, the alpha characters may appear anywhere in the string, interspersed with digits (0-9), and special characters (@, #, $, %, &, *, +, -, or =).
 
10-09-2006, 05:04 PM   #8
sixerjman
Original Poster
Having problems with the '%%' form of substring removal

The '*' at the end of the construct I posted earlier would delete alpha character(s) at the end of the string. Consider:

Code:
string="abcd32+a";string=${string%%[^[:alpha:]]*};echo $string;
abcd
So I changed the '*' after the RE to '\+' which would specify at least one occurrence of non-alpha characters to be removed:

Code:
string="abcd32+a";string=${string%%[^[:alpha:]]\+};echo $string;
abcd32+a
OK, that taken care of, let's see what happens if there's no trailing alpha character. I expect(ed) the '32+' to be the 'longest substring at the end of the string matching the RE':

Code:
string="abcd32+";string=${string%%[^[:alpha:]]\+};echo $string;
abcd3
Why was the 3 not removed?
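
A likely explanation, offered as a sketch rather than a definitive answer: the pattern in ${var%%pattern} is a shell glob, not a regular expression, so '\+' is just a literal '+'. The pattern [^[:alpha:]]\+ therefore matches one non-alpha character followed by a plus sign (here '2+', which leaves 'abcd3'). One way to trim non-alpha characters from both ends using only plain globs is shown below; the function name trim_non_alpha is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip leading and trailing non-alpha characters from $1.
trim_non_alpha() {
    local s=$1
    # ${s%%[[:alpha:]]*} is whatever precedes the first letter; remove it from the front.
    s=${s#"${s%%[[:alpha:]]*}"}
    # ${s##*[[:alpha:]]} is whatever follows the last letter; remove it from the back.
    s=${s%"${s##*[[:alpha:]]}"}
    printf '%s\n' "$s"
}

trim_non_alpha "abcd32+"      # prints abcd
trim_non_alpha "12abcd32+a"   # prints abcd32+a
```

The inner expansions are double-quoted so that characters such as '+' or '*' in the password are removed literally instead of being read as pattern characters.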
 
10-10-2006, 01:36 PM   #9
sixerjman
Original Poster
Agonizing reappraisal: expr index operates strictly on (sub)strings, not regexes

Ugh, as soon as I entered a string without the character 'a', the script would loop endlessly. 'expr index' does not understand character classes: it treats its second argument as a literal set of characters, so [[:alpha:]] was never matching "any letter" at all. I guess I should thoroughly read and understand the doc before starting to code (duh). A good way to really learn is to totally bollux something up lol

Well, back to the drawing board. I'm planning some combination of the 'tr' command (i.e. tr -cd '[:alpha:]') to strip away non-alpha characters, then some substring matching with expr.

Onward...
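
For reference, the position of the first letter can also be found without 'expr index', using parameter expansion only. This is a minimal sketch under the assumption that a 1-based index (0 when there is no letter) is wanted, mirroring what 'expr index' reports; the function name first_alpha_index is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 1-based position of the first letter in $1,
# or 0 if $1 contains no letters.
first_alpha_index() {
    local s=$1
    local prefix=${s%%[[:alpha:]]*}   # everything before the first letter
    if [ "${#prefix}" -eq "${#s}" ]; then
        echo 0                        # nothing was removed, so there is no letter
    else
        echo $(( ${#prefix} + 1 ))
    fi
}

first_alpha_index "12+bcd"   # prints 4
first_alpha_index "1234"     # prints 0
```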
 
10-11-2006, 05:42 PM   #10
sixerjman
Original Poster
Regular expressions and substring extraction of 4 character alpha strings

SUCCESS! Here's the script fragment:

Code:
# Check 4 character alphabetic strings against
# dictionary.  First, strip non-alpha characters from
# the string
pwa=$(printf '%s' "$pw" | tr -cd '[:alpha:]')   # Delete the complement of the alphabetic set (quoted so '*' etc. are not expanded)
lenpwa=${#pwa}
wpwa=$pwa                                       # Work password
lenwpwa=$lenpwa                                 # Length of work password
if [ $lenpwa -ge 4 ]
then
    lenss=4                                     # Start substring matching
    ssindex=0
    pwaindex=0
    while [ $lenwpwa -ge 4 ]
    do
        while [ $lenss -le $lenwpwa ]
        do
            ss=${wpwa:$ssindex:$lenss}
            #echo $ss
            if grep -qx -- "$ss" "$DICTFILE"    # -q: rely on the exit status instead of the output
            then
                echo "Invalid password:  $pw contains word '$ss'."
                echo "Passwords must not contain words found in the dictionary."
                exit $E_DATAERR
            fi
            let 'lenss+=1'
        done
        let 'pwaindex+=1'
        wpwa=${pwa:pwaindex}
        lenwpwa=${#wpwa}                        # New length of alpha string
        ssindex=0                               # reset index to substring
        lenss=4                                 # reset minimum length of substring
    done
fi
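
The same sliding-window search can be written more compactly as a function. This is only a sketch under a few assumptions: the dictionary is a plain file with one word per line, the name find_dict_word is hypothetical, and grep -F is used so that each candidate is compared as a fixed string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first dictionary word of 4 or more letters
# found inside the alphabetic part of password $1. $2 is the word list.
# Returns 1 if no dictionary word is found.
find_dict_word() {
    local pw=$1 dict=$2
    local alpha start len
    alpha=$(printf '%s' "$pw" | tr -cd '[:alpha:]')   # keep letters only
    for (( start = 0; start + 4 <= ${#alpha}; start++ )); do
        for (( len = 4; start + len <= ${#alpha}; len++ )); do
            # -x: whole-line match, -F: fixed string, -q: exit status only
            if grep -qxF -- "${alpha:start:len}" "$dict"; then
                printf '%s\n' "${alpha:start:len}"
                return 0
            fi
        done
    done
    return 1
}
```

For example, with a word list containing 'word', the call find_dict_word 'ab1word3e' /path/to/list prints word, because the letters are first joined into 'abworde' and every substring of length 4 or more is then checked.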
 
11-20-2006, 02:04 PM   #11
archtoad6
Senior Member
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix
Posts: 4,727
Blog Entries: 15
Reputation: 230
Will this do what you want?
Code:
# Named PW_ALPHA rather than PWD, which the shell uses for the current directory
PW_ALPHA=$(echo "$PW" | egrep -o '[a-zA-Z]{4,}')
 
11-20-2006, 05:50 PM   #12
sixerjman
Original Poster
Thanks, but the command you showed never advances through the entire string, just displays the longest alphabetic substring in the string.
 
11-20-2006, 09:01 PM   #13
archtoad6
Are you sure?
Code:
$ echo "abcde11werty" |egrep -o '[a-zA-Z]{4,}'
abcde
werty
 
11-20-2006, 11:07 PM   #14
sixerjman
Original Poster
Well, yeah, it extracts 4 character or longer strings but doesn't parse substrings within those strings. The script checks strings 4 characters or longer to see if they contain dictionary words. So while:

echo "abworde11werty" | egrep -o '[a-zA-Z]{4,}'

reports:

abworde
werty

It doesn't move forward through substring 'abworde' to 'bworde' then 'worde' where it would find 'word' within the initial string. The code I have above does that.
 
11-21-2006, 05:30 PM   #15
archtoad6
Correct me if I'm wrong (I have a hard time reading, let alone copy-&-pasting, wide <code> posts in my browser), but doesn't your code ignore the non-alpha characters that split up "words"?

I mean '12al34pha65' would have the "alpha" parsed out & objected to.

If this is so & you are cool w/ this behavior, please do document it.


Would this extension to my ideas work for you:
Code:
echo "$PW" | egrep -o '[a-zA-Z]{4,}' | fgrep -f "$DICT"
I have used fgrep before (actually the "-f" option) & it is amazingly fast. You might want to include it regardless.
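
As a rough sketch of how the "-f" idea might be wired up (the function name contains_dict_word and the one-word-per-line dictionary format are assumptions, not something taken from the posts above): the runs of letters are extracted first, then every dictionary word is tried against them as a fixed string in a single grep call. Dictionary entries shorter than 4 letters are filtered out, since otherwise a one-letter entry such as 'a' would match almost anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) if any run of 4 or more letters
# in password $1 contains a word of 4 or more letters from word list $2.
contains_dict_word() {
    local pw=$1 dict=$2
    printf '%s\n' "$pw" |
        grep -Eo '[[:alpha:]]{4,}' |                  # one run of letters per line
        grep -qF -f <(grep -E '^.{4,}$' "$dict")      # patterns: dictionary words of length >= 4
}

# Example use:
# if contains_dict_word "$PW" "$DICT"; then echo "dictionary word found"; fi
```

Unlike the loop in post #10, this does not join letters across digits or symbols, so '12al34pha65' passes here but would be rejected there. Which behavior is wanted is a policy decision.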
 
  


All times are GMT -5.