LinuxQuestions.org - [SOLVED] Character class exclusion


I have Work1, a file of English words.

As a learning exercise I coded this ...

Code:

echo "Find 5-character words which have the same letter in positions 1 and 3."

echo "  Examples: fifth, mamma, sassy, total."

egrep '^(.).\1..$' $Work1 >$OutFile

... and it works.

To make the exercise more interesting I coded this ...

Code:

echo "Find 5-character words which have the same letter in positions 1 and 3"

echo "  --and-- the character in positions 1 and 3 is not used elsewhere."

echo "  Examples: fifth, total."

egrep '^(.)[^\1]\1[^\1][^\1]$' $Work1 >$OutFile

... and it produces an OutFile identical to the first exercise. Evidently using [^\1] to exclude a specific character from a character class is not doing the job.

Please advise.

Daniel B. Martin

I found a solution but it is a little out there. Also my reference was posted over 5 years ago so there may be an alternative now.
Anyhoo, here is what worked:

Code:

grep -P '^(.)(?:(?!\1).)\1(?:(?!\1).)(?:(?!\1).)$' word_file

It seems you cannot negate a back-reference inside a bracket expression (there, \1 is just a literal backslash and the digit 1), but you can negate it with a look-ahead. You will notice I have also switched from -E (what you are using) to -P for Perl regular expressions, which support look-aheads.
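To illustrate the point above, here is a small sketch, assuming GNU grep with -P support; the sample words and variable names are made up for the demonstration. It suggests that [^\1] only excludes a backslash and the digit 1, while the look-ahead form really excludes the captured letter.

```shell
# Assumption: GNU grep (back-references under -E and the -P option are GNU extensions).

# "mamma" repeats its first letter in position 4, yet the bracket form accepts it.
bracket_hit=$(printf 'mamma\n' | grep -E '^(.)[^\1]\1[^\1][^\1]$' || true)

# The bracket form does reject a literal digit 1 in position 2.
digit_hit=$(printf 'a1a\n' | grep -E '^(.)[^\1]\1$' || true)

# The look-ahead form rejects "mamma" and keeps "total".
lookahead_hit=$(printf 'mamma\ntotal\n' | grep -P '^(.)(?:(?!\1).)\1(?:(?!\1).)(?:(?!\1).)$' || true)

echo "bracket form:    ${bracket_hit:-<nothing>}"
echo "digit test:      ${digit_hit:-<nothing>}"
echo "look-ahead form: ${lookahead_hit:-<nothing>}"
```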

I will be interested to see if there is an alternative or even a way to shorten the current solution :)

Well, this is not shorter, but (IMO) more readable (basically a straightforward translation of each condition into awk):

Code:

awk -F '' 'NF == 5 && $1 == $3 && !index($2 substr($0, 4), $1)' input
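One caveat with the one-liner above: splitting every character into its own field with an empty field separator is, as far as I know, a gawk behaviour that POSIX leaves unspecified. A rough sketch of the same test using only substr() and index(), which should work in any awk, might look like this (the function name same_1_3_only is just a placeholder):

```shell
# Hypothetical helper: prints 5-character lines whose 1st and 3rd characters
# match and whose 2nd, 4th and 5th characters differ from that character.
same_1_3_only() {
    awk 'length($0) == 5 {
        c = substr($0, 1, 1)                        # character in position 1
        rest = substr($0, 2, 1) substr($0, 4)       # positions 2, 4 and 5
        if (substr($0, 3, 1) == c && index(rest, c) == 0)
            print
    }' "$@"
}

printf 'fifth\nmamma\nsassy\ntotal\nhello\n' | same_1_3_only
```

It reads standard input when called without arguments, or the named files otherwise.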

You can make it shorter by removing the (?: ... ) groups:

Code:

grep -P '^(.)(?!\1).\1(?!\1).(?!\1).$' words

Quote:

Originally Posted by keefaz (Post 5593826)

Code:

grep -P '^(.)(?!\1).\1(?!\1).(?!\1).$' words

Concise, correct, and over my head! Please elaborate and explain.

Daniel B. Martin

Quote:

Originally Posted by perlre

Lookaround assertions are zero-width patterns which match a specific pattern without including it in $& . Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.

Code:

(?!pattern)

A zero-width negative lookahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar".

If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. Use lookbehind instead (see below).

http://perldoc.perl.org/perlre.html#Extended-Patterns > Lookaround Assertions
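To see the quoted perlre examples in action, here is a quick sketch, assuming GNU grep with PCRE support (-P); the test strings are invented for illustration. The third command uses the negative look-behind that the documentation points to.

```shell
# foo(?!bar): "foo" not followed by "bar" -> only foobaz qualifies.
ahead=$(printf 'foobar\nfoobaz\n' | grep -P 'foo(?!bar)' || true)

# (?!foo)bar: the look-ahead is tested where "bar" starts, so it never sees
# "foo" and both lines match (count of matching lines).
wrong=$(printf 'foobar\nbazbar\n' | grep -cP '(?!foo)bar' || true)

# (?<!foo)bar: negative look-behind, "bar" not preceded by "foo".
behind=$(printf 'foobar\nbazbar\n' | grep -P '(?<!foo)bar' || true)

echo "look-ahead:        $ahead"
echo "misused, matches:  $wrong"
echo "look-behind:       $behind"
```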

Code:

grep -P '^(.)(?!\1).\1(?!\1).(?!\1).$' words

Decomposed:

Code:

grep -P       enable Perl-compatible regular expressions

^(.)          matches any character at start of line and captures it as \1

(?!\1)        zero-width assertion: the next character must not be \1 (consumes nothing, no capture)

.             matches any character (which cannot be \1, because of the assertion just before it)

\1            matches the character captured in \1

(?!\1)        zero-width assertion: the next character must not be \1 (consumes nothing, no capture)

.             matches any character (which cannot be \1, because of the assertion just before it)

(?!\1)        zero-width assertion: the next character must not be \1 (consumes nothing, no capture)

.             matches any character (which cannot be \1, because of the assertion just before it)

$             end of line
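As a small check that the look-ahead in the decomposition consumes nothing, the sketch below (assuming GNU grep, whose -o option prints only the matched text) runs just the first three pieces of the pattern. The variable names and test strings are only for illustration.

```shell
# '^(.)(?!\1).' matches exactly two characters: the look-ahead adds no width.
consumed=$(printf 'total\n' | grep -oP '^(.)(?!\1).' || true)

# When position 2 repeats position 1, the assertion fails and nothing matches.
rejected=$(printf 'ttaal\n' | grep -oP '^(.)(?!\1).' || true)

echo "matched text: ${consumed:-<nothing>}"
echo "rejected:     ${rejected:-<nothing>}"
```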

And a little more tidy up :)

Code:

grep -P '^(.)(?!\1).\1((?!\1).){2}$' word_file
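For anyone wanting to convince themselves that the three variants posted in this thread agree, here is a quick comparison sketch, assuming GNU grep with -P. The word list is a made-up sample (the thread's four examples plus "rarer", which repeats its first letter in position 5).

```shell
sample() { printf 'fifth\nmamma\nsassy\ntotal\nrarer\n'; }

# Original version with non-capturing groups.
long=$(sample | grep -P '^(.)(?:(?!\1).)\1(?:(?!\1).)(?:(?!\1).)$' || true)

# Version without the (?: ... ) groups.
short=$(sample | grep -P '^(.)(?!\1).\1(?!\1).(?!\1).$' || true)

# Tidied version with the {2} repetition.
tidy=$(sample | grep -P '^(.)(?!\1).\1((?!\1).){2}$' || true)

if [ "$long" = "$short" ] && [ "$short" = "$tidy" ]; then
    echo "all three patterns agree:"
    echo "$tidy"
else
    echo "patterns disagree"
fi
```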

Thank you to keefaz and grail for thoughtful contributions. This thread is marked SOLVED!

Daniel B. Martin