LinuxQuestions.org

danielbmartin 08-20-2016 11:32 AM

Character class exclusion
 
I have Work1, a file of English words.

As a learning exercise I coded this ...
Code:

echo "Find 5-character words which have the same letter in positions 1 and 3."
echo "  Examples: fifth, mamma, sassy, total."
egrep '^(.).\1..$' "$Work1" >"$OutFile"

... and it works.

To make the exercise more interesting I coded this ...
Code:

echo "Find 5-character words which have the same letter in positions 1 and 3"
echo "  --and-- the character in positions 1 and 3 is not used elsewhere."
echo "  Examples: fifth, total."
egrep '^(.)[^\1]\1[^\1][^\1]$' "$Work1" >"$OutFile"

... and it produces an OutFile identical to the first exercise. Evidently using [^\1] to exclude a specific character from a character class is not doing the job.

Please advise.

Daniel B. Martin

grail 08-20-2016 02:01 PM

I found a solution but it is a little out there. Also my reference was posted over 5 years ago so there may be an alternative now.
Anyhoo, here is what worked:
Code:

grep -P '^(.)(?:(?!\1).)\1(?:(?!\1).)(?:(?!\1).)$' word_file
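
It seems you cannot negate a back reference inside a bracket expression: there, \1 is not a back reference at all, so [^\1] just means "any character except a backslash or the digit 1". You can negate a look-ahead, though. You will notice I have also switched from -E (which is what egrep uses) to -P for Perl regular expressions, which support look-aheads.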

I will be interested to see if there is an alternative or even a way to shorten the current solution :)
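
A small sketch of the difference described above, assuming GNU grep built with -P support. The word list is a hypothetical sample, not the original poster's Work1 file, and the entry x1x23 is included only to show that [^\1] treats the digit 1 as a literal character.

```shell
# Hypothetical sample words (not the thread's Work1 file).
words='fifth
mamma
sassy
total
x1x23'

# ERE: inside brackets, \1 is just the characters '\' and '1'.
# mamma and sassy still get through; x1x23 is rejected because it contains '1'.
ere_result=$(printf '%s\n' "$words" | grep -E '^(.)[^\1]\1[^\1][^\1]$')

# PCRE: the negative lookahead really does exclude the captured character.
pcre_result=$(printf '%s\n' "$words" | grep -P '^(.)(?!\1).\1(?!\1).(?!\1).$')

echo "ERE matches:"
printf '%s\n' "$ere_result"
echo "PCRE matches:"
printf '%s\n' "$pcre_result"
```

With this sample the ERE version should keep mamma and sassy, while the PCRE version drops them and keeps x1x23.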

ntubski 08-20-2016 08:00 PM

Well, this is not shorter, but (IMO) more readable (basically a straightforward translation of each condition into awk):
Code:

awk -F '' 'NF == 5 && $1 == $3 && !index($2 substr($0, 4), $1)' input
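
As far as I know, splitting a line into single characters with -F '' is an extension (gawk does it) and POSIX leaves the behaviour of an empty field separator unspecified. The following is a sketch of the same three conditions written with substr() only, which should not depend on that extension. The word list is a hypothetical sample.

```shell
# Hypothetical sample words (not the thread's input file).
words='fifth
mamma
sassy
total
fifteen'

# Same conditions as the one-liner above, without relying on -F '':
#   1. the line is 5 characters long
#   2. characters 1 and 3 are equal
#   3. character 1 does not occur in positions 2, 4 or 5
awk_result=$(printf '%s\n' "$words" | awk '
    length($0) == 5 &&
    substr($0, 1, 1) == substr($0, 3, 1) &&
    !index(substr($0, 2, 1) substr($0, 4), substr($0, 1, 1))
')

printf '%s\n' "$awk_result"
```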

keefaz 08-21-2016 05:43 AM

You can make it shorter by removing the (?: ... ) groups
Code:

grep -P '^(.)(?!\1).\1(?!\1).(?!\1).$' words

danielbmartin 08-21-2016 11:13 AM

Quote:

Originally Posted by keefaz (Post 5593826)
Code:

grep -P '^(.)(?!\1).\1(?!\1).(?!\1).$' words

Concise, correct, and over my head! Please elaborate and explain.

Daniel B. Martin

keefaz 08-21-2016 11:58 AM

Quote:

Originally Posted by perlre
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $& . Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
Code:

(?!pattern)
A zero-width negative lookahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar".

If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. Use lookbehind instead (see below).

http://perldoc.perl.org/perlre.html#Extended-Patterns > Lookaround Assertions
Code:

grep -P '^(.)(?!\1).\1(?!\1).(?!\1).$' words
Decomposed:
Code:

grep -P     enable Perl-compatible regular expressions (PCRE)
^(.)        match any one character at the start of the line and capture it as \1
(?!\1)      zero-width negative lookahead: succeeds only if the next character is not \1; consumes nothing, captures nothing
.           match any one character (it cannot be \1 because of the lookahead just before it)
\1          match the captured character again
(?!\1)      same lookahead: the next character must not be \1
.           match any one character (not \1)
(?!\1)      same lookahead: the next character must not be \1
.           match any one character (not \1)
$           end of the line
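
To see the zero-width behaviour from the perlre excerpt in isolation, here is a minimal sketch using grep -o, which prints only the text the pattern itself matched. It assumes GNU grep with -P support; the variable names are made up for illustration.

```shell
# foo(?!bar): the lookahead checks what follows but is not part of the match,
# so -o prints just "foo", and only for the line where "bar" does not follow.
la_match=$(printf '%s\n' foobaz foobar | grep -oP 'foo(?!bar)')

# (?!foo)bar: at the position where "bar" starts, the next characters are
# "bar", not "foo", so the lookahead succeeds and "bar" in "foobar" matches.
la_misuse=$(printf '%s\n' foobar | grep -oP '(?!foo)bar')

echo "foo(?!bar) matched: $la_match"
echo "(?!foo)bar matched: $la_misuse"
```

The second case is the pitfall the excerpt warns about: a lookahead only inspects text after the current position, never before it.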


grail 08-21-2016 12:58 PM

And a little more tidy up :)
Code:

grep -P '^(.)(?!\1).\1((?!\1).){2}$' word_file
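
A quick way to check that the three variants posted in this thread agree is to run them over the same input. This is only a sketch on a hypothetical five-word sample, assuming GNU grep with -P support; it is not a proof of equivalence.

```shell
# Hypothetical sample words (not the thread's input file).
sample='fifth
mamma
sassy
total
tatty'

# Run each variant and put its matches on one line for easy comparison.
variant_output=$(
    for pattern in \
        '^(.)(?:(?!\1).)\1(?:(?!\1).)(?:(?!\1).)$' \
        '^(.)(?!\1).\1(?!\1).(?!\1).$' \
        '^(.)(?!\1).\1((?!\1).){2}$'
    do
        printf '%s\n' "$sample" | grep -P "$pattern" | tr '\n' ' '
        echo
    done
)

printf '%s\n' "$variant_output"
```

Note that the last variant uses a capturing group for the repeated part, so it also defines \2. That is harmless here because nothing refers to it.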

danielbmartin 08-21-2016 02:04 PM

Thank you to keefaz and grail for thoughtful contributions. This thread is marked SOLVED!

Daniel B. Martin


All times are GMT -5.