[SOLVED] Finding doubled letters with grep

danielbmartin · 09-22-2014, 07:19 AM

Have: a file which contains FirstName(TAB)LastName(TAB)blah(TAB)blahblah...

Quote:

HENRY COLE apple cherry
WOODROW JOHNSON potato tomato
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach

Want: to code grep to ...
a) find names which have a double letter anywhere in the name.
b) find names which have a double letter in the first name.
c) find names which have a double letter in the last name.
d) find names which have a double letter in both names (not necessarily the same letter).

"Double letter" means two adjacent characters are the same.
WILLIAM has one double letter, L.

This does "a" ...

Code:

cut -f1-2 $InFile |egrep '(.)\1'

This gets close to "d" ...

Code:

cut -f1-2 $InFile |egrep '(.)\1+.*(.)\2+'

How should "b", "c", and "d" be coded? Several attempts to introduce a tab character into the Regular Expression were unsuccessful.

Daniel B. Martin

grail · 09-22-2014, 09:32 AM

I think you are running afoul of the usual scenario where you are trying to use the wrong tool for the job.
You have already used two tools by including cut, so it is not a grep-only solution anyway. Use awk (or perl / ruby) and apply the same regexes you currently have to the individual columns.
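
One caveat with the awk route: as far as I know, awk regular expressions do not support back-references, so '(.)\1' cannot be reused unchanged there. Below is a minimal sketch of the per-column idea, with the doubled-letter test written as a small loop instead. The names pick_names, has_double and want, and the inline sample data, are made up for illustration.

```shell
# Sample data (first two columns only), tab-separated; illustration only.
sample=$(printf 'HENRY\tCOLE\nWOODROW\tJOHNSON\nHERMAN\tHOOPER\nWILLIAM\tKENNEDY\nPAUL\tBASSETT\n')

# pick_names CASE: print the input lines that satisfy case a, b, c or d.
pick_names() {
    awk -F '\t' -v want="$1" '
    # has_double(s): 1 if s contains two identical adjacent characters.
    function has_double(s,    i) {
        for (i = 1; i < length(s); i++)
            if (substr(s, i, 1) == substr(s, i + 1, 1))
                return 1
        return 0
    }
    {
        first = has_double($1)
        last  = has_double($2)
        if ((want == "a" && (first || last)) ||
            (want == "b" && first) ||
            (want == "c" && last) ||
            (want == "d" && first && last))
            print
    }'
}

case_d=$(printf '%s\n' "$sample" | pick_names d)
printf '%s\n' "$case_d"
```

Because each name is tested as its own field, the tab never has to appear inside a regular expression.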

firstfire · 09-22-2014, 09:37 AM

Hi, Daniel.

Case (b) -- double letters in the first name:

Code:

$ cat in
HENRY COLE apple cherry
WOODROW JOHNSON potato tomato
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach 
$ grep -E '^\w+(.)\1' in
WOODROW JOHNSON potato tomato
WILLIAM KENNEDY onion cabbage

Case (c) -- double letters in the last name:

Code:

$ grep -E '^\w+\s+\w+(.)\1' /tmp/in
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach

Case (d) -- double letters in both names:

Code:

$ grep -E '^\w+(.)\1\w*\s+\w+(.)\2' /tmp/in
WILLIAM KENNEDY onion cabbage

Or

Code:

$ grep -E '^(\w+(.)\2\w*\s+){2}' /tmp/in
WILLIAM KENNEDY onion cabbage

So, the trick is to anchor the RE to the beginning of the line using ^.
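
To see what the anchor buys, here is a small sketch that runs the case (b) pattern with and without ^. It assumes tab-separated sample data and uses [[:alpha:]] in place of \w (my substitution, so that it does not depend on the grep version).

```shell
# Sample data (first two columns only), tab-separated; illustration only.
sample=$(printf 'HENRY\tCOLE\nWOODROW\tJOHNSON\nHERMAN\tHOOPER\nWILLIAM\tKENNEDY\nPAUL\tBASSETT\n')

# Anchored: the pair must be reached from the start of the line through
# letters only, so it has to sit in the first name.
anchored=$(printf '%s\n' "$sample" | grep -E '^[[:alpha:]]+(.)\1')

# Unanchored: the match may start anywhere, so a double in the last name
# is accepted as well.
unanchored=$(printf '%s\n' "$sample" | grep -E '[[:alpha:]]+(.)\1')

printf 'anchored:\n%s\n\nunanchored:\n%s\n' "$anchored" "$unanchored"
```

With this data the anchored pattern keeps only WOODROW and WILLIAM, while the unanchored one also lets HERMAN HOOPER and PAUL BASSETT through.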

ntubski · 09-22-2014, 09:44 AM

Quote:

Originally Posted by danielbmartin

Several attempts to introduce a tab character into the Regular Expression were unsuccessful.

Bash has ANSI-C quoting which you could use for this, eg to solve b):

Code:

cut -f1-2 $InFile | grep -E $'^[^\t]*(.)\\1'

Or you could insert a literal tab, either using your text editor if it's a script, or Ctrl+V <tab> should work from the command line.
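
The same idea can be stretched to cases c) and d) without cut. This is only a sketch: it builds the tab with printf so it also runs in shells without ANSI-C quoting, and the variable names (tab, nt, re_b, re_c, re_d) and the AARON BURR row are made up for illustration. Using "any non-tab character" for the name also covers a name that starts with its doubled letter, which a pattern requiring at least one character before the pair would miss.

```shell
tab=$(printf '\t')      # a literal tab, built portably
nt="[^$tab]"            # bracket expression: any character except tab

# Sample data; AARON BURR is an added, hypothetical row.
sample=$(printf 'HENRY\tCOLE\tapple\tcherry\nWILLIAM\tKENNEDY\tonion\tcabbage\nPAUL\tBASSETT\tbanana\tpeach\nAARON\tBURR\tfig\tkale\n')

re_b="^$nt*($nt)\\1"                        # double in field 1
re_c="^$nt*$tab$nt*($nt)\\1"                # double in field 2
re_d="^$nt*($nt)\\1$nt*$tab$nt*($nt)\\2"    # double in both fields

case_b=$(printf '%s\n' "$sample" | grep -E "$re_b" | cut -f1-2)
case_c=$(printf '%s\n' "$sample" | grep -E "$re_c" | cut -f1-2)
case_d=$(printf '%s\n' "$sample" | grep -E "$re_d" | cut -f1-2)

printf 'b:\n%s\nc:\n%s\nd:\n%s\n' "$case_b" "$case_c" "$case_d"
```

Here cut only trims the output to the two name columns; the selection is done by grep alone. HENRY COLE is not selected even though "apple" has a double, because the patterns never look past the second field.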

rtmistler · 09-22-2014, 10:22 AM

If it was me I'd write a program here.

Since all is delimited by SPACE and structure is deterministic, it's pretty simple to determine if there are repeated characters; one swipe where I'd populate a list of flags per line indicating conditions a-d, inclusive.

Or not even produce flags I guess, just output the report as I parsed the data. The problem is very open ended with respect to what you do when you find conditions.
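
A rough sketch of that one-pass idea as a plain shell loop, assuming tab-separated input. The function name doubled, the flag letters and the report layout are all invented for illustration.

```shell
tab=$(printf '\t')
sample=$(printf 'HENRY\tCOLE\tapple\tcherry\nWOODROW\tJOHNSON\tpotato\ttomato\nHERMAN\tHOOPER\tlemon\tcelery\nWILLIAM\tKENNEDY\tonion\tcabbage\nPAUL\tBASSETT\tbanana\tpeach\n')

# doubled STRING: succeeds if STRING has two identical adjacent characters.
doubled() { printf '%s\n' "$1" | grep -Eq '(.)\1'; }

# One pass over the data: print each name followed by the cases it meets.
report=$(printf '%s\n' "$sample" |
    while IFS="$tab" read -r first last rest; do
        f=0; l=0
        doubled "$first" && f=1
        doubled "$last" && l=1
        flags=""
        if [ "$f" = 1 ] || [ "$l" = 1 ]; then flags="${flags}a"; fi
        if [ "$f" = 1 ]; then flags="${flags}b"; fi
        if [ "$l" = 1 ]; then flags="${flags}c"; fi
        if [ "$f" = 1 ] && [ "$l" = 1 ]; then flags="${flags}d"; fi
        printf '%s %s %s\n' "$first" "$last" "${flags:--}"
    done)

printf '%s\n' "$report"
```

A line that meets none of the conditions gets "-" in the flags column.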

If the OP wasn't around the forum for as long as they've been I'd wonder if this was some sort of homework problem. That is interesting too, someone posted a challenge thread about reputation and whether or not some are addicted to it. Having seen DBM's name around, plus the statistics and rep pretty much settled whether or not I'd really make that accusation. Here I'm assuming it's a parallel to a problem you're trying to solve, or you just playin' around with grep + regex.

danielbmartin · 09-22-2014, 11:20 AM

Quote:

Originally Posted by rtmistler

Here I'm assuming it's a parallel to a problem you're trying to solve, or you just playin' around with grep + regex.

Just "playin' around."

All my programming is recreational. I'm retired and dabble in Linux programming in an effort to stave off old-age brain rot. My best "teacher" is this forum, augmented by on-line tutorials and Google. I dream up interesting problems and solve them, as learning exercises. When stumped I post here. I do give Rep points to all who offer constructive responses.

Daniel B. Martin

danielbmartin · 09-23-2014, 10:38 AM

Hello firstfire.

Your solutions ...

Code:

Case (b) -- double letters in the first name:
$ grep -E '^\w+(.)\1' in

Case (c) -- double letter in the last name:
$ grep -E '^\w+\s+\w+(.)\1' /tmp/in

Case (d) -- double letters in both names:
$ grep -E '^\w+(.)\1\w*\s+\w+(.)\2' /tmp/in
$ grep -E '^(\w+(.)\2\w*\s+){2}' /tmp/in

... were plainly tested before you posted them, and they work. However, on my PC they produce empty output files. The code doesn't crash, it doesn't trigger error messages, it runs and outputs nothing.

I suspect this is because I am running Ubuntu 10.04, a back-level version.

Code:

daniel@daniel-desktop:~$ grep --version
GNU grep 2.5.4

I've tried to install later versions of Ubuntu and (so far) my computer pukes so I continue to limp along with 10.04. One of these days I will try again!

Daniel B. Martin

grail · 09-23-2014, 11:26 AM

I see nothing special about the regexes provided that would not work in an older version??
If you simply replace \w and \s with their character class counterparts, do you get output?

firstfire · 09-23-2014, 01:39 PM

Hi, Daniel.

I compiled grep v2.5.4 and can confirm that it does not work. After some research, it turned out that \s and \S were introduced only in a commit made about halfway to version 2.6. You may compile and install a more recent version of grep (I have v2.16), install a newer version of the OS, or use [[:space:]] (or something like [ \t\r], or just a space) instead of \s in your regexes.
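
As a sketch of the [[:space:]] substitution, here is case (c) rewritten with POSIX character classes only: [[:alnum:]_] stands in for \w and [[:space:]] for \s. The sample data is inlined for illustration, and I have not run this on grep 2.5.4 itself, so treat it as something to try rather than a confirmed fix.

```shell
sample=$(printf 'HENRY\tCOLE\tapple\tcherry\nWOODROW\tJOHNSON\tpotato\ttomato\nHERMAN\tHOOPER\tlemon\tcelery\nWILLIAM\tKENNEDY\tonion\tcabbage\nPAUL\tBASSETT\tbanana\tpeach\n')

# Case (c): doubled letter in the last name, without \w or \s.
case_c=$(printf '%s\n' "$sample" |
    grep -E '^[[:alnum:]_]+[[:space:]]+[[:alnum:]_]+(.)\1')

printf '%s\n' "$case_c"
```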

danielbmartin · 09-23-2014, 06:25 PM

More has been learned but new questions arise.

This code segment ...

Code:

echo; echo "firstfire solution for Case 'c', doubled letter in the last name."
grep -E '^\w+\s+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\t]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\x09]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[	]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"

... produced these results ...

Code:

firstfire solution for Case 'c', doubled letter in the last name.
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
HERMAN	HOOPER	lemon	celery
WILLIAM	KENNEDY	onion	cabbage
PAUL	BASSETT	banana	peach
End Of File

We see that my back-level grep understands \w but not \s. I tried three substitutes for \s thinking them equivalent, but they aren't.

Code:

Substituting [\t] didn't work.  Why?
Substituting [\x09] didn't work.  Why?
Substituting [  ] did work but is not readable.
(That was keyed as left bracket, tab, right bracket.)

Please show how [\t] can be made to work.

Daniel B. Martin

firstfire · 09-24-2014, 12:11 AM

Hi.

The fact that [ ] (a single space in square brackets) worked tells me that your data is actually SPACE delimited. Therefore TAB (\t) will not be matched as a delimiter. Note that putting a single character inside square brackets is usually (but not always) useless and equivalent to the character itself. [abc] matches any single character from the list: either a or b or c. Square brackets are useful when you want to match either space or tab: [ \t] (this is SPACE and TAB inside []).

To match a character by its code, use the -P flag (it is supported in v2.5.4).

So, it looks like you lost the TABs in your data, probably when copy-pasting it. To check, you may use hexdump or, better in this case, sed:

Code:

$ echo -e 'A \tB'
A 	B
$ echo -e 'A \tB' | sed -n 'l'
A \tB$

The 'l' command prints line in a visually unambiguous form: TAB becomes \t. $ means end of line.
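
Another quick check is simply to count the TABs on a line. A minimal sketch follows; the helper name count_tabs and the two test strings are made up for illustration.

```shell
# count_tabs LINE: print how many TAB characters LINE contains.
count_tabs() {
    # tr deletes everything except tabs; wc counts what is left.
    # The arithmetic expansion strips any padding that wc adds.
    echo $(( $(printf '%s' "$1" | tr -cd '\t' | wc -c) ))
}

with_tabs=$(count_tabs "$(printf 'HENRY\tCOLE\tapple\tcherry')")
with_spaces=$(count_tabs 'HENRY COLE apple cherry')

printf 'tab-delimited line: %s tabs, space-delimited line: %s tabs\n' "$with_tabs" "$with_spaces"
```

A four-column line should report three tabs; zero means the delimiters were converted to spaces somewhere along the way.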

danielbmartin · 09-24-2014, 03:07 AM

Quote:

Originally Posted by firstfire

... it looks like you lost TABs in your data ...

I looked for corrupted data and the tabs are still there. Here is evidence.

This code ...

Code:

echo; echo "firstfire solution for Case 'c', doubled letter in the last name."
echo; echo "InFile..."; cat -A $InFile; echo "End Of File"
grep -E '^\w+\s+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\t]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\x09]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[	]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
echo; echo "InFile..."; cat -A $InFile; echo "End Of File"

... produced this result ...

Code:

firstfire solution for Case 'c', doubled letter in the last name.

InFile...
HENRY^ICOLE^Iapple^Icherry$
WOODROW^IJOHNSON^Ipotato^Itomato$
HERMAN^IHOOPER^Ilemon^Icelery$
WILLIAM^IKENNEDY^Ionion^Icabbage$
PAUL^IBASSETT^Ibanana^Ipeach$
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
HERMAN	HOOPER	lemon	celery
WILLIAM	KENNEDY	onion	cabbage
PAUL	BASSETT	banana	peach
End Of File

InFile...
HENRY^ICOLE^Iapple^Icherry$
WOODROW^IJOHNSON^Ipotato^Itomato$
HERMAN^IHOOPER^Ilemon^Icelery$
WILLIAM^IKENNEDY^Ionion^Icabbage$
PAUL^IBASSETT^Ibanana^Ipeach$
End Of File

Daniel B. Martin

firstfire · 09-24-2014, 04:39 AM

Use -P (--perl-regexp) instead of -E. \t and \x09 work only in Perl mode. \s is not supported in your version of grep. I don't know why your fourth grep command works; probably there is a literal TAB inside the [].
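
For anyone wondering why [\t] quietly matches nothing with -E: under POSIX rules a backslash inside a bracket expression is an ordinary character, so [\t] means "a backslash or the letter t". The sketch below illustrates that and shows a portable alternative for a grep built without -P, which is to put a real tab into the pattern through a shell variable. The helper name matches is invented for illustration.

```shell
tab=$(printf '\t')      # a literal tab character

# matches PATTERN TEXT: print 1 if grep -E finds PATTERN in TEXT, else 0.
matches() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then echo 1; else echo 0; fi
}

r1=$(matches '[\t]' "A${tab}B")      # backslash-t in brackets against a real tab
r2=$(matches '[\t]' 'cat')           # same pattern against text with a plain "t"
r3=$(matches "[$tab]" "A${tab}B")    # literal tab in brackets against a real tab

printf '%s %s %s\n' "$r1" "$r2" "$r3"
```

The first test fails to match, the second matches on the letter t, and only the third, with the literal tab, behaves the way [\t] does under -P.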

danielbmartin · 09-24-2014, 06:08 AM

Quote:

Originally Posted by firstfire

Use -P (--perl-regexp) instead of -E.

YESSS! All of the previously tried code variations work with -P. Thank you!

Daniel B. Martin