Old 09-22-2014, 07:19 AM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Finding doubled letters with grep


Have: a file which contains FirstName(TAB)LastName(TAB)blah(TAB)blahblah...
Quote:
HENRY COLE apple cherry
WOODROW JOHNSON potato tomato
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach
Want: to code grep to ...
a) find names which have a double letter anywhere in the name.
b) find names which have a double letter in the first name.
c) find names which have a double letter in the last name.
d) find names which have a double letter in both names (not necessarily the same letter).

"Double letter" means two adjacent characters are the same.
WILLIAM has one double letter, L.

This does "a" ...
Code:
cut -f1-2 $InFile |egrep '(.)\1'
This gets close to "d" ...
Code:
cut -f1-2 $InFile |egrep '(.)\1+.*(.)\2+'
How should "b", "c", and "d" be coded? Several attempts to introduce a tab character into the regular expression were unsuccessful.

Daniel B. Martin
 
Old 09-22-2014, 09:32 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

I think you are running afoul of the usual scenario where you are trying to use the wrong tool for the job.
You have already used two tools by including cut, so this is no longer a grep-only solution. You could use awk (or perl / ruby) instead and apply the same regexes you currently have to the individual columns.
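
One caveat with the awk route: POSIX awk regular expressions have no backreferences, so '(.)\1' cannot be reused there verbatim. A minimal sketch of the per-column idea, assuming tab-separated input; the helper names has_double and both_names and the sample file are made up for illustration:

```shell
# Hypothetical sample file: tab-separated rows from the first post.
names=$(mktemp)
printf 'HENRY\tCOLE\tapple\tcherry\n'       >  "$names"
printf 'WOODROW\tJOHNSON\tpotato\ttomato\n' >> "$names"
printf 'HERMAN\tHOOPER\tlemon\tcelery\n'    >> "$names"
printf 'WILLIAM\tKENNEDY\tonion\tcabbage\n' >> "$names"
printf 'PAUL\tBASSETT\tbanana\tpeach\n'     >> "$names"

# has_double(s) is 1 when s contains two identical adjacent characters.
# A loop stands in for the '(.)\1' backreference, which awk lacks.
both_names() {
    awk -F'\t' '
    function has_double(s,    i) {
        for (i = 1; i < length(s); i++)
            if (substr(s, i, 1) == substr(s, i + 1, 1))
                return 1
        return 0
    }
    # Case (d): a doubled letter in both the first and the last name.
    has_double($1) && has_double($2)
    ' "$1"
}

both_names "$names"
```

Changing the last pattern to has_double($1) or has_double($2) would give cases (b) and (c) in the same way.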
 
Old 09-22-2014, 09:37 AM   #3
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Hi, Daniel.

Case (b) -- double letters in the first name:
Code:
$ cat in
HENRY COLE apple cherry
WOODROW JOHNSON potato tomato
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach 
$ grep -E '^\w+(.)\1' in
WOODROW JOHNSON potato tomato
WILLIAM KENNEDY onion cabbage
Case (c) -- double letters in the last name:
Code:
$ grep -E '^\w+\s+\w+(.)\1' /tmp/in
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach
Case (d) -- double letters in both names:
Code:
$ grep -E '^\w+(.)\1\w*\s+\w+(.)\2' /tmp/in
WILLIAM KENNEDY onion cabbage
Or
Code:
$ grep -E '^(\w+(.)\2\w*\s+){2}' /tmp/in
WILLIAM KENNEDY onion cabbage
So, the trick is to anchor the RE to the beginning of the line using ^.
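
One thing worth checking with these patterns: \w+ requires at least one character before the doubled pair, so a name that starts with a doubled letter (say, a hypothetical AARON) would be skipped. A small sketch of a possible variant, assuming GNU grep and tab-separated input; the sample names and the function name first_name_double are invented for illustration:

```shell
# Invented test data: AARON starts with a doubled letter,
# and LLOYD has its doubled letter only in the last name.
sample=$(mktemp)
printf 'AARON\tSMITH\n'   >  "$sample"
printf 'WILLIAM\tJONES\n' >> "$sample"
printf 'JOHN\tLLOYD\n'    >> "$sample"

# Pattern from this post: needs one or more characters before the pair.
grep -E '^\w+(.)\1' "$sample"

# Variant: zero or more letters before the pair, and the pair itself
# must be a letter, so a doubled separator can never count.
first_name_double() {
    grep -E '^[[:alpha:]]*([[:alpha:]])\1' "$1"
}

first_name_double "$sample"
```

The first command should report only the WILLIAM line, while the variant also picks up AARON. The same change (* instead of +) would apply to the last-name part of cases (c) and (d).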

Last edited by firstfire; 09-22-2014 at 09:41 AM.
 
Old 09-22-2014, 09:44 AM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Quote:
Originally Posted by danielbmartin
Several attempts to introduce a tab character into the Regular Expression were unsuccessful.
Bash has ANSI-C quoting, which you could use for this, e.g. to solve b):
Code:
cut -f1-2 $InFile | grep -E $'^[^\t]*(.)\\1'
Or you could insert a literal tab, either using your text editor if it's a script, or Ctrl+V <tab> should work from the command line.
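
If the script has to run under a shell without $'...' quoting, a variable holding the tab does much the same job. A sketch for case (c), assuming tab-separated input and a grep that accepts backreferences with -E; the names tab, last_name_double and the sample file are only for illustration:

```shell
# Illustrative sample file with two tab-separated names per line.
in_file=$(mktemp)
printf 'HENRY\tCOLE\nHERMAN\tHOOPER\nWOODROW\tJOHNSON\n' > "$in_file"

# Store a literal tab once; command substitution keeps it intact.
tab=$(printf '\t')

# Case (c): skip the first field, then look for a doubled
# non-tab character inside the second field.
last_name_double() {
    grep -E "^[^${tab}]*${tab}[^${tab}]*([^${tab}])\\1" "$1"
}

last_name_double "$in_file"
```

Because every part of the pattern excludes the tab, the match cannot wander into a neighbouring field.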
 
Old 09-22-2014, 10:22 AM   #5
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

If it was me I'd write a program here.

Since the fields are all delimited by whitespace (TAB in the sample above) and the structure is deterministic, it's pretty simple to determine whether there are repeated characters: one pass in which I'd populate a list of flags per line indicating which of conditions a-d hold.

Or not even produce flags I guess, just output the report as I parsed the data. The problem is very open ended with respect to what you do when you find conditions.
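
A rough sketch of what such a one-pass report might look like in plain shell, assuming tab-separated input and a grep that accepts backreferences with -E. The function names has_double and report, the output format and the sample file are all made up for illustration:

```shell
# Illustrative input: tab-separated, as described in the first post.
data=$(mktemp)
printf 'HENRY\tCOLE\tapple\tcherry\n'       >  "$data"
printf 'WOODROW\tJOHNSON\tpotato\ttomato\n' >> "$data"
printf 'WILLIAM\tKENNEDY\tonion\tcabbage\n' >> "$data"
printf 'PAUL\tBASSETT\tbanana\tpeach\n'     >> "$data"

tab=$(printf '\t')

# Succeeds when the argument contains two identical adjacent characters.
has_double() {
    printf '%s\n' "$1" | grep -Eq '(.)\1'
}

# One pass over the file, printing the conditions (a-d) each line meets.
report() {
    while IFS="$tab" read -r first last rest; do
        flags=""
        b=no
        c=no
        if has_double "$first"; then b=yes; fi
        if has_double "$last";  then c=yes; fi
        if [ "$b" = yes ] || [ "$c" = yes ]; then flags="a"; fi
        if [ "$b" = yes ]; then flags="$flags b"; fi
        if [ "$c" = yes ]; then flags="$flags c"; fi
        if [ "$b" = yes ] && [ "$c" = yes ]; then flags="$flags d"; fi
        printf '%s %s: %s\n' "$first" "$last" "${flags:-none}"
    done < "$1"
}

report "$data"
```

For the sample rows this prints "none" for HENRY COLE, "a b" for WOODROW JOHNSON, "a b c d" for WILLIAM KENNEDY and "a c" for PAUL BASSETT.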

If the OP hadn't been around the forum for as long as they have, I'd wonder if this was some sort of homework problem. That is interesting too: someone posted a challenge thread about reputation and whether or not some are addicted to it. Having seen DBM's name around, plus the statistics and rep, pretty much settled that I wouldn't make that accusation. Here I'm assuming it's a parallel to a problem you're trying to solve, or you just playin' around with grep + regex.
 
Old 09-22-2014, 11:20 AM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Quote:
Originally Posted by rtmistler
Here I'm assuming it's a parallel to a problem you're trying to solve, or you just playin' around with grep + regex.
Just "playin' around."

All my programming is recreational. I'm retired and dabble in Linux programming in an effort to stave off old-age brain rot. My best "teacher" is this forum, augmented by on-line tutorials and Google. I dream up interesting problems and solve them, as learning exercises. When stumped I post here. I do give Rep points to all who offer constructive responses.

Daniel B. Martin
 
Old 09-23-2014, 10:38 AM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Hello firstfire.

Your solutions ...
Code:
Case (b) -- double letters in the first name:
$ grep -E '^\w+(.)\1' in

Case (c) -- double letter in the last name:
$ grep -E '^\w+\s+\w+(.)\1' /tmp/in

Case (d) -- double letters in both names:
$ grep -E '^\w+(.)\1\w*\s+\w+(.)\2' /tmp/in
$ grep -E '^(\w+(.)\2\w*\s+){2}' /tmp/in
... were plainly tested before you posted them, and they work. However, on my PC they produce empty output files. The code doesn't crash, it doesn't trigger error messages, it runs and outputs nothing.

I suspect this is because I am running Ubuntu 10.04, a back-level version.
Code:
daniel@daniel-desktop:~$ grep --version
GNU grep 2.5.4
I've tried to install later versions of Ubuntu and (so far) my computer pukes so I continue to limp along with 10.04. One of these days I will try again!

Daniel B. Martin
 
Old 09-23-2014, 11:26 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

I see nothing special about the regexes provided that would not work in an older version.
If you simply replace \w and \s with their character-class counterparts (such as [[:alnum:]_] and [[:space:]]), do you get output?
 
Old 09-23-2014, 01:39 PM   #9
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Hi, Daniel.

I compiled grep v2.5.4 and can confirm that it does not work. After some research, it turned out that \s and \S were introduced only in this commit, about halfway to version 2.6. You may compile and install a more recent version of grep (I have v2.16), install a newer version of the OS, or use [[:space:]] (or a bracket expression containing a literal space and a literal TAB) instead of \s in your regexes.
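
For reference, a sketch of the same three patterns with POSIX bracket classes in place of \w and \s, so that they do not depend on the newer shorthand. It assumes a grep that accepts backreferences with -E (GNU grep does); the function names and the sample file are made up here:

```shell
# Made-up sample file using the rows from the first post.
f=$(mktemp)
printf 'HENRY\tCOLE\tapple\tcherry\n'       >  "$f"
printf 'WOODROW\tJOHNSON\tpotato\ttomato\n' >> "$f"
printf 'HERMAN\tHOOPER\tlemon\tcelery\n'    >> "$f"
printf 'WILLIAM\tKENNEDY\tonion\tcabbage\n' >> "$f"
printf 'PAUL\tBASSETT\tbanana\tpeach\n'     >> "$f"

# [[:alnum:]_] stands in for \w and [[:space:]] for \s.
W='[[:alnum:]_]'
S='[[:space:]]'

# Case (b): doubled letter in the first name.
dbl_first() { grep -E "^$W+(.)\\1" "$1"; }
# Case (c): doubled letter in the last name.
dbl_last()  { grep -E "^$W+$S+$W+(.)\\1" "$1"; }
# Case (d): doubled letter in both names.
dbl_both()  { grep -E "^$W+(.)\\1$W*$S+$W+(.)\\2" "$1"; }

dbl_first "$f"
dbl_last "$f"
dbl_both "$f"
```

On the sample rows these should select the same lines as the \w and \s versions shown earlier in the thread.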
 
Old 09-23-2014, 06:25 PM   #10
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
More has been learned but new questions arise.

This code segment ...
Code:
echo; echo "firstfire solution for Case 'c', doubled letter in the last name."
grep -E '^\w+\s+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\t]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\x09]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[	]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
... produced these results ...
Code:
firstfire solution for Case 'c', doubled letter in the last name.
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
HERMAN	HOOPER	lemon	celery
WILLIAM	KENNEDY	onion	cabbage
PAUL	BASSETT	banana	peach
End Of File
We see that my back-level grep understands \w but not \s. I tried three substitutes for \s thinking them equivalent, but they aren't.
Code:
Substituting [\t] didn't work.  Why?
Substituting [\x09] didn't work.  Why?
Substituting [  ] did work but is not readable.
(That was keyed as left bracket, tab, right bracket.)
Please show how [\t] can be made to work.

Daniel B. Martin
 
Old 09-24-2014, 12:11 AM   #11
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Hi.

The fact that [ ] (single space in square brackets) worked suggests that your data is actually SPACE delimited. In that case TAB (\t) would not be matched as a delimiter. Note that a single character inside square brackets is usually (but not always) redundant and equivalent to the character itself. [abc] matches any single character from the list: either a or b or c. Square brackets are useful when you want to match either space or tab: [ \t] (this stands for a literal SPACE and a literal TAB inside []).

To match a character by its code, use the -P flag (it is supported in v2.5.4).

So, it looks like you lost the TABs in your data, probably when copy-pasting it. To check, you may use hexdump or, better in this case, sed:
Code:
$ echo -e 'A \tB'
A 	B
$ echo -e 'A \tB' | sed -n 'l'
A \tB$
The 'l' command prints the line in a visually unambiguous form: TAB becomes \t, and $ marks the end of the line.
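
If hexdump is not installed, od -c gives a similar byte-by-byte view, and grep can count the lines that really contain a TAB. A small sketch; the file contents and the function name count_tab_lines are invented for illustration:

```shell
# Illustrative file: one tab-separated line, one space-separated line.
probe=$(mktemp)
printf 'HENRY\tCOLE\nPAUL BASSETT\n' > "$probe"

# Store a literal tab once; works in any POSIX shell.
tab=$(printf '\t')

# Count lines that contain at least one literal TAB.
count_tab_lines() {
    grep -c "$tab" "$1"
}

# Show every byte of the file; od -c prints a TAB as \t.
od -c "$probe"

count_tab_lines "$probe"
```

Here the count covers only the first line, since the second one is separated by a space.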
 
Old 09-24-2014, 03:07 AM   #12
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Quote:
Originally Posted by firstfire
... it looks like you lost TABs in your data ...
I looked for corrupted data and the tabs are still there. Here is evidence.

This code ...
Code:
echo; echo "firstfire solution for Case 'c', doubled letter in the last name."
echo; echo "InFile..."; cat -A $InFile; echo "End Of File"
grep -E '^\w+\s+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\t]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\x09]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[	]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
echo; echo "InFile..."; cat -A $InFile; echo "End Of File"
... produced this result ...
Code:
firstfire solution for Case 'c', doubled letter in the last name.

InFile...
HENRY^ICOLE^Iapple^Icherry$
WOODROW^IJOHNSON^Ipotato^Itomato$
HERMAN^IHOOPER^Ilemon^Icelery$
WILLIAM^IKENNEDY^Ionion^Icabbage$
PAUL^IBASSETT^Ibanana^Ipeach$
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
HERMAN	HOOPER	lemon	celery
WILLIAM	KENNEDY	onion	cabbage
PAUL	BASSETT	banana	peach
End Of File

InFile...
HENRY^ICOLE^Iapple^Icherry$
WOODROW^IJOHNSON^Ipotato^Itomato$
HERMAN^IHOOPER^Ilemon^Icelery$
WILLIAM^IKENNEDY^Ionion^Icabbage$
PAUL^IBASSETT^Ibanana^Ipeach$
End Of File
Daniel B. Martin
 
Old 09-24-2014, 04:39 AM   #13
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Use -P (--perl-regexp) instead of -E. \t and \x09 work only in Perl mode, and \s is not supported with -E in your version of grep. Your 4th grep command probably works because there is a literal TAB inside [].
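
To illustrate why [\t] found nothing with -E: inside a bracket expression the backslash is an ordinary character there, so [\t] means "a backslash or the letter t". A small sketch; the variable and function names are made up, and the -P part only runs if this grep was built with PCRE support:

```shell
tab_line=$(printf 'A\tB')   # a real TAB, but no 't' and no backslash
t_line='tea'                # contains the letter t, no TAB

# Succeeds when the ERE bracket expression [\t] matches the argument.
ere_matches() {
    printf '%s\n' "$1" | grep -Eq '[\t]'
}

if ere_matches "$tab_line"; then
    printf '%s\n' 'ERE [\t] matched the TAB line'
fi
if ere_matches "$t_line"; then
    printf '%s\n' 'ERE [\t] matched the letter t'
fi

# With -P, \t inside [] is a TAB (skipped when -P is unavailable).
if printf 'x\n' | grep -Pq 'x' 2>/dev/null; then
    if printf '%s\n' "$tab_line" | grep -Pq '[\t]'; then
        printf '%s\n' 'PCRE [\t] matched the TAB line'
    fi
fi
```

With -E only the "letter t" message should appear, which is consistent with the empty Work3 files above.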
 
Old 09-24-2014, 06:08 AM   #14
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Quote:
Originally Posted by firstfire
Use -P (--perl-regexp) instead of -E.
YESSS! All of the previously tried code variations work with -P. Thank you!

Daniel B. Martin
 
  

