Have: a file which contains FirstName(TAB)LastName(TAB)blah(TAB)blahblah...
Quote:
HENRY COLE apple cherry
WOODROW JOHNSON potato tomato
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach
Want: to code grep to ...
a) find names which have a double letter anywhere in the name.
b) find names which have a double letter in the first name.
c) find names which have a double letter in the last name.
d) find names which have a double letter in both names (not necessarily the same letter).
"Double letter" means two adjacent characters are the same.
WILLIAM has one double letter, L.
This does "a" ...
Code:
cut -f1-2 $InFile |egrep '(.)\1'
This gets close to "d" ...
Code:
cut -f1-2 $InFile |egrep '(.)\1+.*(.)\2+'
How should "b", "c", and "d" be coded? Several attempts to introduce a tab character into the Regular Expression were unsuccessful.
I think you are running afoul of the usual scenario: trying to use the wrong tool for the job.
You have already used two tools by including cut, so this is not a grep-only solution anyway. Use awk (or perl / ruby) and apply the same regexes you currently have to the individual columns.
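One caveat with the awk route: awk regular expressions have no backreferences, so `(.)\1` cannot be reused there as written (Perl and Ruby do support it). A minimal sketch of the per-column idea, assuming tab-separated fields and comparing adjacent characters instead. The names `double_names` and `has_double` are made up for this illustration.

```shell
# Sketch only: classify names by doubled letters with awk, one field at a time.
# awk regexes lack backreferences, so has_double() compares adjacent characters.
# double_names MODE reads tab-separated lines on stdin; MODE is a, b, c, or d.
double_names() {
    awk -F'\t' -v mode="$1" '
        # Return 1 if string s contains two identical adjacent characters.
        function has_double(s,    i) {
            for (i = 1; i < length(s); i++)
                if (substr(s, i, 1) == substr(s, i + 1, 1))
                    return 1
            return 0
        }
        {
            first = has_double($1)
            last  = has_double($2)
            if ((mode == "a" && (first || last)) ||
                (mode == "b" && first) ||
                (mode == "c" && last) ||
                (mode == "d" && first && last))
                print
        }'
}

# Example: case (d) on two of the sample rows from the question.
printf 'HERMAN\tHOOPER\tlemon\tcelery\nWILLIAM\tKENNEDY\tonion\tcabbage\n' |
    double_names d
```

Because each name is tested as its own field, no anchoring or delimiter matching is needed inside the test itself.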
Case (b) -- double letters in the first name:
Code:
$ cat in
HENRY COLE apple cherry
WOODROW JOHNSON potato tomato
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach
$ grep -E '^\w+(.)\1' in
WOODROW JOHNSON potato tomato
WILLIAM KENNEDY onion cabbage
Case (c) -- double letter in the last name:
Code:
$ grep -E '^\w+\s+\w+(.)\1' /tmp/in
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach
Case (d) -- double letters in both names:
Code:
$ grep -E '^\w+(.)\1\w*\s+\w+(.)\2' /tmp/in
WILLIAM KENNEDY onion cabbage
Or
Code:
$ grep -E '^(\w+(.)\2\w*\s+){2}' /tmp/in
WILLIAM KENNEDY onion cabbage
So, the trick is to anchor the RE to the beginning of the line using ^.
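The anchored patterns can be wrapped as small stdin filters. This is a sketch rather than the posted solution: it assumes a GNU grep recent enough to know \w and \s, and it swaps `\w+(.)` for `\w*(\w)` so that a name beginning with the doubled letter is also caught (with `\w+` at least one letter must precede the pair). The function names and the AARON row are invented for illustration.

```shell
# Sketch only: cases (b), (c), (d) as filters reading stdin.
# Assumes GNU grep with \w and \s support and whitespace-separated names.
first_double() { grep -E '^\w*(\w)\1'; }
last_double()  { grep -E '^\w+\s+\w*(\w)\1'; }
both_double()  { grep -E '^\w*(\w)\1\w*\s+\w*(\w)\2'; }

# A hypothetical row whose first name starts with the doubled letter:
printf 'AARON\tSMITH\tkiwi\tleek\n' | first_double
```

Note that the `{2}` variant posted above expects whitespace after the last name as well, so it relies on a third column being present.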
Since everything is delimited by whitespace and the structure is deterministic, it's pretty simple to determine whether there are repeated characters: one pass in which I'd populate a list of flags per line indicating conditions a-d, inclusive.
Or not even produce flags, I guess; just output the report as I parse the data. The problem is very open-ended with respect to what you do when you find the conditions.
If the OP hadn't been around the forum for as long as they have, I'd wonder whether this was some sort of homework problem. Interestingly, someone posted a challenge thread about reputation and whether or not some people are addicted to it. Having seen DBM's name around, plus the statistics and rep, pretty much settled that I wouldn't make that accusation. Here I'm assuming it's a parallel to a problem you're trying to solve, or you just playin' around with grep + regex.
Quote:
Here I'm assuming it's a parallel to a problem you're trying to solve, or you just playin' around with grep + regex.
Just "playin' around."
All my programming is recreational. I'm retired and dabble in Linux programming in an effort to stave off old-age brain rot. My best "teacher" is this forum, augmented by on-line tutorials and Google. I dream up interesting problems and solve them, as learning exercises. When stumped I post here. I do give Rep points to all who offer constructive responses.
Case (b) -- double letters in the first name:
$ grep -E '^\w+(.)\1' in
Case (c) -- double letter in the last name:
$ grep -E '^\w+\s+\w+(.)\1' /tmp/in
Case (d) -- double letters in both names:
$ grep -E '^\w+(.)\1\w*\s+\w+(.)\2' /tmp/in
$ grep -E '^(\w+(.)\2\w*\s+){2}' /tmp/in
... were plainly tested before you posted them, and they work. However, on my PC they produce empty output files. The code doesn't crash, it doesn't trigger error messages, it runs and outputs nothing.
I suspect this is because I am running Ubuntu 10.04, a back-level version.
Code:
daniel@daniel-desktop:~$ grep --version
GNU grep 2.5.4
I've tried to install later versions of Ubuntu and (so far) my computer pukes so I continue to limp along with 10.04. One of these days I will try again!
I see nothing special about the regexes provided that would not work in an older version.
If you simply replace \w and \s with their character class counterparts, do you get output?
I compiled grep v2.5.4 and can confirm that it does not work. After some research it turned out that \s and \S were introduced only in this commit, halfway to version 2.6. You may compile and install a more recent version of grep (I have v2.16), install a newer version of the OS, or use [[:space:]] (or something like [ \t\r] together with -P, or just a literal space) instead of \s in your regexes.
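For a grep that predates \s, the character-class substitution might look like the following. A minimal sketch, assuming `[[:alnum:]_]` as the stand-in for \w (GNU grep documents \w as a synonym for `[_[:alnum:]]`) and `[[:space:]]` for \s; the function name is hypothetical.

```shell
# Sketch only: case (c) without \w or \s, using POSIX character classes.
# [[:alnum:]_] stands in for \w and [[:space:]] for \s.
last_double_posix() {
    grep -E '^[[:alnum:]_]+[[:space:]]+[[:alnum:]_]+(.)\1'
}

# Two sample rows from the question; only the HOOPER row has a doubled
# letter in the last name.
printf 'HERMAN\tHOOPER\tlemon\tcelery\nHENRY\tCOLE\tapple\tcherry\n' |
    last_double_posix
```

The pattern is otherwise the one posted for case (c), so it accepts either a space or a tab between the names.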
This code ...
Code:
echo; echo "firstfire solution for Case 'c', doubled letter in the last name."
grep -E '^\w+\s+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\t]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\x09]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[ ]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
... produced these results ...
Code:
firstfire solution for Case 'c', doubled letter in the last name.
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach
End Of File
We see that my back-level grep understands \w but not \s. I tried three substitutes for \s thinking them equivalent, but they aren't.
Code:
Substituting [\t] didn't work. Why?
Substituting [\x09] didn't work. Why?
Substituting [ ] did work but is not readable.
(That was keyed as left bracket, tab, right bracket.)
The fact that [ ] (a single space in square brackets) worked tells me that your data is actually SPACE delimited. Therefore TAB (\t) will not be matched as a delimiter. Note that a single character inside square brackets is usually (but not always) redundant and equivalent to the character itself. [abc] matches any single character from the list: either a or b or c. Square brackets are useful when you want to match either space or tab: [ \t] (this is SPACE and TAB inside []).
To match a character by its code, use the -P flag (it is supported in v2.5.4).
So, it looks like you lost TABs in your data, probably when copy-pasting it. To check you may use hexdump or, better in this case, use sed:
Code:
$ echo -e 'A \tB'
A B
$ echo -e 'A \tB' | sed -n 'l'
A \tB$
The 'l' command prints the line in a visually unambiguous form: TAB becomes \t, and $ marks the end of the line.
I looked for corrupted data and the tabs are still there. Here is evidence.
This code ...
Code:
echo; echo "firstfire solution for Case 'c', doubled letter in the last name."
echo; echo "InFile..."; cat -A $InFile; echo "End Of File"
grep -E '^\w+\s+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\t]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[\x09]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
grep -E '^\w+[ ]+\w+(.)\1' $InFile >$Work3
echo "Work3 ..."; cat $Work3; echo "End Of File"
echo; echo "InFile..."; cat -A $InFile; echo "End Of File"
... produced this result ...
Code:
firstfire solution for Case 'c', doubled letter in the last name.
InFile...
HENRY^ICOLE^Iapple^Icherry$
WOODROW^IJOHNSON^Ipotato^Itomato$
HERMAN^IHOOPER^Ilemon^Icelery$
WILLIAM^IKENNEDY^Ionion^Icabbage$
PAUL^IBASSETT^Ibanana^Ipeach$
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
End Of File
Work3 ...
HERMAN HOOPER lemon celery
WILLIAM KENNEDY onion cabbage
PAUL BASSETT banana peach
End Of File
InFile...
HENRY^ICOLE^Iapple^Icherry$
WOODROW^IJOHNSON^Ipotato^Itomato$
HERMAN^IHOOPER^Ilemon^Icelery$
WILLIAM^IKENNEDY^Ionion^Icabbage$
PAUL^IBASSETT^Ibanana^Ipeach$
End Of File
Use -P (--perl-regexp) instead of -E. \t and \x09 work only in Perl mode. \s is not supported in your version of grep. I don't know why your 4th grep command works; probably there is a literal TAB inside the [].
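This fits the earlier empty results: in -E mode a backslash inside a bracket expression is taken literally, so [\t] matches a backslash or the letter t, not a TAB. A minimal sketch of two ways to require a real TAB between the names, assuming a POSIX shell and, for the second variant, a grep built with Perl regex support. The function names are made up for this illustration.

```shell
# Sketch only: two ways to require a TAB between the names (case c).
# Variant 1: build a literal TAB with printf and splice it into an ERE.
tab=$(printf '\t')
last_double_tab() { grep -E '^\w+'"$tab"'+\w+(.)\1'; }

# Variant 2: Perl mode, where \t is understood (needs a grep built with PCRE).
last_double_perl() { grep -P '^\w+\t+\w+(.)\1'; }

# The tab-separated row matches; the space-separated copy does not.
printf 'PAUL\tBASSETT\tbanana\tpeach\nPAUL BASSETT banana peach\n' |
    last_double_tab
```

Variant 1 keeps the TAB visible as a named variable, which avoids the readability problem of typing an invisible TAB between the brackets.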