Conditional grepping

ut0ugh1 · 10-26-2011, 03:41 PM

I would like to build a filter from multiple conditional greps, something like:
echo -n "whatever24characterlong" | grep '[bcdfghjklmnpqrstvzxyw]\{16,24\}'| grep '[aeiou]\{0,6\}'| grep '[0123456789]\{0,6\}'|
But I would also like to exclude words with more than 4 subsequent vowels, 4 subsequent consonants, or 3 subsequent numbers, and words with more than 15 different consonants. Can you help me, please? Thanks.

jhwilliams · 10-26-2011, 03:49 PM

You do not at any point ask a question, and your writing is hard to read, which makes it difficult to respond adequately.

Look at lex if you're doing regexes more seriously than a string of greps can provide.

http://en.wikipedia.org/wiki/Lex_(so..._of_a_lex_file

ut0ugh1 · 10-26-2011, 04:36 PM

I would like to pull all the 24-character words out of a file with multiple greps, as above:
grep '[bcdfghjklmnpqrstvzxyw]\{16,24\}'| grep '[aeiou]\{0,6\}'| grep '[0123456789]\{0,6\}'|
But I would like to exclude words with more than 4 subsequent vowels, 4 subsequent consonants, or 3 subsequent numbers, and words with more than 15 different consonants.
So acceptable words would have 16 to 24 consonants, 0 to 6 vowels, and 0 to 6 numbers, with no more than 4 subsequent vowels, 4 subsequent consonants, or 3 subsequent numbers, and no more than 15 different consonants.
Accepted example:
bcddddeffghklmnn0aezxwsw
I am almost a complete newbie. Thanks once more.

David the H. · 10-26-2011, 05:46 PM

First of all, please use [code][/code] tags around your code, to preserve formatting and to improve readability.

Second, please give us a real-life example of the input text, and what kind of output you want from it. Also, could you explain your purpose for wanting to do this, so we can understand the context better?

And what exactly is the problem you're having with the code you have already?

Your first grep pattern in particular seems off to me. Does your input really have lines with strings of 16-24 consecutive consonants in them?

ut0ugh1 · 10-26-2011, 06:34 PM

Code:

| grep '[bcdfghjklmnpqrstvzxyw]\{16,24\}'| grep '[aeiou]\{0,6\}'| grep '[0123456789]\{0,6\}'|

Sorry, by "not more than 4 subsequent vowels, 4 subsequent consonants, 3 subsequent numbers and exclude words with more than 15 different consonants" I mean: not more than 4 identical subsequent vowels (e.g. not xxxxaaaaaxxxxxxxxxxxxxxx, xxxxxxxxxxeeeeexxxxxxxxx), not more than 4 identical subsequent consonants (e.g. not xxbbbbbxxxxxxxxxxxxxxxxx, xxxxxxxxxxxxxxxzzzzzxxxx), not more than 3 identical subsequent numbers (e.g. not xxxxxxxxxx1111xxxxxxxxxx, xxxxxxxxxxxxxxxxxxxx0000), and not more than 15 different consonants (e.g. not 1bcdfufghlmnkpkqarrstv).
Accepted examples:
bcddddeffghklmnn0aezxwsw
dvq7umsylnrfzdd2qgmgofmt
wgammjawtnedivjxpgzcynx9
qicqulsmcrbmuampwatk7hih

David the H. · 10-27-2011, 12:05 PM

Sorry, it's still not clear to me. You'll have to break it down in more detail. First, as I asked, please give us a real-life example of the text, including both lines that you want, and lines that you don't want. Put it in code tags, to keep the formatting (this would also allow us to test possible solutions).

Then, please detail exactly what criteria constitute a desired line, and what constitutes excluded lines. Break it down into simple steps or sections, if possible (e.g. each line must first have ..., then ...., but not ....), with examples. And please separate your points with more whitespace. The solid blocks of text you're using are hard to read.

Also, let's make sure your terms are correct. Subsequent means "following", or "coming after". If you have "AB CD", then "CD" is subsequent to "AB". Consecutive means "in a continuous, unbroken string". "AAACCC" is three consecutive "A"s followed by three consecutive "C"s.

I do hope you realize that a chain of greps like you posted causes each one to filter the output of the previous command. It doesn't directly analyze the sequence inside each line.

For example, this appears to be what your grep commands do now (and actually, you should be using egrep/grep -E):

Input file (file.txt):

Code:

bcddddeffghklmnn0aezxwsw
dvq7umsylnrfzdd2qgmgofmt
wgammjawtnedivjxpgzcynx9
qicqulsmcrbmuampwatk7hih
bcdfghjklmnpqrstvwxaeiou
bcdfghjklmnpqrst234aeiou
aei12bcdfghjklmnpqrst01a

1) Your first grep matches and prints out strings of 16-24 consecutive lowercase consonants:

Code:

$ egrep '[bcdfghjklmnpqrstvzxyw]{16,24}' file.txt
bcdfghjklmnpqrstvwxaeiou
bcdfghjklmnpqrst234aeiou
aei12bcdfghjklmnpqrst01a

Notice that only the last three lines that I added match, because only they have 16+ consecutive consonants. None of the strings you gave above match this rule.

2) From the output of the last grep, match 0-6 consecutive vowels:

Code:

$ ...| egrep '[aeiou]{0,6}'
bcdfghjklmnpqrstvwxaeiou
bcdfghjklmnpqrst234aeiou
aei12bcdfghjklmnpqrst01a

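All three previous lines match, but probably not in the way you want them to. Because {0,6} also accepts zero vowels, the pattern matches any line at all. The last line additionally contains more than one run of vowels, so it actually matches twice.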

3) Finally, find strings of 0-6 digits from the previous output:

Code:

$ ...| egrep '[0123456789]{0,6}'
bcdfghjklmnpqrstvwxaeiou
bcdfghjklmnpqrst234aeiou
aei12bcdfghjklmnpqrst01a

Again, all three match, but the first one matches because it has zero digits in it, and the last one again has multiple matches.
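
If the real intent of the three greps is a limit on the total number of vowels or digits anywhere in the line, a count per character class may be closer to it than a run-length pattern. The following is only a sketch under that assumption; count_class is a hypothetical helper name, not something from the original commands.

```shell
# A line with no digits at all still matches, because {0,6} accepts an empty match.
printf 'nodigitshere\n' | grep -cE '[0-9]{0,6}'    # prints 1

# Hypothetical helper: count how many characters of a given set occur in a
# string, regardless of where they sit in the line.
#   $1 = string to inspect
#   $2 = character set in tr syntax, e.g. 'aeiou' or '0-9'
count_class() {
    # tr -cd deletes everything that is NOT in the set; wc -c counts the rest.
    printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d ' '
}

count_class aei12bcdfghjklmnpqrst01a 'aeiou'   # prints 4
count_class aei12bcdfghjklmnpqrst01a '0-9'     # prints 4
```

A total like this can then be compared against a limit with an ordinary shell test, which is something a bare interval expression in grep cannot express.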

It seems to me that what you really want is a context-sensitive match, with each section depending on what comes before it in the string. Now if you could explain exactly what a single line pattern should be then perhaps you can build it into a single regex. That is, if you wanted something like the following:

[16-24 consonants] followed by [0-6 vowels] followed by [0-6 digits]

Then a single grep like this would match the above text like so:

Code:

$ egrep '[bcdfghjklmnpqrstvzxyw]{16,24}[aeiou]{0,6}[0123456789]{0,6}' file.txt
bcdfghjklmnpqrstvwxaeiou
bcdfghjklmnpqrst234aeiou
aei12bcdfghjklmnpqrst01a
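
Note that the second and third lines get through only because the vowel and digit parts are allowed to be empty and the pattern is free to match anywhere in the line. If the layout is meant to describe the entire line, one possible sketch is to anchor the pattern with grep's -x option; match_whole_line is a hypothetical name used only for this illustration.

```shell
# Hypothetical helper: keep only lines that consist ENTIRELY of
# 16-24 consonants, then 0-6 vowels, then 0-6 digits, in that order.
# -x makes grep require the pattern to cover the whole line.
match_whole_line() {
    grep -xE '[bcdfghjklmnpqrstvwxyz]{16,24}[aeiou]{0,6}[0-9]{0,6}'
}

printf '%s\n' \
    bcdfghjklmnpqrstvwxaeiou \
    bcdfghjklmnpqrst234aeiou \
    aei12bcdfghjklmnpqrst01a | match_whole_line
# prints only: bcdfghjklmnpqrstvwxaeiou
```

The other two lines are rejected because their digits come before their vowels, or because the line starts with vowels.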

But this still wouldn't fulfill your final requirement of no more than 15 different consonants. That's not something grep/regex can do on its own. You'd need some kind of function to go through the string and count the number of different characters in it, then test that number for compliance.
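
As a rough idea of what such a counting function could look like in plain shell, here is a minimal sketch. It assumes a grep that supports the -o option (GNU and BSD grep do, but it is not required by POSIX), and distinct_consonants is a made-up name.

```shell
# Hypothetical helper: print the number of DIFFERENT consonants in a string.
distinct_consonants() {
    # grep -o prints each matching character on its own line,
    # sort -u removes the duplicates, wc -l counts what is left.
    printf '%s\n' "$1" |
        grep -o '[bcdfghjklmnpqrstvwxyz]' |
        sort -u |
        wc -l |
        tr -d ' '
}

distinct_consonants bcddddeffghklmnn0aezxwsw   # prints 14
distinct_consonants 1bcdfufghlmnkpkqarrstv     # prints 16
```

The first string is one of the accepted examples and stays under the limit of 15; the second is the rejected example and goes over it.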

So I think that you really need to do as jhwilliams suggested and use a real lexical parser, something that can analyze the whole string in context according to your desired rules. Or at least to use a full-featured text-processing language like perl. What you want is probably too complex for a few simple grep commands.
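
For illustration only, here is one way the rules might be combined in shell with grep and awk. This is a sketch based on my reading of the requirements as stated in this thread (24 lowercase alphanumeric characters, at least 16 consonants, at most 6 vowels and 6 digits, no letter repeated more than 4 times in a row, no digit repeated more than 3 times in a row, at most 15 different consonants). That reading may well be wrong, and filter_words is a hypothetical name.

```shell
# Hypothetical filter: reads candidate words on stdin, prints the accepted ones.
filter_words() {
    # 1) the whole line must be exactly 24 lowercase letters or digits
    # 2) drop lines where the same letter occurs 5 or more times in a row
    # 3) drop lines where the same digit occurs 4 or more times in a row
    grep -x '[a-z0-9]\{24\}' |
    grep -v '\([a-z]\)\1\{4\}' |
    grep -v '\([0-9]\)\1\{3\}' |
    awk '{
        s = $0
        v = gsub(/[aeiou]/, "", s)   # number of vowels; s loses them
        d = gsub(/[0-9]/, "", s)     # number of digits; s is now consonants only
        c = length(s)                # number of consonants
        n = 0                        # number of different consonants
        split("", seen)              # portable way to empty the array
        for (i = 1; i <= c; i++) {
            ch = substr(s, i, 1)
            if (!(ch in seen)) { seen[ch] = 1; n++ }
        }
        if (c >= 16 && v <= 6 && d <= 6 && n <= 15) print
    }'
}

printf '%s\n' \
    bcddddeffghklmnn0aezxwsw \
    dvq7umsylnrfzdd2qgmgofmt \
    bcdfghjklmnpqrstvwxaeiou \
    xxxxaaaaaxxxxxxxxxxxxxxx | filter_words
# prints the first two words only
```

The greps handle the rules a regex is good at (overall shape and repeated characters), while awk does the counting that a regex cannot do on its own.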

ut0ugh1 · 10-27-2011, 01:31 PM

I am a newbie, so please tell me how to obtain, from alphanumeric 24-character words, words such as
dvq7umsylnrfzdd2qgmgofmt
wgammjawtnedivjxpgzcynx9
qicqulsmcrbmuampwatk7hih
with grep or sed. Thanks.

crabboy · 10-27-2011, 02:12 PM

Seems to me like you are still up to no good, but this time omitting your intent.

http://www.linuxquestions.org/questi...number-910006/

Closing thread.