Yet another regex problem

Linux_Kidd · 11-21-2011, 11:37 AM

Without lookarounds, how would you write a regex that eliminates as much noise as possible? I have to use grep -E.

I am looking for:
(X)anything
or
(X) anything

but I want to exclude the noise below (or as much of it as possible):
(X)word
(X) word

grail · 11-21-2011, 11:44 AM

You might need to explain more ... what is the difference between 'word' and 'anything'?

Linux_Kidd · 11-21-2011, 12:08 PM

"anything" is random string
"word" is specific string

I'm grepping for the pattern

Code:

"(X)anything" 
or
"(X) anything"

but I wish to exclude from that set

Code:

"(X)word" 
or
"(X) word"

As an example:

Code:

\(X\)[^w]

This excludes "(X)word" (along with anything else that starts with "w"), but it still matches "(X)█word" (the block stands for a space, shown for visibility).

A regex that eliminates more than desired is OK; I just wish to minimize the exclusion set. It's a PITA w/o lookarounds, so I'm just seeing what you guys might suggest.
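
The behaviour described in this post can be reproduced with a few lines of test input. This is only a sketch: the sample lines are made up for illustration, and `LC_ALL=C` is set so the bracket expression behaves the same in every locale.

```shell
# Hypothetical sample lines; "word" plays the role of the noise string.
sample='(X)word
(X) word
(X)water
(X)apple'

# The attempt from the post: reject only a "w" directly after "(X)".
hits=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '\(X\)[^w]')
printf '%s\n' "$hits"
# Prints:
#   (X) word    <- noise still gets through, because a space is "not w"
#   (X)apple
# "(X)water" is lost even though it is not noise.
```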

sycamorex · 11-21-2011, 02:22 PM

This is what I came up with:

Code:

grep -E  "\(X\)[[:blank:]]?[^ w]" file

I don't think it's what you're after as it'll exclude other words starting with "w"

To eliminate the exact word "word":

Code:

grep -E  "\(X\)[[:blank:]]?[^ ]" file | grep -v "word"

Linux_Kidd · 11-21-2011, 03:16 PM

sycamorex,
Excluding all words that begin with "w" produces a very big exclusion set. I was trying to make that exclusion set as small as possible. Also, I can only use a single regex with grep -E (no piping or POSIX char sets available, etc).

I came up with this:

Code:

\(X\)([ ][a-z]{2}[^r]|[a-z]{2}[^r])

I am not 100% sure whether the set of words that have "r" as the 3rd char is larger or smaller than the set of words that have "w" as the 1st char. I guess I would need to find some word analytics and choose the not-char position that produces the smallest set. This method is a PITA, especially when the "word" changes.
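
The "word analytics" idea can be tried on any word list. Below is a minimal sketch, assuming a hypothetical helper named `count_at` and a tiny made-up word list; a real comparison would feed in a dictionary file or the words that actually follow "(X)" in the data.

```shell
# count_at POS CHAR (hypothetical helper): count the words on stdin whose
# POS-th character (1-based) is CHAR.
count_at() {
    awk -v pos="$1" -v ch="$2" \
        'substr($0, pos, 1) == ch { n++ } END { print n + 0 }'
}

# Made-up word list, for illustration only.
words='water
word
worm
care
barn
tree'

first_w=$(printf '%s\n' "$words" | count_at 1 w)
third_r=$(printf '%s\n' "$words" | count_at 3 r)
printf 'first char w: %s, third char r: %s\n' "$first_w" "$third_r"
# Prints: first char w: 3, third char r: 4
```

On this toy list, rejecting on the third character would throw away more words than rejecting on the first, so the answer really does depend on the data.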

sycamorex · 11-21-2011, 03:25 PM

Could you send some sample data? It might help.

Linux_Kidd · 11-21-2011, 03:53 PM

It's hard to give the negative, but let's try.

"(X) Happy" or "(X)Happy" is noise in my files, but I want to find a pattern that is the boolean equivalent of this:
"(X)" followed by NOT "Happy", OR "(X)" followed by a single space followed by NOT "Happy"

It's a PITA w/o lookarounds, so the only way I see is to build a regex that gives the smallest exclusion set.

sample file

Code:

(X)Happy (X) Happy
(X)Trumpet
(X)Hotcakes
(X) Hamper (X)Happy
(X)Rockets
(X)Apple
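
One common workaround when lookarounds are unavailable is to spell out every way the text after "(X)" can diverge from "Happy", one alternative per prefix. The sketch below applies that idea to the sample file above; the pattern is an illustration, not something proposed in the thread, and `LC_ALL=C` is only there to keep the bracket expressions locale-independent.

```shell
sample='(X)Happy (X) Happy
(X)Trumpet
(X)Hotcakes
(X) Hamper (X)Happy
(X)Rockets
(X)Apple'

# After "(X)" and an optional space, accept anything that stops being
# "Happy" at the 1st, 2nd, 3rd, 4th or 5th character.
# The space inside [^H ] keeps "(X) Happy" from matching with the optional
# space skipped.
pattern='\(X\) ?([^H ]|H[^a]|Ha[^p]|Hap[^p]|Happ[^y])'

hits=$(printf '%s\n' "$sample" | LC_ALL=C grep -E "$pattern")
printf '%s\n' "$hits"
# Prints every line except the first one:
#   (X)Trumpet
#   (X)Hotcakes
#   (X) Hamper (X)Happy
#   (X)Rockets
#   (X)Apple
```

Two limits of this sketch are worth keeping in mind: it still rejects any longer word that merely begins with "Happy", and a partial prefix such as "(X)Ha" at the very end of a line does not match because each alternative needs one more character.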

Cedrik · 11-21-2011, 04:08 PM

Why not just do:

Code:

grep -vE '\(X\).?Happy' <file>

sycamorex · 11-21-2011, 05:43 PM

Quote:

Originally Posted by Cedrik

Why not just do:

Code:

grep -vE '\(X\).?Happy' <file>

Well, it depends on the whole sample data.

With -v, that would also print:
- lines with (X)  word (2 or more spaces after (X))
- and lines NOT starting with (X)

...if such lines exist.

Is the actual data the OP provided accurate and representative of the whole file?

grail · 11-21-2011, 07:04 PM

I am with Cedrik, except that the dot should simply be a space. Until we are given reasons why it is not acceptable, it does answer the present question:

Code:

grep -vE '\(X\) ?Happy' file
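
Because -v filters whole lines, it answers a slightly different question from "find (X) followed by something other than Happy". The sketch below runs it on the earlier sample file plus one made-up line without any marker. The parentheses are escaped because in ERE an unescaped `(X)` is a group that matches a bare X.

```shell
sample='(X)Happy (X) Happy
(X)Trumpet
(X)Hotcakes
(X) Hamper (X)Happy
(X)Rockets
(X)Apple
no marker on this line'

# Keep every line that does not contain the noise.
kept=$(printf '%s\n' "$sample" | LC_ALL=C grep -vE '\(X\) ?Happy')
printf '%s\n' "$kept"
# Prints:
#   (X)Trumpet
#   (X)Hotcakes
#   (X)Rockets
#   (X)Apple
#   no marker on this line
```

The line "(X) Hamper (X)Happy" is dropped even though it contains a wanted hit, and the line with no "(X)" at all is kept.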

Linux_Kidd · 11-21-2011, 07:26 PM

I don't want lines that contain neither the noise nor the hits I'm after.

Let me try to clarify.

The search tool is a "grep -E" equivalent, so I do not have the -v option or the ability to pipe, etc.

Code:

grep searches (X)Happy the named input FILEs 
(or (X) Frown standard input if no files are 
named, or if a (X)puppy single hyphen-minus 
(-) is (X) Happy given as file name) for lines 
containing a match to the (X)Happy given PATTERN. 
By default, grep prints the matching lines.
(X)Happy
In addition, two variant programs egrep 
and fgrep are available. (X) Happy egrep is the 
same as (X)Chuck grep -E. fgrep is the same as 
grep -F. Direct (X) Pencil invocation as either 
egrep or (X)Happy fgrep is deprecated, but is 
provided to allow (X) Happy historical applications 
that rely on them (X) Denny to run unmodified.

noise = "(X)Happy" or "(X) Happy"
hit = "(X)[space][word]" or "(X)[word]"

The sample file above has 14 lines.
If I had lookarounds I would get:
1 no match
2 match for "(X) Frown"
3 match for "(X)puppy"
4 no match
5 no match
6 no match
7 no match
8 no match
9 no match
10 match for "(X)Chuck"
11 match for "(X) Pencil"
12 no match
13 no match
14 match for "(X) Denny"

So w/o lookarounds, using "grep -E 'regex' file", the only way I see is to build an exclusion set, whose size will vary depending on the actual word to be excluded and the analytics of words.

So in this example I use something like this:

Code:

'\(X\)([ ][A-Za-z]{4}[^y]|[A-Za-z]{4}[^y])'

which I think I can reduce to:

Code:

'\(X\)[ ]?[A-Za-z]{4}[^y]'

and maybe even down to:

Code:

'\(X\) ?[A-Za-z]{4}[^y]'

this problem makes for a good exam question...
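
Since the noise word may change, the prefix-by-prefix pattern can be generated instead of typed by hand. The sketch below assumes a hypothetical helper `not_word_ere` (not from the thread) and a noise word made only of letters and digits. It compares the generated pattern with the fixed-position pattern from the post above, written here with an explicit `[A-Za-z]` class and `LC_ALL=C` so the result does not depend on the locale.

```shell
# The 14-line sample from the post above.
sample='grep searches (X)Happy the named input FILEs
(or (X) Frown standard input if no files are
named, or if a (X)puppy single hyphen-minus
(-) is (X) Happy given as file name) for lines
containing a match to the (X)Happy given PATTERN.
By default, grep prints the matching lines.
(X)Happy
In addition, two variant programs egrep
and fgrep are available. (X) Happy egrep is the
same as (X)Chuck grep -E. fgrep is the same as
grep -F. Direct (X) Pencil invocation as either
egrep or (X)Happy fgrep is deprecated, but is
provided to allow (X) Happy historical applications
that rely on them (X) Denny to run unmodified.'

# not_word_ere WORD (hypothetical helper): print an ERE that matches "(X)",
# an optional space, and then text that does not start with WORD.
# Assumes WORD contains only letters and digits, so nothing needs escaping.
not_word_ere() (
    word=$1 prefix= alts=
    while [ -n "$word" ]; do
        rest=${word#?}          # WORD without its first character
        ch=${word%"$rest"}      # the first character itself
        if [ -z "$prefix" ]; then
            alt="[^${ch} ]"     # first position: also reject the optional space
        else
            alt="${prefix}[^${ch}]"
        fi
        alts="${alts:+$alts|}$alt"
        prefix=$prefix$ch
        word=$rest
    done
    printf '%s\n' "\\(X\\) ?($alts)"
)

# Print the numbers of the matching lines, space-separated.
line_numbers() { LC_ALL=C grep -nE "$1" | cut -d: -f1 | paste -sd ' ' -; }

pattern=$(not_word_ere Happy)
exact=$(printf '%s\n' "$sample" | line_numbers "$pattern")
fixed=$(printf '%s\n' "$sample" | line_numbers '\(X\) ?[A-Za-z]{4}[^y]')

printf 'generated pattern:      %s\n' "$pattern"
printf 'prefix enumeration:     %s\n' "$exact"   # 2 3 10 11 14
printf 'fixed-position pattern: %s\n' "$fixed"   # 2 10 11
```

On this sample the generated pattern hits the same five lines the post lists as the lookaround result, while the fixed-position pattern loses "(X)puppy" and "(X) Denny" because their fifth character is "y".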