LinuxQuestions.org - Yet another regex problem

without lookarounds, how would you write a regex that eliminates as much noise as possible? i have to use grep -E.

i am looking for:
(X)anything
or
(X) anything

but want to exclude noise that matches (or exclude as much as possible)
(X)word
(X) word

You might need to explain more ... what is the difference between 'word' and 'anything'?

"anything" is any random string
"word" is one specific string

grep'ing for pattern

Code:

"(X)anything" 

or

"(X) anything"

but wish to exclude from that set

Code:

"(X)word" 

or

"(X) word"

as example:

Code:

\(X\)[^w]

will exclude "(X)word" (along with anything that starts with "w"), but will still match "(X)█word" (the block means space for visual, etc)
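A quick way to see this, as a sketch only: the three sample lines below are made up for illustration and are not from a real file.

```shell
# Feed three candidate lines to the pattern and keep whatever it matches.
# "(X)word" is dropped because the character after "(X)" is "w";
# "(X) word" survives because the character after "(X)" is a space.
out=$(printf '%s\n' '(X)word' '(X) word' '(X)apple' | grep -E '\(X\)[^w]')
printf '%s\n' "$out"
```

So the space variant of the noise gets through, while every wanted string starting with "w" is lost.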

a regex that eliminates more than desired is ok, just wish to minimize the exclusion set, etc. its a pita w/o lookarounds, just seeing what you guys might suggest.

This is what I came up with:

Code:

grep -E "\(X\)[[:blank:]]?[^ w]" file

I don't think it's what you're after as it'll exclude other words starting with "w"

To eliminate the exact word "word":

Code:

grep -E "\(X\)[[:blank:]]?[^ ]" file | grep -v "word"

sycamorex,
excluding all words that begin with "w" produces an exclusion set that is far too big. i was trying to make that exclusion set as small as possible. i can also only use a single regex with grep -E (no piping or posix char sets available, etc)

i came up with this:

Code:

\(X\)([ ][a-z]{2}[^r]|[a-z]{2}[^r])

i am not 100% sure whether the set of words that have "r" as 3rd char is larger or smaller than the set of words that have "w" as 1st char. i guess i would need to find some word analytics and choose the not-char position that produces the smallest exclusion set. this method is a pita, especially when the "word" changes, etc.
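The "word analytics" idea can be sketched with a small shell function. This is only an illustration: `count_shared` is a hypothetical helper name, and it assumes a word list with one word per line on standard input.

```shell
# count_shared WORD
# For each position in WORD, count how many input words have the same
# character at that position (case-sensitive). Output: position, char, count.
count_shared() {
    awk -v w="$1" '
        {
            for (i = 1; i <= length(w); i++)
                if (substr($0, i, 1) == substr(w, i, 1))
                    n[i]++
        }
        END {
            for (i = 1; i <= length(w); i++)
                printf "%d %s %d\n", i, substr(w, i, 1), n[i] + 0
        }
    '
}
```

Usage would be something like `count_shared word < /usr/share/dict/words`, where the path is an assumption about the system. The position with the lowest count is the one whose single not-char test throws away the fewest other words, so it is only a rough guide for picking the position.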

Could you post some sample data? It might help.

its hard to give the negative, but lets try.

"(X) Happy" or "(X)Happy" is noise in my files, but i want to find pattern that is equiv boolean to this:
"(X)" followed by NOT "Happy", OR, "(X)" followed by "single space" followed by NOT "Happy"

its a pita w/o lookarounds, so the only way i see is to build regex that gives smallest exclusion set.

sample file

Code:

(X)Happy (X) Happy

(X)Trumpet

(X)Hotcakes

(X) Hamper (X)Happy

(X)Rockets

(X)Apple

Why not just do:

Code:

grep -vE '\(X\).?Happy' <file>

Quote:

Originally Posted by Cedrik (Post 4530067)

Why not just do:

Code:

grep -vE '\(X\).?Happy' <file>

Well, it depends on the whole sample data.

With -v, that command also prints:
- lines with 2 or more spaces between (X) and the word
- lines that do not contain (X) at all

...if such lines exist.

Is the actual data the OP provided accurate and representative of the whole file?

I am with Cedrik, except that the dot should simply be a space. Until we are given a reason why it is not acceptable, it does answer the question as asked:

Code:

grep -vE '\(X\) ?Happy' file
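One thing to watch with patterns like this under -E: unescaped parentheses form a group, so `(X)` matches a bare X, while the literal text needs `\(X\)`. A minimal sketch on two made-up lines, shown without -v so the matches themselves are visible:

```shell
# Two made-up lines: one with a bare X, one with the literal "(X)".
sample=$(printf '%s\n' 'X Happy' '(X) Happy')

# Unescaped: (X) is a group, so the pattern means "X, optional space, Happy".
grouped=$(printf '%s\n' "$sample" | grep -E '(X) ?Happy')

# Escaped: \(X\) matches the three literal characters "(X)".
literal=$(printf '%s\n' "$sample" | grep -E '\(X\) ?Happy')

printf 'grouped: %s\nliteral: %s\n' "$grouped" "$literal"
```

The grouped form only finds "X Happy", because after the X it meets ")" instead of a space or "Happy". The escaped form finds "(X) Happy".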

i don't want the lines that contain neither the noise nor what i'm looking for.

let me try and clarify.

the search tool is a "grep -E" equivalent, so i do not have -v option, or ability to pipe, etc.

Code:

grep searches (X)Happy the named input FILEs
(or (X) Frown standard input if no files are
named, or if a (X)puppy single hyphen-minus
(-) is (X) Happy given as file name) for lines
containing a match to the (X)Happy given PATTERN.
By default, grep prints the matching lines.
(X)Happy
In addition, two variant programs egrep
and fgrep are available. (X) Happy egrep is the
same as (X)Chuck grep -E. fgrep is the same as
grep -F. Direct (X) Pencil invocation as either
egrep or (X)Happy fgrep is deprecated, but is
provided to allow (X) Happy historical applications
that rely on them (X) Denny to run unmodified.

noise = "(X)Happy" or "(X) Happy"
hit is "(X)[space][word]" or "(X)[word]"

the sample file above has 14 lines.
if i had lookarounds i would get:
1 no match
2 match for "(X) Frown"
3 match for "(X)puppy"
4 no match
5 no match
6 no match
7 no match
8 no match
9 no match
10 match for "(X)Chuck"
11 match for "(X) Pencil"
12 no match
13 no match
14 match for "(X) Denny"

so w/o lookarounds using "grep -E '/regex/' file" i only see a way to build an exclusion set which will vary in size depending on the actual word to be excluded and the analytics of words.

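Code:

so in this example i use something like this:

'\(X\)([ ][A-Za-z]{4}[^y]|[A-Za-z]{4}[^y])'

which i think i can reduce to:

'\(X\)[ ]?[A-Za-z]{4}[^y]'

and maybe even down to:

'\(X\) ?[A-Za-z]{4}[^y]'

(i use [A-Za-z] because the sample words are capitalized and [a-z] alone would skip them in the C locale. the cost of this pattern: any hit whose 5th letter is "y" is lost too, e.g. "(X)puppy" and "(X) Denny" in the sample, and so is any word shorter than 4 letters.)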

this problem makes for a good exam question...
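One further option, offered as a sketch and not something anyone in the thread proposed: spell out the complement of the noise word position by position, so that the only strings excluded are the ones that really start with "Happy". The names `pattern` and `find_hits` are hypothetical, and the pattern uses only plain ERE syntax (no lookarounds, no posix char sets).

```shell
# Each alternative accepts text that diverges from "Happy" at one position.
# The first alternative also rejects a space, so "(X) Happy" cannot slip in
# by leaving the optional space unmatched.
pattern='\(X\) ?([^ H]|H[^a]|Ha[^p]|Hap[^p]|Happ[^y])'

# find_hits [FILE...]
# Print the lines that contain at least one "(X)word" or "(X) word" hit
# where the word does not start with "Happy". Reads stdin if no file is given.
find_hits() {
    LC_ALL=C grep -E "$pattern" "$@"
}
```

On the 14-line sample this should report lines 2, 3, 10, 11 and 14. Known gaps, as far as i can tell: a hit cut off at the end of a line (for example a line ending in "(X)Hap") is missed because each alternative needs one more character, and the alternation has to be rebuilt by hand whenever the noise word changes.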