LinuxQuestions.org


Linux_Kidd 11-21-2011 11:37 AM

Yet another regex problem
 
without lookarounds, how would you write a regex that eliminates as much noise as possible? i have to use grep -E.

i am looking for:
(X)anything
or
(X) anything

but want to exclude the noise that also matches (or exclude as much of it as possible):
(X)word
(X) word

grail 11-21-2011 11:44 AM

You might need to explain more ... what is the difference between 'word' and 'anything'?

Linux_Kidd 11-21-2011 12:08 PM

"anything" is random string
"word" is specific string

grep'ing for pattern
Code:

"(X)anything"
or
"(X) anything"

but wish to exclude from that set
Code:

"(X)word"
or
"(X) word"

as example:
Code:

\(X\)[^w]
will exclude "(X)word" (along with anything that starts with "w"), but will still match "(X)█word" (the block means space for visual, etc)

a regex that eliminates more than desired is ok, i just wish to minimize the exclusion set. it's a pita w/o lookarounds, just seeing what you guys might suggest.
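A quick way to see both effects described above, as a minimal sketch. The four input strings are made up for illustration, and "water" stands in for any unrelated word that starts with "w".

```shell
# Hypothetical sample input: the noise with and without a space, one
# unrelated word starting with "w", and one ordinary hit.
sample='(X)word
(X) word
(X)water
(X)apple'

# [^w] only inspects the single character right after "(X)", so a
# leading space is enough to let "(X) word" through.
matches=$(printf '%s\n' "$sample" | grep -E '\(X\)[^w]')
printf '%s\n' "$matches"
```

The output keeps "(X) word" (unwanted) and drops "(X)water" (wanted), which is the oversized exclusion set being discussed.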

sycamorex 11-21-2011 02:22 PM

This is what I came up with:
Code:

grep -E  "\(X\)[[:blank:]]?[^ w]" file
I don't think it's what you're after as it'll exclude other words starting with "w"



To eliminate the exact word "word":
Code:

grep -E  "\(X\)[[:blank:]]?[^ ]" file | grep -v "word"

Linux_Kidd 11-21-2011 03:16 PM

sycamorex,
excluding all words that begin with "w" produces an exclusion set that is too big. i was trying to make that exclusion set as small as possible. i can also only use a single regex with grep -E (no piping or posix char classes available, etc)

i came up with this:
Code:

\(X\)([ ][a-z]{2}[^r]|[a-z]{2}[^r])
i am not 100% sure whether the set of words that have "r" as 3rd char is larger or smaller than the set of words that have "w" as 1st char. i guess i would need to find some word analytics and choose the not-char location that would produce the smallest set. this method is a pita, especially when the "word" changes, etc.
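One possible way to shrink the exclusion set without lookarounds is to negate the word one prefix at a time: "not w", or "w then not o", or "wo then not r", and so on. This technique is not proposed anywhere in the thread, so treat the following as a sketch under assumptions. The helper name `not_word_ere` is hypothetical, and it assumes the word contains only letters or digits (no regex metacharacters).

```shell
# Hypothetical helper: print an ERE alternation that matches text NOT
# beginning with the literal word $1. Assumes $1 holds only letters or
# digits (no regex metacharacters).
not_word_ere() {
    rest=$1 prefix='' alts=''
    while [ -n "$rest" ]; do
        after=${rest#?}         # the remaining word minus its first character
        ch=${rest%"$after"}     # that first character
        if [ -z "$prefix" ]; then
            # Position 1 also rejects a space, so the optional space in
            # front of the group cannot be skipped to sneak a match in.
            alts="[^${ch} ]"
        else
            alts="${alts}|${prefix}[^${ch}]"
        fi
        prefix=${prefix}${ch}
        rest=$after
    done
    printf '%s\n' "$alts"
}

pattern="\\(X\\) ?($(not_word_ere word))"
printf '%s\n' "$pattern"    # \(X\) ?([^w ]|w[^o]|wo[^r]|wor[^d])

# Made-up input for illustration.
sample='(X)word
(X) word
(X)water
(X) wombat
(X)apple'
result=$(printf '%s\n' "$sample" | grep -E "$pattern")
printf '%s\n' "$result"
```

Because the pattern is generated, changing the word only means changing the argument. Known gaps in this sketch: a string that is cut off by the end of the line while still a prefix of the word (for example "(X)wo" at end of line) is not matched, longer words that begin with the excluded word (for example "wordy") are excluded as well, and "(X)" followed by two or more spaces is not matched.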

sycamorex 11-21-2011 03:25 PM

Could you post some sample data? It might help.

Linux_Kidd 11-21-2011 03:53 PM

it's hard to give the negative, but let's try.

"(X) Happy" or "(X)Happy" is noise in my files, but i want to find a pattern that is the boolean equivalent of this:
"(X)" followed by NOT "Happy", OR, "(X)" followed by a single space followed by NOT "Happy"

it's a pita w/o lookarounds, so the only way i see is to build a regex that gives the smallest exclusion set.

sample file
Code:

(X)Happy (X) Happy
(X)Trumpet
(X)Hotcakes
(X) Hamper (X)Happy
(X)Rockets
(X)Apple


Cedrik 11-21-2011 04:08 PM

Why not just do:
Code:

grep -vE '\(X\).?Happy' <file>

sycamorex 11-21-2011 05:43 PM

Quote:

Originally Posted by Cedrik (Post 4530067)
Why not just do:
Code:

grep -vE '\(X\).?Happy' <file>

Well, it depends on the whole sample data.

With -v, the output also includes:
- (X)  word (2 or more spaces after (X))
- and lines NOT starting with (X)

...if such lines exist.

Is the actual data the OP provided accurate and representative of the whole file?

grail 11-21-2011 07:04 PM

I am with Cedrik, except that the dot should simply be a space. Until we are given reasons why it is not acceptable, it does answer the question as posed:
Code:

grep -vE '\(X\) ?Happy' file
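A sketch of what inverting the match does to the six-line sample posted earlier in the thread. Two details here are assumptions rather than anything stated by the posters: the sample is held in a shell variable instead of a file, and the parentheses are escaped, because in ERE a bare (X) is a group that matches only the letter X.

```shell
sample='(X)Happy (X) Happy
(X)Trumpet
(X)Hotcakes
(X) Hamper (X)Happy
(X)Rockets
(X)Apple'

# -v drops every LINE that contains the noise, including the hit
# "(X) Hamper" that shares its line with "(X)Happy".
kept=$(printf '%s\n' "$sample" | grep -vE '\(X\) ?Happy')
printf '%s\n' "$kept"

# With bare parentheses the pattern means "X, optional space, Happy".
# In the sample X is always followed by ")", so nothing matches and
# -v removes nothing.
unescaped=$(printf '%s\n' "$sample" | grep -vE '(X) ?Happy')
printf '%s\n' "$unescaped"
```

The first command prints only the Trumpet, Hotcakes, Rockets and Apple lines. That is fine when hits and noise never share a line, but it works per line, not per occurrence, and it would also print any line that contains no "(X)" at all.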

Linux_Kidd 11-21-2011 07:26 PM

i don't want the lines that merely lack the noise; that is not the same as the lines that contain my wanted hits.

let me try and clarify.

the search tool is a "grep -E" equivalent, so i do not have -v option, or ability to pipe, etc.
Code:

grep searches (X)Happy the named input FILEs
(or (X) Frown standard input if no files are
named, or if a (X)puppy single hyphen-minus
(-) is (X) Happy given as file name) for lines
containing a match to the (X)Happy given PATTERN.
By default, grep prints the matching lines.
(X)Happy
In addition, two variant programs egrep
and fgrep are available. (X) Happy egrep is the
same as (X)Chuck grep -E. fgrep is the same as
grep -F. Direct (X) Pencil invocation as either
egrep or (X)Happy fgrep is deprecated, but is
provided to allow (X) Happy historical applications
that rely on them (X) Denny to run unmodified.

noise = "(X)Happy" or "(X) Happy"
hit is "(X)[space][word]" or "(X)[word]"

the sample file above has 14 lines.
if i had lookarounds i would get:
1 no match
2 match for "(X) Frown"
3 match for "(X)puppy"
4 no match
5 no match
6 no match
7 no match
8 no match
9 no match
10 match for "(X)Chuck"
11 match for "(X) Pencil"
12 no match
13 no match
14 match for "(X) Denny"

so w/o lookarounds using "grep -E '/regex/' file" i only see a way to build an exclusion set which will vary in size depending on the actual word to be excluded and the analytics of words.
Code:

so in this example i use something like this:
'\(X\)([ ][a-z]{4}[^y]|[a-z]{4}[^y])'
which i think i can reduce to:
'\(X\)[ ]?[a-z]{4}[^y]'
and maybe even down to:
'\(X\) ?[a-z]{4}[^y]'
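To check a candidate pattern against the 14-line sample above, something like the following could be used. It is a sketch with two assumptions. First, the last posted pattern is run with -i so that [a-z] also covers the capitalised sample words; the thread does not say whether the search tool is case-sensitive. Second, it is compared with a prefix-by-prefix negation of "Happy" (one alternative per leading part of the word), which is not a pattern taken from the thread.

```shell
sample='grep searches (X)Happy the named input FILEs
(or (X) Frown standard input if no files are
named, or if a (X)puppy single hyphen-minus
(-) is (X) Happy given as file name) for lines
containing a match to the (X)Happy given PATTERN.
By default, grep prints the matching lines.
(X)Happy
In addition, two variant programs egrep
and fgrep are available. (X) Happy egrep is the
same as (X)Chuck grep -E. fgrep is the same as
grep -F. Direct (X) Pencil invocation as either
egrep or (X)Happy fgrep is deprecated, but is
provided to allow (X) Happy historical applications
that rely on them (X) Denny to run unmodified.'

# Numbers of the lines matched by the last pattern posted above
# (-i added as an assumption, see the text before this block).
posted_hits=$(printf '%s\n' "$sample" \
    | grep -inE '\(X\) ?[a-z]{4}[^y]' | cut -d: -f1 | paste -s -d ' ' -)

# Alternative: reject "Happy" one prefix at a time. The first class
# also rejects a space so the optional space cannot be skipped.
alt_hits=$(printf '%s\n' "$sample" \
    | grep -nE '\(X\) ?([^H ]|H[^a]|Ha[^p]|Hap[^p]|Happ[^y])' \
    | cut -d: -f1 | paste -s -d ' ' -)

echo "posted pattern (with -i): $posted_hits"
echo "prefix alternation:       $alt_hits"
```

On this sample the posted pattern reports lines 2, 10 and 11; it misses "(X)puppy" and "(X) Denny" because their fifth letter is "y". The prefix alternation reports lines 2, 3, 10, 11 and 14, the same lines listed above as the lookaround result.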

this problem makes for a good exam question...


All times are GMT -5.