LinuxQuestions.org
Old 11-21-2011, 11:37 AM   #1
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Rep: Reputation: 78
Yet another regex problem


Without lookarounds, how would you write a regex that eliminates as much noise as possible? I have to use grep -E.

I am looking for:
(X)anything
or
(X) anything

but I want to exclude the noise below (or as much of it as possible):
(X)word
(X) word
 
Old 11-21-2011, 11:44 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
You might need to explain more ... what is the difference between 'word' and 'anything'?
 
Old 11-21-2011, 12:08 PM   #3
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
"anything" is random string
"word" is specific string

I am grepping for the pattern
Code:
"(X)anything" 
or
"(X) anything"
but wish to exclude from that set
Code:
"(X)word" 
or
"(X) word"
As an example:
Code:
\(X\)[^w]
This excludes "(X)word" (along with anything else that starts with "w"), but it still matches "(X)█word" (the block stands for a space, shown visibly).

A regex that excludes more than desired is OK; I just want to keep the exclusion set as small as possible. It's a pita w/o lookarounds, so I'm just seeing what you guys might suggest.

Last edited by Linux_Kidd; 11-21-2011 at 12:52 PM.
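
To see what the `\(X\)[^w]` attempt does in practice, here is a minimal sketch. The four sample lines are made up for illustration and are not from the OP's data.

```shell
# Made-up sample lines (an assumption, not the OP's real file).
sample='(X)word
(X) word
(X)water
(X)apple'

# [^w] only inspects the single character right after "(X)",
# so a space counts as "not w" and "(X) word" slips through.
matches=$(printf '%s\n' "$sample" | grep -E '\(X\)[^w]')
printf '%s\n' "$matches"
```

The output keeps "(X) word" (noise that got through) and "(X)apple", while "(X)water" is lost as collateral damage along with "(X)word".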
 
Old 11-21-2011, 02:22 PM   #4
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
This is what I came up with:
Code:
grep -E  "\(X\)[[:blank:]]?[^ w]" file
I don't think it's what you're after as it'll exclude other words starting with "w"



To eliminate the exact word "word":
Code:
grep -E  "\(X\)[[:blank:]]?[^ ]" file | grep -v "word"
 
Old 11-21-2011, 03:16 PM   #5
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
sycamorex,
Excluding all words that begin with "w" produces too large an exclusion set; I was trying to make that set as small as possible. I am also limited to a single regex with grep -E (no piping, no POSIX character classes, etc.).

I came up with this:
Code:
\(X\)([ ][a-z]{2}[^r]|[a-z]{2}[^r])
I am not 100% sure whether the set of words with "r" as the 3rd character is larger or smaller than the set of words with "w" as the 1st character. I guess I would need some word-frequency statistics to choose the negated character position that produces the smallest exclusion set. This method is a pita, especially when the "word" changes.

Last edited by Linux_Kidd; 11-21-2011 at 03:47 PM.
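
A quick way to get a feel for the exclusion set of the third-character pattern above is to run it over a few test words. This is only a sketch: the sample words are invented, and `LC_ALL=C` is set as an assumption so that `[a-z]` means plain lowercase ASCII.

```shell
pattern='\(X\)([ ][a-z]{2}[^r]|[a-z]{2}[^r])'

# Invented test lines: two noise lines, one collateral loss, two real hits.
sample='(X)word
(X) word
(X)cork
(X)wolf
(X) tree'

# Only lines whose third letter after "(X)" (or "(X) ") is not "r" survive.
matches=$(printf '%s\n' "$sample" | LC_ALL=C grep -E "$pattern")
printf '%s\n' "$matches"
```

Both forms of the noise are rejected, but so is "(X)cork", because its third letter is also "r". Only "(X)wolf" and "(X) tree" are printed.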
 
Old 11-21-2011, 03:25 PM   #6
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Could you post some sample data? It might help.
 
Old 11-21-2011, 03:53 PM   #7
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
It's hard to give the negative, but let's try.

"(X) Happy" or "(X)Happy" is noise in my files, but I want a pattern that is the boolean equivalent of this:
"(X)" followed by NOT "Happy", OR "(X)" followed by a single space followed by NOT "Happy"

It's a pita w/o lookarounds, so the only way I see is to build a regex that gives the smallest exclusion set.

Sample file:
Code:
(X)Happy (X) Happy
(X)Trumpet
(X)Hotcakes
(X) Hamper (X)Happy
(X)Rockets
(X)Apple

Last edited by Linux_Kidd; 11-21-2011 at 03:55 PM.
 
Old 11-21-2011, 04:08 PM   #8
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 244
Why not just do:
Code:
grep -vE '\(X\).?Happy' <file>
 
Old 11-21-2011, 05:43 PM   #9
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by Cedrik
Why not just do:
Code:
grep -vE '\(X\).?Happy' <file>
Well, it depends on the whole sample data.

With -v, the output also includes:
- lines with (X)  word (2 or more spaces after (X))
- lines that do not contain (X) at all

...if such lines exist.

Is the data the OP provided accurate and representative of the whole file?
 
Old 11-21-2011, 07:04 PM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
I am with Cedrik, except that the dot should simply be a space. Until we are given reasons why it is not acceptable, it does answer the present question:
Code:
grep -vE '\(X\) ?Happy' file
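
One detail that is easy to trip over with -E patterns like the ones in this thread: in extended regex syntax, unescaped parentheses form a group, so `(X)` matches just the letter X, while `\(X\)` matches the literal text "(X)". A small check, using a hypothetical helper named `try` that is not part of grep:

```shell
# try PATTERN LINE: hypothetical helper, prints "match" or "no match".
try() {
  if printf '%s\n' "$2" | grep -qE "$1"; then
    echo 'match'
  else
    echo 'no match'
  fi
}

unescaped=$(try '(X) ?Happy' '(X) Happy')    # group: wants "X Happy" or "XHappy"
escaped=$(try '\(X\) ?Happy' '(X) Happy')    # literal parentheses
bare=$(try '(X) ?Happy' 'X Happy')           # what the group form really matches

echo "unescaped: $unescaped"
echo "escaped:   $escaped"
echo "bare X:    $bare"
```

The unescaped form does not match "(X) Happy" at all, so with -v it would filter nothing out of such lines.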
 
Old 11-21-2011, 07:26 PM   #11
Linux_Kidd
Member
 
Registered: Jan 2006
Location: USA
Posts: 737

Original Poster
Rep: Reputation: 78
I don't want lines that merely lack the noise; I want only the lines that contain my hits.

Let me try to clarify.

The search tool is a "grep -E" equivalent, so I do not have the -v option or the ability to pipe, etc.
Code:
grep searches (X)Happy the named input FILEs 
(or (X) Frown standard input if no files are 
named, or if a (X)puppy single hyphen-minus 
(-) is (X) Happy given as file name) for lines 
containing a match to the (X)Happy given PATTERN. 
By default, grep prints the matching lines.
(X)Happy
In addition, two variant programs egrep 
and fgrep are available. (X) Happy egrep is the 
same as (X)Chuck grep -E. fgrep is the same as 
grep -F. Direct (X) Pencil invocation as either 
egrep or (X)Happy fgrep is deprecated, but is 
provided to allow (X) Happy historical applications 
that rely on them (X) Denny to run unmodified.
noise = "(X)Happy" or "(X) Happy"
hit = "(X)[space][word]" or "(X)[word]"

The sample file above has 14 lines.
If I had lookarounds I would get:
1 no match
2 match for "(X) Frown"
3 match for "(X)puppy"
4 no match
5 no match
6 no match
7 no match
8 no match
9 no match
10 match for "(X)Chuck"
11 match for "(X) Pencil"
12 no match
13 no match
14 match for "(X) Denny"

So w/o lookarounds, using "grep -E 'regex' file", the only way I see is to build an exclusion set, whose size will vary depending on the actual word to be excluded and on word statistics.
Code:
so in this example I use something like this:
'\(X\)([ ][A-Za-z]{4}[^y]|[A-Za-z]{4}[^y])'
which I think I can reduce to:
'\(X\)[ ]?[A-Za-z]{4}[^y]'
and maybe even down to:
'\(X\) ?[A-Za-z]{4}[^y]'
This problem makes for a good exam question...

Last edited by Linux_Kidd; 11-22-2011 at 07:38 AM.
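
A different technique, not proposed by anyone in the thread, is to spell out "not followed by Happy" as an alternation that rejects the word one prefix at a time. It needs only a single `grep -E` pattern, with no lookarounds, pipes, or POSIX classes. The sketch below runs it against the 14-line sample from the post above; the temp file and variable names are just for the demo.

```shell
# After "(X)" and an optional space, accept text that diverges from "Happy"
# at the 1st, 2nd, 3rd, 4th, or 5th character. The space in [^H ] stops
# "(X) Happy" from matching by skipping the optional space.
pattern='\(X\) ?([^H ]|H[^a]|Ha[^p]|Hap[^p]|Happ[^y])'

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
grep searches (X)Happy the named input FILEs
(or (X) Frown standard input if no files are
named, or if a (X)puppy single hyphen-minus
(-) is (X) Happy given as file name) for lines
containing a match to the (X)Happy given PATTERN.
By default, grep prints the matching lines.
(X)Happy
In addition, two variant programs egrep
and fgrep are available. (X) Happy egrep is the
same as (X)Chuck grep -E. fgrep is the same as
grep -F. Direct (X) Pencil invocation as either
egrep or (X)Happy fgrep is deprecated, but is
provided to allow (X) Happy historical applications
that rely on them (X) Denny to run unmodified.
EOF

# Collect the line numbers of matching lines on one row.
hits=$(grep -nE "$pattern" "$tmp" | cut -d: -f1 | paste -s -d ' ' -)
rm -f "$tmp"
echo "lines with hits: $hits"
```

On this sample it reports lines 2, 3, 10, 11 and 14, the same set listed for the lookaround case, including "(X)puppy" and "(X) Denny", which a fixed "5th character is not y" test would drop. Known gaps in this sketch: a hit made of a leading piece of "Happy" sitting at the very end of a line (for example "(X)Hap") is missed, because every branch needs one more character to inspect, and the pattern grows with the length of the excluded word.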
 
  



Tags
regex, regular expression

