[SOLVED] Elimination of lines having fewer than 3 words

danielbmartin · 08-19-2012, 07:44 PM

Have:

Code:

how now
now is the time for
now
 
  holy  cow  
  the quick   brown fox   
 jumped over
the candlestick

Want:

Code:

now is the time for
  the quick   brown fox

I've been fumbling with variations on this...

Code:

sed -r '/\w{3}.+/p' $InFile

... without success.

Please advise.

Daniel B. Martin
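
A minimal sketch of why the sed attempt above misbehaves, assuming GNU sed (for -r and \w). Without -n, sed prints every line anyway and `p` merely prints matching lines a second time. In addition, `\w{3}` asks for three word characters in a row, not three words. The `sample` helper and its input lines are hypothetical, chosen only for illustration.

```shell
# Hypothetical sample input (not the poster's file).
sample() {
    printf '%s\n' 'how now' 'now is the time for' 'a b c'
}

# Without -n: all lines are printed, and matching lines appear twice.
sample | sed -r '/\w{3}.+/p'

# With -n: only matching lines appear, but the test is still
# "three word characters in a row, then something", not "three words".
# 'how now' is kept (two words) and 'a b c' is dropped (three words).
sample | sed -rn '/\w{3}.+/p'
```

So even with -n added, this pattern selects on word length rather than word count.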

byannoni · 08-19-2012, 07:50 PM

This is off the top of my head; if it doesn't work for you, I'll be happy to develop it further:

Code:

awk -F'\\s*' 'NF > 4'

Edit:
Actually, this works better:

Code:

perl -ne 'print $_ if $_ =~ /\s*(?:\w+\s+){2,}\w+/'

Here is an equivalent awk for the Perl:

Code:

awk '$0 ~ /\s*(\w+\s+){2,}\w+/'

lyle_s · 08-19-2012, 08:09 PM

Here's what I had in mind:

Code:

#!/bin/bash

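# Read each line into $REPLY; -r keeps backslashes literal.
while read -r
do
        # wc -w counts whitespace-separated words on the line.
        if [ "$(echo "$REPLY" | wc -w)" -ge 3 ]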
        then
                echo "$REPLY"
        fi
done

Code:

lyle@bowman:~/programming/sh$ ./lines < words.test 
now is the time for
  the quick   brown fox   
one two three

I added a line with 3 words to your sample data.

No awk/sed fanciness, though.

Lyle.

danielbmartin · 08-19-2012, 09:07 PM

This didn't work ...

Code:

awk -F'\\s*' 'NF > 4'

... but you put me on the right track.
This does the job nicely ...

Code:

awk 'NF > 2'

Thank you.

Daniel B. Martin
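
A short sketch of why `awk 'NF > 2'` does the job: with the default field separator, awk ignores leading and trailing blanks and treats any run of spaces or tabs as one separator, so NF is the word count. A pattern with no action prints the line unchanged. The function name `at_least_three_words` is an illustrative assumption; the input is the sample data from the first post.

```shell
# Print only lines with three or more whitespace-separated fields.
# The line itself is printed untouched, including its original spacing.
at_least_three_words() {
    awk 'NF > 2'
}

printf '%s\n' 'how now' 'now is the time for' 'now' ' ' '  holy  cow  ' \
    '  the quick   brown fox   ' ' jumped over' 'the candlestick' |
    at_least_three_words
```

Because awk counts fields rather than matching a regular expression, a line with exactly three words and no trailing space is handled without any special case.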

firstfire · 08-19-2012, 11:55 PM

Hi.

Using egrep (or grep -E):

Code:

$ cat infile
how now
now is the time for
now
 
  holy  cow  
  the quick   brown fox   
 jumped over
the candlestick
$ egrep '(\w+ +){3}' infile
now is the time for
  the quick   brown fox

The same with basic RE:

Code:

grep  '\(\w\+ \+\)\{3\}' infile

danielbmartin · 08-20-2012, 08:53 AM

Quote:

Originally Posted by firstfire

Code:

$ egrep '(\w+ +){3}' infile
now is the time for
  the quick   brown fox

This works but I don't understand it. Please elaborate.
This is my (mis)understanding.

Code:

{3} means 3 instances of (\w+ +) 
\w means "a word"
 + means "zero or more blanks"

Why is there a + following \w?

Daniel B. Martin

grail · 08-20-2012, 10:24 AM

Not quite:

Code:

\w means a word character class ... i.e. the same as [_[:alnum:]] (letters, digits and underscore)
+ means one or more of the preceding item

The issue with the code example given is that if the line contains only 3 words, there will be no space at the end, hence it will fail.
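
A quick way to see what `\w` and `+` actually cover, assuming GNU grep (the -o option prints each match on its own line). The test string is made up for illustration.

```shell
# \w+ grabs runs of letters, digits and underscores:
# prints foo_bar and baz42 on separate lines.
echo 'foo_bar baz42' | grep -oE '\w+'

# [[:alnum:]]+ stops at the underscore:
# prints foo, bar and baz42 on separate lines.
echo 'foo_bar baz42' | grep -oE '[[:alnum:]]+'
```

In both cases `+` applies to the single item before it, so `\w+` is "one or more word characters", which is what makes it behave like "a word".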

firstfire · 08-20-2012, 10:30 AM

Hi.

Well, as Firefox developers say, this is embarrassing. There should be '*' (a.k.a. the Kleene star, zero or more) instead of '+' (one or more):

Code:

egrep '(\w+ *){3,}'

This regular expression matches a string consisting of three or more words, each followed by zero or more spaces, which is what a string of three or more words looks like.

The previous attempt (with ' +') worked on your sample data because there was no line with exactly 3 words. Had there been one, it would have needed at least one space after the last word for that RE to match:

Code:

$ echo 'a b c' | egrep '(\w+ +){3}'
$ echo 'a b c ' | egrep '(\w+ +){3}'
a b c 
$ echo 'a b c ' | egrep '(\w+ *){3}'
a b c
$ echo 'a b c' | egrep '(\w+ *){3}'
a b c

Note the trailing space after 'c' in the second and third commands.

EDIT: grail beats me again

danielbmartin · 08-20-2012, 10:47 AM

The first code line fails but the code in the examples is different, and it works. Is there a t7po?

Daniel B. Martin

grail · 08-20-2012, 11:17 AM

You might need to be a bit more specific, Daniel, about which first line of code you are referring to.

firstfire · 08-20-2012, 12:31 PM

Hi, Daniel.

Again, I'm wrong:

Code:

$ echo 'how now'| sed -r  's/(\w+ *)(\w+ *)(\w+ *)/\1:\2:\3/'
how :no:w

So '(\w+ *){3}' is bad: since ' *' can match nothing, the three groups can split a single word between them. It looks like the only way to do this using RE is to treat the last word separately:

Code:

$ egrep '(\w+ +){2}\w' infile
now is the time for
  the quick   brown fox

I apologize for the misleading posts. Shame on me.
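
To make the difference concrete, here is a small sketch assuming GNU grep. The function name `three_words` and the test lines are illustrative assumptions. Note that the pattern uses a literal space, so tab-separated words would need `[[:space:]]` in its place.

```shell
# Lenient pattern: one long word is enough, because each \w+ can
# take a single letter and ' *' can match nothing.
echo 'candlestick' | grep -cE '(\w+ *){3}'      # counts 1 matching line

# Stricter pattern: two words, each followed by at least one space,
# and then the first character of a third word.
three_words() {
    grep -E '(\w+ +){2}\w'
}

# Only 'a b c' and the 'the quick brown fox' line pass.
printf '%s\n' 'candlestick' 'how now' 'a b c' '  the quick   brown fox   ' |
    three_words
```

The stricter pattern no longer needs a trailing space after the last word, which was the weakness of the original '(\w+ +){3}'.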

danielbmartin · 08-20-2012, 01:14 PM

Quote:

Originally Posted by firstfire

It looks like the only way to do this using RE is to treat the last word separately:

Code:

$ egrep '(\w+ +){2}\w' infile

This one is good.

Quote:

I apologize for misleading posts. Shame on me

You are forgiven. It has been a learning experience for both of us.

I'd mark this thread as SOLVED but it already wears that badge of honor.

Daniel B. Martin