LinuxQuestions.org - [SOLVED] Elimination of lines having fewer than 3 words

danielbmartin

08-19-2012 07:44 PM

Elimination of lines having fewer than 3 words

Have:

Code:

how now

now is the time for

now

 

  holy  cow  

  the quick  brown fox  

 jumped over

the candlestick

Want:

Code:

now is the time for

  the quick  brown fox

I've been fumbling with variations on this...

Code:

sed -r '/\w{3}.+/p' $InFile

... without success.

Please advise.

Daniel B. Martin

byannoni

08-19-2012 07:50 PM

This is off the top of my head; if it doesn't work for you, I'll be happy to develop it further:

Code:

awk -F'\\s*' 'NF > 4'

Edit:
Actually, this works better:

Code:

perl -ne 'print $_ if $_ =~ /\s*(?:\w+\s+){2,}\w+/'

Here is an equivalent awk for the Perl:

Code:

awk '$0 ~ /\s*(\w+\s+){2,}\w+/'

lyle_s

08-19-2012 08:09 PM

Here's what I had in mind:

Code:

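#!/bin/bash

# Copy stdin to stdout, keeping only lines with at least 3 words.
# With no variable name, read stores the whole line in REPLY and
# keeps leading/trailing blanks; -r leaves backslashes untouched.
while read -r
do
        words=$(printf '%s\n' "$REPLY" | wc -w)
        if [ "$words" -ge 3 ]
        then
                printf '%s\n' "$REPLY"
        fi
done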

Code:

lyle@bowman:~/programming/sh$ ./lines < words.test 

now is the time for

  the quick  brown fox  

one two three

I added a line with 3 words to your sample data.

No awk/sed fanciness though.

Lyle.

danielbmartin

08-19-2012 09:07 PM

This didn't work ...

Code:

awk -F'\\s*' 'NF > 4'

... but you put me on the right track.
This does the job nicely ...

Code:

awk 'NF > 2'
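For anyone reading later, here is why this works, as far as I understand it: with the default field separator, awk splits each line on runs of blanks and tabs and ignores leading and trailing whitespace, so NF is effectively the word count. A pattern with no action prints the matching line unchanged. A small sketch follows; the function name keep_3plus and the sample lines are made up for illustration.

```shell
# keep_3plus is an illustrative name, not a standard tool.
# NF is the number of whitespace-separated fields on the line;
# a pattern with no action prints the whole line as-is.
keep_3plus() {
    awk 'NF > 2' "$@"
}

# Sample lines: two words, three words, blank-only, padded four words.
printf '%s\n' 'how now' 'one two three' ' ' '  the quick  brown fox  ' | keep_3plus
```

Only the three-word and four-word lines should come through, with their original spacing intact.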

Thank you.

Daniel B. Martin

firstfire

08-19-2012 11:55 PM

Hi.

Using egrep (or grep -E):

Code:

$ cat infile

how now

now is the time for

now

 

  holy  cow  

  the quick  brown fox  

 jumped over

the candlestick

$ egrep '(\w+ +){3}' infile

now is the time for

  the quick  brown fox

The same with basic RE:

Code:

grep '\(\w\+ \+\)\{3\}' infile

danielbmartin

08-20-2012 08:53 AM

Quote:

Originally Posted by firstfire (Post 4758746)

Code:

$ egrep '(\w+ +){3}' infile

now is the time for

  the quick  brown fox

This works but I don't understand it. Please elaborate.
This is my (mis)understanding.

Code:

{3} means 3 instances of (\w+ +) 

\w means "a word"

 + means "zero or more blanks"

Why is there a + following \w?

Daniel B. Martin

grail

08-20-2012 10:24 AM

Not quite:

Code:

\w means a single word character ... i.e. the same as [[:alnum:]_]

+ means one or more

The issue with the code example given is that if the line contains exactly 3 words, there will be no space after the last one, hence it will fail.
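A quick way to see the difference at the prompt, assuming GNU grep (the \w escape and -o are GNU extensions; the sample strings are made up for illustration):

```shell
# \w matches one word character (letter, digit or underscore) in GNU grep.
# -o prints each match on its own line, so the difference is visible.
echo 'holy cow' | grep -o '\w'      # one character per match
echo 'holy cow' | grep -o '\w\+'    # one whole word per match

# Exactly three words, no trailing space: '(\w+ +){3}' needs a space
# after the third word, so grep finds nothing here.
echo 'a b c' | grep -E '(\w+ +){3}' || echo 'no match'
```

The first command emits seven single-character matches, the second emits two words, and the third falls through to the 'no match' message.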

firstfire

08-20-2012 10:30 AM

Hi.

Well, as Firefox developers say, this is embarrassing.. There should be '*' (a.k.a. Kleene star -- zero or more) instead of '+' (one or more):

Code:

egrep '(\w+ *){3,}'

This regular expression matches a string consisting of three or more words, each followed by zero or more spaces, which is what a line of three or more words looks like.

The previous attempt (with ' +') worked on your sample data because there was no line with exactly 3 words. If there had been one, it would have needed at least one space after the last word for that RE to match:

Code:

$ echo 'a b c' | egrep '(\w+ +){3}'

$ echo 'a b c ' | egrep '(\w+ +){3}'

a b c 

$ echo 'a b c ' | egrep '(\w+ *){3}'

a b c

$ echo 'a b c' | egrep '(\w+ *){3}'

a b c

Note the trailing space after 'c'.

EDIT: grail beats me again :)

danielbmartin

08-20-2012 10:47 AM

Quote:

Originally Posted by firstfire (Post 4759270)

Hi.

Well, as Firefox developers say, this is embarrassing.. There should be '*' (a.k.a. Kleene star -- zero or more) instead of '+' (one or more):

Code:

egrep '(\w+ *){3,}'

Code:

$ echo 'a b c' | egrep '(\w+ +){3}'

$ echo 'a b c ' | egrep '(\w+ +){3}'

a b c 

$ echo 'a b c ' | egrep '(\w+ *){3}'

a b c

$ echo 'a b c' | egrep '(\w+ *){3}'

a b c

Note last space after 'c'.

EDIT: grail beats me again :)

The first code line fails but the code in the examples is different, and it works. Is there a t7po?

Daniel B. Martin

grail

08-20-2012 11:17 AM

You might need to be a bit more specific, Daniel: which first line of code are you referring to?

firstfire

08-20-2012 12:31 PM

Hi, Daniel.

Again, I'm wrong:

Code:

$ echo 'how now'| sed -r  's/(\w+ *)(\w+ *)(\w+ *)/\1:\2:\3/'

how :no:w

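So '(\w+ *){3}' is bad: because ' *' can match zero spaces, the three repetitions can split a single word between them, so a two-word line such as 'how now' still matches. It looks like the only way to do this using RE is to treat the last word separately: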

Code:

$ egrep '(\w+ +){2}\w' infile

now is the time for

  the quick  brown fox
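To check the edge cases discussed above, here is a small sketch, assuming GNU grep; the helper name at_least_3_words and the test lines are made up for illustration:

```shell
# at_least_3_words is an illustrative helper name.
# (\w+ +){2} : two words, each followed by at least one space
# \w         : the first character of a third word
at_least_3_words() {
    grep -E '(\w+ +){2}\w' "$@"
}

printf '%s\n' 'how now' 'a b c' 'a b c ' '  holy  cow  ' 'one two three four' | at_least_3_words
```

This should keep 'a b c', 'a b c ' and 'one two three four' and drop the two-word lines. One caveat: the pattern only counts words made of \w characters separated by spaces, so a line such as 'a, b, c' is not matched, whereas awk 'NF > 2' would keep it.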

I apologize for misleading posts. Shame on me :redface:

danielbmartin

08-20-2012 01:14 PM

Quote:

Originally Posted by firstfire (Post 4759364)

It looks like the only way to do this using RE is to treat last word separately:

Code:

$ egrep '(\w+ +){2}\w' infile

This one is good.

Quote:

I apologize for misleading posts. Shame on me :redface:

You are forgiven. It has been a learning experience for both of us.

I'd mark this thread as SOLVED but it already wears that badge of honor.

Daniel B. Martin

All times are GMT -5.