LinuxQuestions.org


danielbmartin 08-19-2012 07:44 PM

Elimination of lines having fewer than 3 words
 
Have:
Code:

how now
now is the time for
now
 
  holy  cow 
  the quick  brown fox 
 jumped over
the candlestick

Want:
Code:

now is the time for
  the quick  brown fox

I've been fumbling with variations on this...
Code:

sed -r '/\w{3}.+/p' $InFile
... without success.

Please advise.

Daniel B. Martin

byannoni 08-19-2012 07:50 PM

This is off the top of my head; if it doesn't work for you, I'll be happy to develop it further:
Code:

awk -F'\\s*' 'NF > 4'
Edit:
Actually, this works better:
Code:

perl -ne 'print $_ if $_ =~ /\s*(?:\w+\s+){2,}\w+/'
Here is an equivalent awk for the Perl:
Code:

awk '$0 ~ /\s*(\w+\s+){2,}\w+/'

lyle_s 08-19-2012 08:09 PM

Here's what I had in mind:
Code:

#!/bin/bash

while read -r    # -r keeps backslashes literal; the line lands in $REPLY
do
        if [ "$(printf '%s\n' "$REPLY" | wc -w)" -ge 3 ]
        then
                echo "$REPLY"
        fi
done

Code:

lyle@bowman:~/programming/sh$ ./lines < words.test
now is the time for
  the quick  brown fox 
one two three

I added a line with 3 words to your sample data.

No awk/sed fanciness though.

Lyle.
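
A note for later readers: the loop above starts one `wc` process per input line. A rough sketch of the same idea that lets the shell count the words itself might look like the following. The function name `at_least_three_words` and the inline test lines are my own, not from Lyle's post, and the splitting relies on the default IFS (space, tab, newline).

```shell
#!/bin/sh
# Hypothetical rewrite of the loop above; at_least_three_words is an
# illustrative name, not something from the thread.
at_least_three_words() {
    while IFS= read -r line; do
        set -f            # disable globbing so a word like * is not expanded
        set -- $line      # unquoted on purpose: split the line on whitespace
        set +f
        # $# now holds the number of whitespace-separated words
        if [ "$#" -ge 3 ]; then
            printf '%s\n' "$line"
        fi
    done
}

printf '%s\n' 'how now' '  the quick  brown fox ' 'one two three' | at_least_three_words
```

The original line is printed untouched, so leading and repeated blanks survive just as they do in the `wc` version.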

danielbmartin 08-19-2012 09:07 PM

This didn't work ...
Code:

awk -F'\\s*' 'NF > 4'
... but you put me on the right track.
This does the job nicely ...
Code:

awk 'NF > 2'
Thank you.

Daniel B. Martin
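
For anyone reading along, here is a small sketch of why `awk 'NF > 2'` is enough. With the default field separator, awk splits each line on runs of blanks and tabs and ignores leading and trailing whitespace, so NF is effectively the word count. The helper name `keep_long_lines` is made up for illustration; the sample is the data from the first post plus the three-word line Lyle added.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): print only lines with 3+ words.
# A bare pattern with no action prints the matching line unchanged.
keep_long_lines() {
    awk 'NF > 2'
}

# Sample data from the first post, plus one line with exactly 3 words.
sample='how now
now is the time for
now
 
  holy  cow 
  the quick  brown fox 
 jumped over
the candlestick
one two three'

printf '%s\n' "$sample" | keep_long_lines
```

Because awk prints `$0` as it was read, the surviving lines keep their original spacing.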

firstfire 08-19-2012 11:55 PM

Hi.

Using egrep (or grep -E):
Code:

$ cat infile
how now
now is the time for
now
 
  holy  cow 
  the quick  brown fox 
 jumped over
the candlestick
$ egrep '(\w+ +){3}' infile
now is the time for
  the quick  brown fox

The same with basic RE:
Code:

grep  '\(\w\+ \+\)\{3\}' infile

danielbmartin 08-20-2012 08:53 AM

Quote:

Originally Posted by firstfire (Post 4758746)
Code:

$ egrep '(\w+ +){3}' infile
now is the time for
  the quick  brown fox

This works but I don't understand it. Please elaborate.
This is my (mis)understanding.
Code:

{3} means 3 instances of (\w+ +)
\w means "a word"
 + means "zero or more blanks"

Why is there a + following \w?

Daniel B. Martin

grail 08-20-2012 10:24 AM

Not quite:

Code:

\w means a word character ... i.e. the same as [[:alnum:]_] (letters, digits, underscore)
+ means one or more

The problem with the example is that a line containing exactly 3 words has no space after the last word, so the pattern fails to match it.
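
A quick way to see this failure, sketched with a throwaway helper (`count_matches` is an illustrative name, not from the thread). The pattern is written with `[[:alnum:]_]`, the portable spelling of `\w`, so it should not depend on GNU extensions.

```shell
#!/bin/sh
# Count how many input lines match an extended regular expression.
# count_matches is an illustrative name, not something from the thread.
count_matches() {
    # $1 = ERE pattern; input comes from stdin.
    # grep exits 1 when nothing matches, so "|| true" keeps the script going.
    grep -E -c -- "$1" || true
}

# [[:alnum:]_] is the portable spelling of the \w shorthand.
word='[[:alnum:]_]+'

# Exactly three words, no trailing space: "word then space" fits only twice.
no_trailing=$(printf 'a b c\n' | count_matches "($word +){3}")

# Same words with a trailing space: the third repetition now finds its space.
with_trailing=$(printf 'a b c \n' | count_matches "($word +){3}")

echo "without trailing space: $no_trailing"
echo "with trailing space:    $with_trailing"
```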

firstfire 08-20-2012 10:30 AM

Hi.

Well, as the Firefox developers say, this is embarrassing. There should be '*' (a.k.a. the Kleene star, zero or more) instead of '+' (one or more):
Code:

egrep '(\w+ *){3,}'
This regular expression matches a string consisting of three or more words, each followed by zero or more spaces, which is what a string of three or more words looks like.

The previous attempt (with ' +') worked on your sample data because there was no line with exactly 3 words. Had there been one, it would have needed at least one space after the last word for that RE to match:
Code:

$ echo 'a b c' | egrep '(\w+ +){3}'
$ echo 'a b c ' | egrep '(\w+ +){3}'
a b c
$ echo 'a b c ' | egrep '(\w+ *){3}'
a b c
$ echo 'a b c' | egrep '(\w+ *){3}'
a b c

Note the trailing space after 'c' in the second and third commands.

EDIT: grail beats me again :)

danielbmartin 08-20-2012 10:47 AM

Quote:

Originally Posted by firstfire (Post 4759270)
There should be '*' (a.k.a. Kleene star -- zero or more) instead of '+' (one or more):
Code:

egrep '(\w+ *){3,}'

The first code line fails but the code in the examples is different, and it works. Is there a t7po?

Daniel B. Martin

grail 08-20-2012 11:17 AM

You might need to be a bit more specific, Daniel: which first line of code are you referring to?

firstfire 08-20-2012 12:31 PM

Hi, Daniel.

Again, I'm wrong:
Code:

$ echo 'how now'| sed -r  's/(\w+ *)(\w+ *)(\w+ *)/\1:\2:\3/'
how :no:w

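So '(\w+ *){3}' is bad: because the space is optional, a single word such as 'now' can be split into three one-letter "words", so even short lines match. It looks like the only way to do this using RE is to treat the last word separately: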
Code:

$ egrep '(\w+ +){2}\w' infile
now is the time for
  the quick  brown fox
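
To make the difference concrete, here is a rough side-by-side check of the two patterns. The helper `matches` and the variable names `loose` and `strict` are mine, and the sketch assumes a grep that understands `\w` (GNU grep does).

```shell
#!/bin/sh
# matches PATTERN LINE -> prints "yes" or "no" (illustrative helper).
matches() {
    if printf '%s\n' "$2" | grep -E -q -- "$1"; then
        echo yes
    else
        echo no
    fi
}

loose='(\w+ *){3}'      # space is optional, so one word can be split into three
strict='(\w+ +){2}\w'   # two words each followed by space(s), then the start of a third

for line in 'now' 'how now' 'a b c' '  the quick  brown fox '; do
    printf '%-26s loose=%-3s strict=%s\n' "[$line]" \
        "$(matches "$loose" "$line")" "$(matches "$strict" "$line")"
done
```

One caveat worth keeping in mind: these patterns count runs of word characters separated by spaces, so a line like 'a, b, c' is not matched by the strict pattern even though `awk 'NF > 2'` would count three fields.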

I apologize for misleading posts. Shame on me :redface:

danielbmartin 08-20-2012 01:14 PM

Quote:

Originally Posted by firstfire (Post 4759364)
It looks like the only way to do this using RE is to treat last word separately:
Code:

$ egrep '(\w+ +){2}\w' infile
This one is good.

Quote:

I apologize for misleading posts. Shame on me :redface:
You are forgiven. It has been a learning experience for both of us.

I'd mark this thread as SOLVED but it already wears that badge of honor.

Daniel B. Martin


All times are GMT -5.