LinuxQuestions.org

danielbmartin 03-06-2011 08:47 PM

grep to select lines with M in last word
 
I have a large file in which each line has three or more blank-delimited words. I'd like to code a grep to keep only those lines which have the letter M in the last word. If it's any help, the M (if present) will be the first character in the last word.

quanta 03-06-2011 09:53 PM

Something like this:
Code:

awk '{ if ($NF ~ /M/) print $0 }' input
NF stands for the number of fields on the line, so $NF is the last field.
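
For readers new to awk, a minimal sketch of what NF and $NF hold may help. The sample line and the variable names `field_count` and `last_word` are made up for illustration and do not come from the OP's file.

```shell
# Hypothetical sample line, not taken from the original file.
line='alpha beta Mango'

# NF is the number of whitespace-separated fields; $NF is the last field.
field_count=$(printf '%s\n' "$line" | awk '{ print NF }')
last_word=$(printf '%s\n' "$line" | awk '{ print $NF }')

echo "fields: $field_count, last word: $last_word"
```

Because awk splits on runs of blanks and tabs by default, a trailing space does not create an extra empty field.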

grail 03-06-2011 10:50 PM

Or grep could be:
Code:

egrep 'M[^ ]*$' file

quanta 03-06-2011 11:24 PM

Quote:

Originally Posted by grail (Post 4281087)
Or grep could be:
Code:

egrep 'M[^ ]*$' file

I like your solution. It is better than mine, which is just conventional thinking.

grail 03-07-2011 12:15 AM

Not better .. just different .. I am normally the awk proponent but as you beat me to it I was happy to give an alternative :)

Telengard 03-07-2011 02:16 AM

There is at least one case where egrep fails.
 
Quote:

Originally Posted by grail (Post 4281087)
Code:

egrep 'M[^ ]*$' file

Code:

~$ echo -e "Fable Mabel\nHairy Mary \nMary Martian"
Fable Mabel
Hairy Mary
Mary Martian
~$ echo -e "Fable Mabel\nHairy Mary \nMary Martian" | egrep 'M[^ ]*$'
Fable Mabel
Mary Martian
~$

Note the space at the end of the second line. Is Mary not to be considered a word just because it is followed by an errant space character?

Many human generated text files contain random, unnecessary space characters. They tend to accumulate in places where they go unnoticed, such as adjacent to other whitespace.

It matters even more when processing the output of other programs. For one example, ifconfig likes to add space characters before newline characters.

Code:

~$ ifconfig | hd | grep '20 20 0a'
00000210  6c 20 4c 6f 6f 70 62 61  63 6b 20 20 0a 20 20 20  |l Loopback  .  |
~$

In cases where this matters the awk command is almost certainly to be preferred.

Code:

~$ echo -e "Fable Mabel\nHairy Mary \nMary Martian" | awk '$NF~"M"'
Fable Mabel
Hairy Mary
Mary Martian
~$
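
If the original grep pattern has to stay, one possible workaround is to strip trailing whitespace before matching. This is only a sketch and was not proposed in the thread; the variable names `sample`, `trailing` and `matched` are illustrative.

```shell
# Sample input from the post above; the second line ends in a space.
sample=$(printf 'Fable Mabel\nHairy Mary \nMary Martian\n')

# Count the lines that end in whitespace, to see whether cleanup is needed.
trailing=$(printf '%s\n' "$sample" | grep -c '[[:space:]]$')

# Strip trailing whitespace first, then apply the original pattern.
matched=$(printf '%s\n' "$sample" | sed 's/[[:space:]]*$//' | grep -c 'M[^ ]*$')

echo "lines with trailing whitespace: $trailing"
echo "lines matched after cleanup: $matched"
```

The extra sed stage costs one more process, so it mainly makes sense when the pattern itself cannot be changed.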


danielbmartin 03-07-2011 06:51 AM

Quote:

Originally Posted by grail (Post 4281087)
Or grep could be:
Code:

egrep 'M[^ ]*$' file

This works, and I'm dazzled!

Please give a bit of explanation, and then I will mark this puppy as solved.

My newbie reading is this ...
The M is the character which governs selection or rejection.
The [^ ] says "apply this logic to strings starting with blank."
The * says "apply this logic to all such strings in each line."
The $ says "the last string in each line is the only important one."

Please revise this narrative to make it more correct and instructive.

Thank you!

David the H. 03-07-2011 07:35 AM

Quote:

Originally Posted by danielbmartin (Post 4281423)
Code:

egrep 'M[^ ]*$' file
My newbie reading is this ...
The M is the character which governs selection or rejection.
The [^ ] says "apply this logic to strings starting with blank."
The * says "apply this logic to all such strings in each line."
The $ says "the last string in each line is the only important one."

Not quite. * in regex means "zero or more of the previous item". In this case the previous item is [^ ], "any character that is not a space". So in layman's English, it could be read as "M, followed by any number of non-space characters, followed by the end of the line".

As pointed out, it would not match if there happen to be any spaces between the last word and the end of the line.

To catch that, you need to make a small modification.
Code:

egrep 'M[^[:space:]]*[[:space:]]*$'
So this would read as "M, followed by zero or more non-whitespace characters, followed by zero or more whitespace characters, followed by the end of the line".

I also replaced the simple space with the [:space:] character class here, meaning any kind of whitespace, so tabs would be matched in addition to regular spaces, although it's likely not necessary for your situation.
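
A quick way to compare the two patterns is to count the lines each one selects. The sketch below reuses the sample lines from earlier in the thread and adds one hypothetical negative case ("Mary Smith"); the variable names are illustrative.

```shell
# Sample lines from the thread plus one hypothetical negative case.
# The second line ends in a space.
sample=$(printf 'Fable Mabel\nHairy Mary \nMary Martian\nMary Smith\n')

# grep -E is the equivalent spelling of egrep.
original=$(printf '%s\n' "$sample" | grep -Ec 'M[^ ]*$')
patched=$(printf '%s\n' "$sample" | grep -Ec 'M[^[:space:]]*[[:space:]]*$')

echo "original pattern matched $original lines, patched pattern matched $patched"
```

The original pattern misses the line with the trailing space, while the patched one picks it up; neither selects "Mary Smith", because the M there is not in the last word.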

danielbmartin 03-07-2011 10:37 AM

Quote:

Originally Posted by David the H. (Post 4281452)
Not quite. * in regex means "zero or more of the previous item". In this case the previous item is [^ ], "any character that is not a space". So in layman's English, it could be read as "M, followed by any number of non-space characters, followed by the end of the line".

As pointed out, it would not match if there happen to be any spaces between the last word and the end of the line.

To catch that, you need to make a small modification.
Code:

egrep 'M[^[:space:]]*[[:space:]]*$'
So this would read as "M, followed by zero or more non-whitespace characters, followed by zero or more whitespace characters, followed by the end of the line".

I also replaced the simple space with the [:space:] character class here, meaning any kind of whitespace, so tabs would be matched in addition to regular spaces, although it's likely not necessary for your situation.

Thank you for this clear explanation. This question is SOLVED!

Telengard 03-07-2011 11:57 AM

Quote:

Originally Posted by David the H. (Post 4281452)
Code:

egrep 'M[^[:space:]]*[[:space:]]*$'

awk already considers tabs and normal space characters to be whitespace. My awk command is only 13 characters, while your egrep weighs in at a whopping 35 characters. How is your egrep different or better than this?

Code:

awk '$NF~"M"'

Dark_Helmet 03-07-2011 02:20 PM

My mother always told me not to stick my nose where it doesn't belong, but I don't listen to my mother very often.

Quote:

Originally Posted by Telengard
awk already considers tabs and normal space characters to be whitespace. My awk command is only 13 characters, while your egrep weighs in at a whopping 35 characters.

First, congratulations. Though, I must admit I missed the memo that Jeremy sent out that turned the forums into a competition.

Second, it wasn't David the H's command originally, but a follow-on to grail's.

Quote:

Originally Posted by Telengard
How is your egrep different or better than this?

How is it different? Well, grep is not awk. Therefore the commands are different.

How is it better? To quote the original post:
Quote:

Originally Posted by danielbmartin
I'd like to code a grep to keep only those lines which have the letter M in the last word.

So, a grep solution is better than an awk solution because the OP wanted grep. The OP did not ask for awk. The OP did not ask for an open-ended solution. The OP asked for grep.

So I assume the next time you ask someone to pass the salt you'll be happy when they give you pepper.

Relax... seriously. Don't get so defensive.

szboardstretcher 03-07-2011 02:26 PM

Code:

grep 'M[^ ]*$' filename
Quote:

$ (Dollar sign) = match expression at the end of a line, as in A$.
[^ ] = match any one character except those enclosed in [ ], as in [^0-9].
* (Asterisk) = match zero or more of the preceding character or expression.
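
To see which part of a line the three pieces actually cover, grep can print only the matched text. This sketch assumes a grep that supports the -o option (GNU and BSD grep do; it is not required by POSIX). The sample line is hypothetical.

```shell
# Hypothetical line with an M-word in the middle and at the end.
line='Hairy Mary Martian'

# -o prints only the text matched by the pattern, not the whole line.
matched_text=$(printf '%s\n' "$line" | grep -o 'M[^ ]*$')

echo "$matched_text"
```

"Mary" is skipped because a space follows it, so the $ anchor cannot match there; only the final word satisfies all three pieces.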

Telengard 03-07-2011 03:43 PM

Quote:

Originally Posted by Dark_Helmet (Post 4281844)
So, a grep solution is better than an awk solution because the OP wanted grep.

That's the only thing you said that makes sense to me. Thanks for pointing it out though. It is a valid reason to choose grep.

As for the rest of your message, it seems to be an unwarranted attempt to inject personal animosity into an otherwise friendly discussion. I see that you are ranked senior member, so I'm guessing you didn't get to be one <moderated>. I'll just say your message reads like a personal attack, although I don't claim it is.

Quote:

Relax... seriously. Don't get so defensive.
What are you even referring to? Seriously, I don't get it. I'm rereading my message right now, and I don't see how it comes off as defensive at all.
:scratch:

Anyway, this thread isn't a good place for you and me to make friends. Feel free to PM me if you don't want to change the topic of the OP.

colucix 03-07-2011 04:19 PM

This thread is going a bit off-topic! Please, keep discussion fair and reasonable. No need to be pedantic, disrespectful or - even worse - offensive towards other members. The OP already gained proper answers and hopefully learned something useful about regular expressions. Nuff' said!

Dark_Helmet 03-07-2011 04:38 PM

I'll unsubscribe from this thread immediately after this response. It's my hope that this won't be taken as offensive in any way--merely an explanation for my original response.

The original response from Telengard that I quoted (#10) characterized the egrep command from David the H. (a modified version of grail's original command that addressed Telengard's comments about end-of-line spaces) as worse than Telengard's own awk command, and did so by saying the egrep used a "whopping" number of characters. It's clear that "whopping" was not used in a complimentary way.

In my time here at LQ, I've participated in a number of threads. I have offered a number of solutions to problems. I cannot recall any instance where I described another member's proposed solution as worse or inferior to my own. I have pointed out technical problems with solutions posted, but I have never complained when a subsequent version of the command is posted to address those concerns. I simply, quietly let the OP decide which solution they want to use.

To me, such a complaint is an attack on the proposed solution in an effort to defend some other solution as "better." To me, that is unwarranted defensive behavior. We're all here to contribute and it's not about the "best" solution or the most "efficient" unless that's what the OP asks for.

David the H. 03-07-2011 04:47 PM

It's generally good to give a variety of solutions, even if only for educational purposes, which is the main reason I posted what I did. I agree that in this particular case awk may indeed be the more efficient solution, but I was more concerned with demonstrating how regular expressions work than in what works "best" here. And it may help some reader down the line working on something similar; someone who doesn't have access to or understand how to use awk, for example, or who needs a regex solution specifically.

colucix 03-07-2011 05:01 PM

Dark_Helmet, your arguments are reasonable and well phrased. Evidently, the slightly sarcastic tone you used in your previous post was not well received by Telengard, and he overreacted to the criticism. Now that your reasons are clearer, we hope that the discussion will end here and that you will shake hands.

colucix 03-07-2011 05:06 PM

David the H., thank you for clarifying.

fancylad 03-07-2011 07:26 PM

what about using perl?

Quote:

matt@amd:~$ echo -e "Fable Mabel\nHairy Mary \nMary Martian" | perl -lane 'print if $F[-1] =~ /^M/'
Fable Mabel
Hairy Mary
Mary Martian
:)

kurumi 03-07-2011 08:08 PM

Quote:

Originally Posted by Telengard (Post 4281715)
awk already considers tabs and normal space characters to be whitespace. My awk command is only 13 characters, while your egrep weighs in at a whopping 35 characters. How is your egrep different or better than this?

Code:

awk '$NF~"M"'

use // instead of ". and your regex is wrong.
Code:

awk '$NF~/^M/'

colucix 03-08-2011 02:57 AM

Quote:

Originally Posted by kurumi (Post 4282152)
use // instead of ". and your regex is wrong.
Code:

awk '$NF~/^M/'

Please, let's stop calling other members' solutions wrong or worse. The difference between
Code:

'$NF ~ "M"'
and
Code:

'$NF ~ /^M/'
is that the former uses a so-called dynamic regexp (even if it's actually a string constant), while the latter uses a regexp constant. In both cases the expression is very simple and shouldn't cause any problems, but there are some situations where one is preferred to the other. See http://www.gnu.org/software/gawk/man...mputed-Regexps for more detail. Furthermore, the additional anchor ^ is not really needed as per the OP's requirements, since the M (if present) will be the first character in the last word. Checking whether it is present should be enough.
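
For input that does not follow the OP's guarantee, the two expressions can give different results. The sketch below uses a hypothetical line whose last word contains an M that is not the first character; the variable names are illustrative.

```shell
# Hypothetical line whose last word contains M, but not at the start.
line='one two iMac'

# String constant used as a dynamic regexp: matches M anywhere in the last field.
anywhere=$(printf '%s\n' "$line" | awk '$NF ~ "M" { n++ } END { print n + 0 }')

# Regexp constant with an anchor: matches only if the last field starts with M.
anchored=$(printf '%s\n' "$line" | awk '$NF ~ /^M/ { n++ } END { print n + 0 }')

echo "unanchored matches: $anywhere, anchored matches: $anchored"
```

The difference in this sketch comes from the ^ anchor, not from the choice between a string and a regexp constant; the two forms diverge mainly when the pattern contains backslashes, which need extra escaping inside a string.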

Please note the thread has been marked as solved by the OP, so let's add further information only if it is really valuable and brings an actual improvement to the technical discussion. Thanks.


All times are GMT -5.