LinuxQuestions.org

grep ' book.* ' file (https://www.linuxquestions.org/questions/linux-newbie-8/grep-book-%2A-file-4175595344/)

vincix 12-13-2016 09:39 AM

grep ' book.* ' file
 
grep ' book.* ' file.txt (red is the match):

Code:

books book great books book
 bookssss with space
 bookq          too many spaces
A simple book without punctuation

(keep this for consistency: there are about 11 spaces between 'bookq' and 'too many' - I don't understand why it doesn't show them)

So, my question is, how come the word that follows the 'book' string is also matched? Shouldn't * match none or however many characters that precede it, i.e. the period, which is either nothing (if you have simply 'book') or, let's say, 'q' (like in the third case).

I'm not sure how * works in this case. Does it mean that it can be followed by anything? Then where should the match stop?

rknichols 12-13-2016 09:48 AM

Quote:

Originally Posted by vincix (Post 5641581)
grep ' book.* ' file.txt (red is the match):

books book great books book
bookssss with space
bookq too many spaces
A simple book without punctuation

(there are about 11 spaces between 'bookq' and 'too many' - I don't understand why it doesn't show them)

Where formatting is important, wrap your text in [CODE] ... [/CODE] tags. The "#" icon in the tools will do that.
Quote:

So, my question is, how come the word that follows the 'book' string is also matched? Shouldn't * match none or however many characters that precede it, i.e. the period, which is either nothing (if you have simply 'book') or, let's say, 'q' (like in the third case).
A "." in a regex matches any single character, and ".*" matches any number of any characters (including zero). The match is greedy: it takes as many characters as it can while still allowing the overall expression to match. In your case, the only other requirement is a trailing space character, so the ".*" will include everything up to (but not including) the last space character in the line.
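One way to see the greedy behaviour is grep's -o option, which prints only the matched text instead of the whole line. This is a minimal sketch, assuming a grep that supports -o (GNU and BSD grep do; it is not required by POSIX). The sample lines are re-created from the first post, and the brackets are added only to make the spaces visible:

```shell
# Feed the four sample lines to grep; -o prints only the matched part.
# sed then wraps each match in [] so leading/trailing spaces can be seen.
matches=$(printf '%s\n' \
  'books book great books book' \
  ' bookssss with space' \
  ' bookq           too many spaces' \
  'A simple book without punctuation' |
  grep -o ' book.* ' |
  sed 's/.*/[&]/')
printf '%s\n' "$matches"
# [ book great books ]
# [ bookssss with ]
# [ bookq           too many ]
# [ book without ]
```

In every line the match runs from the first " book" to the last space, which is why the word after "book" is included.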

vincix 12-13-2016 09:54 AM

Why isn't the last space included, given that there is a space after book.*?

rknichols 12-13-2016 10:02 AM

Quote:

Originally Posted by vincix (Post 5641591)
Why isn't the last space included, given that there is a space after book.*?

The space is included in the match. It is matched by the literal space in the expression, not by the ".*". If the ".*" did include that space, then the overall match would fail because there would be nothing to match the literal space at the end of the expression.
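The split between what ".*" matches and what the literal space matches can be made visible with a capture group. This is only an illustrative sketch using sed (the braces in the replacement are arbitrary markers, not regex syntax):

```shell
line='A simple book without punctuation'
# \(.*\) captures only what ".*" matched. The literal space after it is
# matched separately, so it gets its own {} pair in the output.
split=$(printf '%s\n' "$line" | sed 's/ book\(.*\) /{ book}{\1}{ }/')
printf '%s\n' "$split"
# A simple{ book}{ without}{ }punctuation
```

The middle pair holds " without" with no trailing space: the ".*" stopped one character short so that the final literal space still had something to match.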

vincix 12-13-2016 12:35 PM

".*" could also mean nothing, could it not? i.e. could mean a space (which is not included in ".*"). So in this line "A simple book without punctuation", why is " without" also included (i.e. space + without)? And I suppose the space after "without" is also included, isn't it?

szboardstretcher 12-13-2016 12:45 PM

Code:

. means any character
* means any number (zero or more) of the preceding character

so
Code:

.* means any number of any characters
and

Code:

book.* means book(any number of any characters up to the newline, and yes, space is a character)
Replies 2 and 4 go over this. If that is not sufficient, could you explain your question further?

A wonderful resource for testing out and learning by doing is https://regexone.com/

vincix 12-13-2016 12:54 PM

So you said "the ".*" will include everything up to (but not including) the last space character in the line."
But then you say that the space is included in the match. And I asked you about that last space character in the line. So that's why I feel that your explanation only partially cleared things for me.

If there's a space after .*, then that space is going to be matched, isn't it? Will that be the last space character in the line?

szboardstretcher 12-13-2016 12:57 PM

Let's pretend that the '_' character is a space so we can see what I mean:

Quote:

this_is_a_sentence_that_is_long_with_spaces_at_the_end________
grep 'sentence' will match ONLY the word sentence
'sentence'

grep 'sentence.' will match the word sentence AND one additional character (the space)
'sentence_'

grep 'sentence.*' will match the word sentence AND any number of characters up to the new line
'sentence_that_is_long_with_spaces_at_the_end________'
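The three cases above can be checked with grep -o, which prints only the matched text (assuming a grep with -o, such as GNU or BSD grep). In this sketch the underscores are kept as literal characters standing in for spaces; "." matches an underscore just as it would a space:

```shell
line='this_is_a_sentence_that_is_long_with_spaces_at_the_end________'
# Try each pattern against the same line and print what it matched.
for pattern in 'sentence' 'sentence.' 'sentence.*'; do
  printf '%-12s -> %s\n' "$pattern" "$(printf '%s\n' "$line" | grep -o "$pattern")"
done
# sentence     -> sentence
# sentence.    -> sentence_
# sentence.*   -> sentence_that_is_long_with_spaces_at_the_end________
```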

vincix 12-13-2016 01:13 PM

Well, that's exactly it. You didn't illustrate the difference between " book.* " and "book.* ". I mean, I can understand that ".*" matches up to the end of the line, that's not the problem. But I think it is trickier to understand the space after "book.*"

In your example, grep 'sentence.*' is going to be equivalent to grep 'sentence.* ', is it not? :) Or, to be more accurate, the latter is going to match everything only up to the first space after the word "end" - which, of course, we don't see.

So for instance:
grep ' book.* ' file.txt
A book with a space at the end of the line_
A book without a space at the end of the_line


Whereas grep ' book.*' file.txt highlights everything in both cases. (which is by now clear)

My conclusion is that " book.* " stops at the last space of the line, but it also includes it.
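That conclusion can be checked with the two example lines, one ending in a space and one not. A small sketch, again assuming grep supports -o; the variable names are made up for illustration and the brackets only make the trailing space visible:

```shell
with_space='A book with a space at the end of the line '
without_space='A book without a space at the end of the line'
# For each line, print only the part matched by ' book.* ', inside [].
for line in "$with_space" "$without_space"; do
  printf '[%s]\n' "$(printf '%s\n' "$line" | grep -o ' book.* ')"
done
# [ book with a space at the end of the line ]
# [ book without a space at the end of the ]
```

In both cases the match ends with the last space in the line, and that space is part of the match.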

vincix 12-13-2016 01:39 PM

What I find frustrating is that the expression doesn't stop at the first space (while including it). That's how I'd have seen it and that's why I feel it's rather unintuitive.

c0wb0y 12-13-2016 01:50 PM

Code:

.* = gobbles up everything from here until end of line (or multiline).
A* = gobbles up consecutive 'A's starting here; stops at the first character that is not an 'A'. If none found, it matches nothing and moves on.
'test *' = match the word 'test' optionally followed by space(s). Stop when no more spaces can be gobbled up.

Just like they said, space is included in the match.
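A quick way to try the "star after a single character" cases is grep -o (assumed available, as in GNU and BSD grep). The input strings here are invented for illustration:

```shell
# 'test *' takes the run of spaces directly after "test" and then stops.
spaces=$(printf '%s\n' 'test   case' | grep -o 'test *')
printf '[%s]\n' "$spaces"
# [test   ]

# With no space after "test", the star matches zero spaces and still succeeds.
nospace=$(printf '%s\n' 'testcase' | grep -o 'test *')
printf '[%s]\n' "$nospace"
# [test]

# 'xA*' takes only the consecutive A's after x; the later A is not reached.
run=$(printf '%s\n' 'xAAAbA' | grep -o 'xA*')
printf '[%s]\n' "$run"
# [xAAA]
```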

szboardstretcher 12-13-2016 01:56 PM

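Also, you can match a literal space:

grep 'sentence ' will grep only instances where sentence has a space after it. Inside the quotes the space needs no backslash: POSIX leaves a backslash before an ordinary character undefined, so 'sentence\ ' is best avoided.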

rknichols 12-13-2016 04:33 PM

For grep, the space character has no special significance. It's just another character. Perhaps it's easier to think about the character "x" instead of space. The expression
Code:

'book.*x'
will match the string "book" and any number of subsequent characters up to the last occurrence of "x" in the line. The ".*" will match everything up to but not including that final "x", and then the "x" in the expression matches itself. (And if there is no "x", the match will fail.)
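A short sketch of the "x" version, with made-up input lines and assuming a grep that supports -o and -q:

```shell
line='bookmarks fix six boxes'
# The match runs from "book" to the LAST x in the line (the one in "boxes").
upto=$(printf '%s\n' "$line" | grep -o 'book.*x')
printf '[%s]\n' "$upto"
# [bookmarks fix six box]

# No "x" after "book": the whole expression fails to match.
if printf '%s\n' 'bookmarks and pages' | grep -q 'book.*x'; then
  result='match'
else
  result='no match'
fi
printf '%s\n' "$result"
# no match
```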

If you want to match anything except a space, you have to write the expression that way:
Code:

grep 'book[^ ]*'
That will match "book", "book.", "books", "bookkeeper", "bookend", "book_index[s]->pagenum", etc., but will not include a space character or anything that follows.
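For comparison, here is the negated bracket expression on an invented line, again using grep -o (assumed available) to print only the matches:

```shell
line='my bookkeeper lost book_index[s]->pagenum today'
# [^ ]* matches any run of non-space characters, so each match ends
# just before the next space.
found=$(printf '%s\n' "$line" | grep -o 'book[^ ]*')
printf '%s\n' "$found"
# bookkeeper
# book_index[s]->pagenum
```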

vincix 12-14-2016 01:31 AM

Yes, indeed, it's clear. Thanks for summing it up :)

pan64 12-14-2016 01:57 AM

I see you are trying to understand how regexps work in general. There are a lot of resources on the net to test/check/try/practice, and I would like to suggest a few:
http://www.regexpal.com/ (you can use your mouse for explanation)
http://www.regexr.com/
http://www.myregexp.com/


All times are GMT -5.