[SOLVED] grep ' book.* ' file

vincix · 12-13-2016, 09:39 AM

grep ' book.* ' file.txt (red is the match):

Code:

books book great books book
 bookssss with space
 bookq           too many spaces
A simple book without punctuation

(there are about 11 spaces between 'bookq' and 'too many' - I don't understand why it doesn't show them)

So, my question is, how come the word that follows the 'book' string is also matched? Shouldn't * match none or however many characters that precede it, i.e. the period, which is either nothing (if you have simply 'book') or, let's say, 'q' (like in the third case).

I'm not sure how * works in this case. Does it mean that it can be followed by anything? Then where should the match stop?

rknichols · 12-13-2016, 09:48 AM

Quote:

Originally Posted by vincix

grep ' book.* ' file.txt (red is the match):

books book great books book
bookssss with space
bookq too many spaces
A simple book without punctuation

(there are about 11 spaces between 'bookq' and 'too many' - I don't understand why it doesn't show them)

Where formatting is important, wrap your text in [CODE] ... [/CODE] tags. The "#" icon in the tools will do that.

Quote:

So, my question is, how come the word that follows the 'book' string is also matched? Shouldn't * match none or however many characters that precede it, i.e. the period, which is either nothing (if you have simply 'book') or, let's say, 'q' (like in the third case).

A "." in a regex matches any character, and ".*" matches any number of any characters. The match is greedy, and will match as many characters as it can while still allowing the overall expression to match. In your case, the only other requirement is a space character, so the ".*" will include everything up to (but not including) the last space character in the line.
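A minimal sketch of the greedy behaviour described above, assuming a grep that supports the -o option (GNU and BSD grep do; POSIX does not require it). The variable names are illustrative only. -o prints just the matched text, and the brackets make the leading and trailing spaces visible:

```shell
# First sample line from the original post.
line='books book great books book'

# -o prints only the part of the line that matched.
match=$(printf '%s\n' "$line" | grep -o ' book.* ')

# Brackets make the surrounding spaces visible.
printf '[%s]\n' "$match"   # prints: [ book great books ]
```

The match starts at the first " book" and runs to the last space in the line, not the first one.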

vincix · 12-13-2016, 09:54 AM

Why isn't the last space included, given that there is a space after book.*?

rknichols · 12-13-2016, 10:02 AM

Quote:

Originally Posted by vincix

Why isn't the last space included, given that there is a space after book.*?

The space is included in the match. It is matched by the literal space in the expression, not by the ".*". If the ".*" did include that space, then the overall match would fail because there would be nothing to match the literal space at the end of the expression.
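One way to see which part of the pattern consumes that final space is to run the same line with and without the trailing literal space. This is only a sketch, again assuming grep -o is available; the variable names are made up for the illustration:

```shell
line='A simple book without punctuation'

# With the literal space: ".*" must give back enough text to leave a space for it.
with_space=$(printf '%s\n' "$line" | grep -o 'book.* ')

# Without it: ".*" is free to run to the end of the line.
without_space=$(printf '%s\n' "$line" | grep -o 'book.*')

printf '[%s]\n' "$with_space"      # prints: [book without ]
printf '[%s]\n' "$without_space"   # prints: [book without punctuation]
```

In the first result the closing space belongs to the literal space in the expression; the ".*" only covers " without".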

vincix · 12-13-2016, 12:35 PM

".*" could also mean nothing, could it not? i.e. could mean a space (which is not included in ".*"). So in this line "A simple book without punctuation", why is " without" also included (i.e. space + without)? And I suppose the space after "without" is also included, isn't it?

szboardstretcher · 12-13-2016, 12:45 PM

Code:

. means any character
* means zero or more of the preceding character

so

Code:

.* means any number of any characters

and

Code:

book.* means book(any number of any characters up to the newline, and yes, space is a character)

Replies 2 and 4 go over this. If that is not sufficient, could you explain your question further?

A wonderful resource for testing out and learning by doing is https://regexone.com/

vincix · 12-13-2016, 12:54 PM

So you said "the ".*" will include everything up to (but not including) the last space character in the line."
But then you say that the space is included in the match. And I asked you about that last space character in the line. So that's why I feel that your explanation only partially cleared things for me.

If there's a space after .*, then that space is going to be matched, isn't it? Will that be the last space character in the line?

szboardstretcher · 12-13-2016, 12:57 PM

Let's pretend that the '_' character is a space so we can see what I mean:

Quote:

this_is_a_sentence_that_is_long_with_spaces_at_the_end________

grep 'sentence' will match ONLY the word sentence
'sentence'

grep 'sentence.' will match the word sentence AND one additional character (the space)
'sentence_'

grep 'sentence.*' will match the word sentence AND any number of characters up to the new line
'sentence_that_is_long_with_spaces_at_the_end________'
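The three cases above can be reproduced in a shell. This is a sketch that assumes grep -o is available; show is a hypothetical helper that is not part of grep. It converts the underscores to real spaces before matching and back again for display:

```shell
# The thread's example, with '_' standing in for spaces.
shown='this_is_a_sentence_that_is_long_with_spaces_at_the_end________'
line=$(printf '%s' "$shown" | tr '_' ' ')   # turn the underscores into real spaces

# Print the text matched by pattern $1, with spaces shown as '_' again.
show() { printf '%s\n' "$line" | grep -o "$1" | tr ' ' '_'; }

r1=$(show 'sentence')     # the word only
r2=$(show 'sentence.')    # the word plus one more character
r3=$(show 'sentence.*')   # the word plus everything up to the end of the line
printf '%s\n' "$r1" "$r2" "$r3"
```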

vincix · 12-13-2016, 01:13 PM

Well, that's exactly it. You didn't illustrate the difference between " book.* " and "book.* ". I mean, I can understand that ".*" matches up to the end of the line, that's not the problem. But I think it is trickier to understand the space after "book.*"

In your example, grep '*sentence.*' is going to be equivalent to grep 'sentence.* ', is it not?

Or, to be more accurate, the latter is going to match everything only up to the first space after the word "end" - which, of course, we don't see.

So for instance:
grep ' book.* ' file.txt
A book with a space at the end of the line_
A book without a space at the end of the_line

Whereas grep ' book.*' file.txt highlights everything in both cases. (which is by now clear)

My conclusion is that " book.* " stops at the last space of the line, but it also includes it.
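That conclusion can be checked with the two lines from this post. The sketch below assumes grep -o is available and uses tr to render spaces as '_' so the ends of each match are visible; the variable names are illustrative:

```shell
# The two lines from the post: the first ends with a space, the second does not.
with_trailing='A book with a space at the end of the line '
without_trailing='A book without a space at the end of the line'

# Show the matched text with spaces rendered as '_'.
m1=$(printf '%s\n' "$with_trailing" | grep -o ' book.* ' | tr ' ' '_')
m2=$(printf '%s\n' "$without_trailing" | grep -o ' book.* ' | tr ' ' '_')

printf '%s\n' "$m1"   # _book_with_a_space_at_the_end_of_the_line_
printf '%s\n' "$m2"   # _book_without_a_space_at_the_end_of_the_
```

In both cases the match ends on the last space in the line, and that space is part of the match.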

vincix · 12-13-2016, 01:39 PM

What I find frustrating is that the expression doesn't stop at the first space (and including it). That's how I'd have seen it and that's why I feel it's rather unintuitive.

c0wb0y · 12-13-2016, 01:50 PM

Code:

.* = gobbles up everything from here until the end of the line (or multiline).
A* = gobbles up all consecutive 'A's starting here. If none are found, stop.
'test *' = match the word 'test' optionally followed by space(s). Stop when no more spaces can be gobbled up.

Just like they said, space is included in the match.
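A small sketch of the 'test *' case, assuming grep -o is available. The input strings are made up for the illustration, and underscores stand in for spaces on the way in and out:

```shell
# "test" followed by four spaces: the star takes all of them.
spaced=$(printf '%s\n' 'test____done' | tr '_' ' ' | grep -o 'test *' | tr ' ' '_')

# "test" followed by no space at all: the star matches zero spaces.
bare=$(printf '%s\n' 'testing' | grep -o 'test *' | tr ' ' '_')

printf '%s\n' "$spaced"   # test____
printf '%s\n' "$bare"     # test
```

Here the star applies only to the space that precedes it, so the match stops at the first non-space character.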

szboardstretcher · 12-13-2016, 01:56 PM

Also, you can escape spaces:

grep 'sentence\ ' will match only lines where "sentence" is followed by a space. (Inside quotes the backslash is optional; grep 'sentence ' matches the same lines.)

rknichols · 12-13-2016, 04:33 PM

For grep, the space character has no special significance. It's just another character. Perhaps it's easier to think about the character "x" instead of space. The expression

Code:

'book.*x'

will match the string "book" and any number of subsequent characters up to the last occurrence of "x" in the line. The ".*" will match everything up to but not including that final "x", and then the "x" in the expression matches itself. (And if there is no "x", the match will fail.)
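A quick sketch of the "x" version, assuming grep -o is available. The sample lines are invented for the illustration:

```shell
line='a book about unix and linux tools'

# ".*" runs up to the last "x" in the line; the literal "x" then matches it.
match=$(printf '%s\n' "$line" | grep -o 'book.*x')
printf '[%s]\n' "$match"   # prints: [book about unix and linux]

# With no "x" after "book", nothing matches and grep exits with status 1.
none=$(printf '%s\n' 'a book about grep' | grep -o 'book.*x' || true)
printf '[%s]\n' "$none"    # prints: []
```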

If you want to match anything except a space, you have to write the expression that way:

Code:

grep 'book[^ ]*'

That will match "book", "book.", "books", "bookkeeper", "bookend", "book_index[s]->pagenum", etc., but will not include a space character or anything that follows.
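The bracket expression can be tried the same way. This sketch assumes grep -o is available; the first sample line is invented, and the second is taken from the original post. With -o, each match is printed on its own line:

```shell
line='the bookkeeper lost a book_index[s]->pagenum and two books.'

# [^ ]* takes any run of non-space characters, so each match stops before a space.
words=$(printf '%s\n' "$line" | grep -o 'book[^ ]*')
printf '%s\n' "$words"
# bookkeeper
# book_index[s]->pagenum
# books.

# Adding literal spaces on both sides makes the match stop at the first space after "book".
first=$(printf '%s\n' 'A simple book without punctuation' | grep -o ' book[^ ]* ' | tr ' ' '_')
printf '%s\n' "$first"   # _book_
```

The second command is one way to get the "stop at the first space" behaviour asked about earlier in the thread.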

vincix · 12-14-2016, 01:31 AM

Yes, indeed, it's clear. Thanks for summing it up

pan64 · 12-14-2016, 01:57 AM

I see you are trying to understand how regexps work in general. There are a lot of resources on the net to test, check, try and practice with, and I would like to suggest a few:
http://www.regexpal.com/ (you can use your mouse for explanation)
http://www.regexr.com/
http://www.myregexp.com/