regex (for grep): a dot inside a bracket expression, and a few more questions

dedec0 · 03-23-2018, 03:07 PM

Hello (:

Although I am familiar with fairly complex regexes, I have a small doubt now: if I use a dot inside a bracket expression, does it still mean "most chars", as it does elsewhere?

Reading '$ man grep # GNU grep 2.27', there is no occurrence of the word dot! A bit strange. But there is a section "Character Classes and Bracket Expressions", which mentions [:alpha:] and others, and also says that "Most meta-characters lose their special meaning inside bracket expressions". I wanted to know which of them does and does not! I would say to other people look in the man page for that...

Further, I want to match everything but a closing bracket, *including* newlines! How to do that in file that possibly includes binary data before that "]"?

One of my first good tries is:

Code:

$ grep 'C\[[PA][av][^]]\+\]' file # does *not* grep newlines

One more point: I am unsure whether I need '-a' with this kind of file, which may contain binary data (or chars in any encoding) mixed with the brackets and ASCII chars that (roughly speaking) delimit it.

AwesomeMachine · 03-23-2018, 03:14 PM

Grep has a switch to treat as a binary file. If you want to give special characters their special meaning inside brackets, escape them with '\'. I think newline is '\n'.

dedec0 · 03-23-2018, 03:28 PM

Quote:

Originally Posted by AwesomeMachine

Grep has a switch to treat as a binary file.

For the kind of file I described, a text file with mixed chars in any encoding, should I use -a?

Quote:

Originally Posted by AwesomeMachine

If you want to give special characters their special meaning inside brackets, escape them with '\'. I think newline is '\n'.

My question was:

Quote:

Further, I want to match everything but a closing bracket, *including* newlines! How to do that in file that possibly includes binary data before that "]"?

Note that the "everything" I wrote includes chars in any encoding, so listing all of them is not an option. My attempt at that part of the regex is:

Code:

[^]]

But it does *not* match this part of my files, which I want:

Code:

C[Palkajsdalsk
laskjdasld
dlaksjdsaldjas]

If you would rather discard my attempt and suggest a completely new one, that is fine with me.

dedec0 · 03-23-2018, 03:33 PM

From what I said in #1, and from

Code:

$ man grep # GNU grep 2.27

where we read

Code:

       -a, --text
              Process a binary file as if it were text;
              this is equivalent  to
              the --binary-files=text option.

should I use -a for this task?

dedec0 · 03-23-2018, 03:48 PM

'-a' does not change the result for one file, but should I use it?

Code:

$  grep -a 'C\[[PA][av][^]]\+\]' file | md5sum
9621b629612e01686a6a6af3564a62e7  -

$  grep 'C\[[PA][av][^]]\+\]' file | md5sum
9621b629612e01686a6a6af3564a62e7  -
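For what it is worth, identical checksums only show that '-a' makes no difference for that particular file. A minimal sketch of a case where it can matter, assuming GNU grep and a made-up sample with a stray NUL byte in the comment (the function name `make_sample` is only for illustration):

```shell
# Hypothetical sample: an SGF-like comment with a stray NUL byte in it.
make_sample() { printf 'C[Pa one\000two]\n'; }

# Without -a, GNU grep treats this input as binary and only reports that
# it matches (on stdout or stderr, depending on the grep version).
plain=$(make_sample | grep -o 'C\[Pa' 2>/dev/null || true)

# With -a, the same input is processed as text and the match is printed.
as_text=$(make_sample | grep -a -o 'C\[Pa' || true)

printf 'without -a: %s\nwith -a:    %s\n' "$plain" "$as_text"
```

So '-a' looks like a harmless precaution for files that might contain such bytes, at the cost of possibly sending raw bytes to the terminal.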

keefaz · 03-23-2018, 03:53 PM

For grepping multiple lines, maybe try
-P (perlre)
-z (lines terminated with NULL instead of newline)
-o (print only matched)

Code:

grep -Poz 'C\[Pa[^]]+' file
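A small sketch of what those three options do together, assuming a GNU grep built with -P support. The sample text is made up, and the `tr` step is only there because each match printed under -z ends in a NUL byte rather than a newline:

```shell
# Hypothetical sample: a three-line SGF comment followed by a move.
sample='C[Palkajsdalsk
laskjdasld
dlaksjdsaldjas];B[dd]'

# -z reads the whole input as one record (there is no NUL byte in it),
# -P makes "+" a quantifier, and -o prints only the matched text.
# Turn the trailing NUL of each match back into a newline for display.
match=$(printf '%s\n' "$sample" | grep -Poz 'C\[Pa[^]]+' | tr '\000' '\n')

printf '%s\n' "$match"
```

The pattern stops just before the closing bracket, so the printed text runs from "C[Pa" to the last character of the comment.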

dedec0 · 03-23-2018, 04:51 PM

Quote:

Originally Posted by keefaz

For grepping multiple lines, maybe try
-P (perlre)
-z (lines terminated with NULL instead of newline)
-o (print only matched)

Code:

grep -Poz 'C\[Pa[^]]+' file

It is not entirely clear to me, so please explain:

- Why is the '-z' option there? Is it because of the "any encoding" chars?

- Why is the '-o' option relevant (or possibly relevant) to what I need? Maybe I had not considered that possibility for my problem files.

Based on my last attempts and your suggestion, I tried:

Quote:

grep -Poz 'C\[[PA][av][[:print:]\n]\+\]' file

That did not work: the result was empty! Maybe I need to rewrite that regex as a Perl expression. Could you help me with it?

One more try:

Code:

grep -oz 'C\[[PA][av][[:print:]\n]\+\]' f

This command has a conceptual error: '[[:print:]\n]\+' matches too much. It runs past the closing bracket that should end the expression and continues to the last one in the file (I think; at least it goes further than it should, as I saw in my test).

A simple question about this attempt: does it consider only chars that are valid in the locale where I run the command? Could it fail on files that are valid but use a different encoding?
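Two things seem to be going on in these attempts, and both can be checked on tiny made-up inputs (a sketch, assuming GNU grep with -P support). First, "\+" is a quantifier in GNU basic regexes but a literal plus sign with -P, which may explain the empty result. Second, [[:print:]] includes "]" itself, so a greedy match keeps going:

```shell
# "\+" means "one or more" in GNU basic regexes, but a literal "+" with -P.
sample=$(printf 'aaa\na+\n')
bre_count=$(printf '%s\n' "$sample" | grep -c 'a\+' || true)
pcre_count=$(printf '%s\n' "$sample" | grep -Pc 'a\+' || true)

# [[:print:]] includes "]", so a greedy match runs on to the last "]".
line='C[Pa x];B[dd]'
greedy=$(printf '%s\n' "$line" | grep -o 'C\[Pa[[:print:]]\+\]')
bounded=$(printf '%s\n' "$line" | grep -o 'C\[Pa[^]]\+\]')

echo "$bre_count $pcre_count"
echo "$greedy"
echo "$bounded"
```

The negated set [^]] stops at the first closing bracket, which is why it is the safer building block here.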

rknichols · 03-23-2018, 05:34 PM

Quote:

Originally Posted by dedec0

Reading '$ man grep # GNU grep 2.27', there is no occurrence of the word dot! A bit strange. But there is a section "Character Classes and Bracket Expressions", which mentions [:alpha:] and others, and also says that "Most meta-characters lose their special meaning inside bracket expressions". I wanted to know which of them does and does not! I would say to other people look in the man page for that...

The only meta-characters which keep their special meaning are those mentioned in the three sentences that follow in that paragraph:

"To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last."

All other characters, and that includes backslash and ".", are taken literally.
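A quick way to check this on a made-up input (a sketch; without -P, any POSIX-style grep should behave the same way):

```shell
# Sample input: one line with a literal dot, one without.
sample=$(printf 'a.c\nabc\n')

# Outside brackets, "." matches any character: both lines match.
any=$(printf '%s\n' "$sample" | grep -c 'a.c' || true)

# Inside brackets, "." is literal: only "a.c" matches.
literal=$(printf '%s\n' "$sample" | grep 'a[.]c' || true)

# A backslash is literal too, so [\n] means "backslash or n", not newline.
backslash=$(printf 'a\\c\nanc\naxc\n' | grep -c 'a[\n]c' || true)

echo "$any $literal $backslash"
```

The last count covers the lines "a\c" and "anc", which is why "\n" inside brackets does not help with newlines.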

syg00 · 03-23-2018, 05:39 PM

This has wandered around, but to answer the main question, you cannot grep for \n.
Being a stream, the newline is stripped prior to you getting the record(s). You can infer the location of the newline and reinsert it if needed (common with sed). Perl has the option to retain the newline in input, effectively slurping the entire file as a single record.

rknichols · 03-23-2018, 06:03 PM

Quote:

Originally Posted by syg00

This has wandered around, but to answer the main question, you cannot grep for \n.
Being a stream, the newline is stripped prior to you getting the record(s).

You can do it with the "-z" option:

"-z, --null-data
Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline."

There doesn't seem to be any way to include a literal newline within a bracket expression, but a "." (outside of a bracket expression) or a literal newline (in a single-quoted string to get it past the shell) will match a newline in the input.

Trying to test that is really hard.
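One small test that seems to work, assuming GNU grep and a made-up sample: under -z a negated bracket expression such as [^]] also matches newlines, so the original pattern can span lines unchanged. The `tr` step only converts the NUL that ends each printed match into a newline:

```shell
# Hypothetical sample: a three-line SGF comment followed by a move.
sample='C[Palkajsdalsk
laskjdasld
dlaksjdsaldjas];B[dd]'

# -z: the whole input is one record, so the newlines are ordinary characters.
# -o: print only the matched comment, not the whole record.
match=$(printf '%s\n' "$sample" |
  grep -zo 'C\[[PA][av][^]]\+\]' | tr '\000' '\n')

printf '%s\n' "$match"
```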

syg00 · 03-23-2018, 06:25 PM

Especially with binary - possibility of lots of nulls.
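To illustrate the concern with a made-up sample (a sketch, assuming a grep that supports -z): a NUL byte ends a record under -z, so a comment with a NUL in the middle is split in two and the pattern can no longer match it.

```shell
# Hypothetical sample: a comment with a NUL byte between its two lines.
make_sample() { printf 'C[Pa one\n\000two]\n'; }

# Under -z the NUL ends a record, so no single record holds both "C[Pa"
# and the closing "]"; grep -c reports zero matching records.
count=$(make_sample | grep -zc 'C\[Pa[^]]*\]' || true)

echo "$count"
```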

AwesomeMachine · 03-23-2018, 08:02 PM

Grep doesn't work with every possible collection of bytes. It only works with certain characters and nonprinting characters. If the file contains binary data you can convert it to text, but the result would be meaningless. If you want to find everything grep will find besides ']', then you would

Code:

$ grep -v '\]' file

dedec0 · 03-23-2018, 08:38 PM

Quote:

Originally Posted by AwesomeMachine

Grep doesn't work with every possible collection of bytes. It only works with certain characters and nonprinting characters. If the file contains binary data you can convert it to text, but the result would be meaningless. If you want to find everything grep will find besides ']', then you would

Code:

$ grep -v '\]' file

"All possible bytes" may be hard to handle. The files are SGF files for weiqi games, but the move comments inside them (due to how they are generated) may be in any encoding. I do not have the slightest idea which bytes this situation rules out.

Finding with -v (which inverts the match)... nice idea. But I cannot use only that. I need to find the start for the specific ']' found (which is the "C[" start from the longest regexes in my attempts). Two piped greps? I will think about this idea for some time...

dedec0 · 03-23-2018, 08:43 PM

Quote:

Originally Posted by syg00

Especially with binary - possibility of lots of nulls.

I know that ordinary binary files often contain lots of zero bytes, but I think that may not be the case here. The files are SGF, as I said (with more details) in the previous post.

AwesomeMachine · 03-23-2018, 09:01 PM

OK, I looked at the SGF format. There is no binary data used in them. There are nonprinting characters, which are not the same as binary data. I suggest you just tell the community exactly what you want to do, post part of the file inside quote tags, and let everyone look at what you have. I'm confident grep is not the tool to use for your project.