regex (for grep): a dot inside a bracket expression, and a few more questions
Hello (:
Although I am familiar with fairly complex regexes, I have a small doubt: if I use a dot inside a bracket expression, does it still mean "almost any character", as it does elsewhere?
Reading '$ man grep # GNU grep 2.27', there is no occurrence of the word dot! A bit strange. But there is a section "Character Classes and Bracket Expressions", which mentions [:alpha:] and others, and also says that "Most meta-characters lose their special meaning inside bracket expressions". I wanted to know which of them do and which do not! I would tell other people to look in the man page for that...
Further, I want to match everything but a closing bracket, *including* newlines! How can I do that in a file that may include binary data before that "]"?
One of my first good tries is:
Code:
$ grep 'C\[[PA][av][^]]\+\]' file # does *not* grep newlines
Please advise: I am unsure whether I need '-a' with this kind of file, which may contain binary data (or characters in any encoding) mixed with the brackets and ASCII characters that, roughly speaking, give it its structure.
Grep has a switch to treat as a binary file. If you want to give special characters their special meaning inside brackets, escape them with '\'. I think newline is '\n'.
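As a minimal sketch of what -a (--text) changes, assuming GNU grep: a NUL byte anywhere in the input makes grep classify it as binary, and -a tells grep to process it as text anyway, so the matching line itself is printed. The input below is invented for the demonstration, and the variable name is only illustrative.

```shell
# Hypothetical input: one line containing a NUL byte, then an SGF-like comment.
# Assumes GNU grep; -a (--text) processes binary input as if it were text.
with_a=$(printf 'junk\000junk\nC[Pa hello]\n' | grep -a 'C\[Pa')
echo "$with_a"
```

Without -a, GNU grep only reports that a binary file matches instead of printing the line.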
For the kind of file I described, a text file with mixed characters in any encoding, should I use -a?
Quote:
Originally Posted by AwesomeMachine
If you want to give special characters their special meaning inside brackets, escape them with '\'. I think newline is '\n'.
My question was:
Quote:
Further, I want to match everything but a closing bracket, *including* newlines! How can I do that in a file that may include binary data before that "]"?
Note that the "everything" I wrote includes characters in any encoding, so listing all of them is not an option. My attempt at that part of the regex is:
Code:
[^]]
But it does *not* match this part of my files, which is what I want:
Code:
C[Palkajsdalsk
laskjdasld
dlaksjdsaldjas]
If anyone prefers to discard my attempt and offer a completely new one, that is fine with me.
For grepping multiple lines, maybe try
-P (perlre)
-z (lines terminated with NUL instead of newline)
-o (print only the matched part)
Code:
grep -Poz 'C\[Pa[^]]+' file
It is not entirely clear to me, so please explain:
- Why is the '-z' option there? Because of the "any encoding" characters?
- Why is the '-o' option relevant (or possibly relevant) to what I need? Maybe I did not consider that possibility for my problem files.
Combining my latest attempts with what you suggest, I tried:
Code:
grep -Poz 'C\[[PA][av][[:print:]\n]\+\]' file
That did not work. Empty result! Maybe I need to rewrite that regex as a Perl expression. Could you help me with it?
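One likely reason for the empty result, offered as a sketch rather than a diagnosis: under -P the pattern follows Perl syntax, where '+' is the quantifier and '\+' is a literal plus sign (the reverse of grep's default basic syntax). The example assumes GNU grep built with -P support, reuses the three-line sample posted earlier, and swaps in '[^]]' for the middle part; the variable names are illustrative.

```shell
sample='C[Palkajsdalsk
laskjdasld
dlaksjdsaldjas]'

# Under -P, '\+' demands a literal "+" before the "]", so nothing matches.
escaped=$(printf '%s\n' "$sample" | grep -Poz 'C\[[PA][av][^]]\+\]' | tr '\0' '\n' || true)

# An unescaped '+' is the Perl quantifier; the negated class also matches newlines.
plain=$(printf '%s\n' "$sample" | grep -Poz 'C\[[PA][av][^]]+\]' | tr '\0' '\n')

printf 'escaped: [%s]\nplain:\n%s\n' "$escaped" "$plain"
```

The tr call only turns the NUL terminator that -z puts after each match back into a newline for display.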
One more try:
Code:
grep -oz 'C\[[PA][av][[:print:]\n]\+\]' f
This command has a design error: '[[:print:]\n]\+' matches too much. It runs past the closing bracket that should end the expression and continues to the last one in the file (I think; in my test it certainly matched more than it should).
A simple question about this first try: does it consider only characters that are valid in the locale where I run the command? Might it fail on files that are valid but do not fit that locale?
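The overshoot can be reproduced on a single line. ']' is itself a printable character, so a set containing [:print:] consumes it and the match extends to the last ']' available, whereas '[^]]' cannot cross a ']'. A small sketch, assuming GNU grep (for '\+' and -o); the two-comment input line is invented.

```shell
# Hypothetical line holding two bracketed comments.
line='C[Pa one] x C[Av two]'

# [:print:] includes ']', so the match runs to the last ']' on the line.
greedy=$(printf '%s\n' "$line" | grep -o 'C\[[PA][av][[:print:]\n]\+\]')

# [^]] stops at the first ']', so each comment is matched separately.
bounded=$(printf '%s\n' "$line" | grep -o 'C\[[PA][av][^]]\+\]')

printf 'greedy:\n%s\nbounded:\n%s\n' "$greedy" "$bounded"
```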
Quote:
Reading '$ man grep # GNU grep 2.27', there is no occurrence of the word dot! A bit strange. But there is a section "Character Classes and Bracket Expressions", which mentions [:alpha:] and others, and also says that "Most meta-characters lose their special meaning inside bracket expressions". I wanted to know which of them do and which do not! I would tell other people to look in the man page for that...
The only meta-characters which keep their special meaning are those mentioned in the three sentences that follow in that paragraph:
"To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last."
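All other characters, including backslash and ".", are taken literally. (The sequences "[:", "[." and "[=" also remain special inside a bracket expression, since they open character classes, collating symbols and equivalence classes.)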
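A quick way to check this at the prompt, as a sketch with made-up input lines: inside brackets a dot matches only a dot, and '\n' is simply the two characters backslash and n.

```shell
# '[.]' matches only a literal dot, unlike a bare '.'.
dot_in_brackets=$(printf 'a.c\nabc\n' | grep 'a[.]c')
bare_dot_count=$(printf 'a.c\nabc\n' | grep -c 'a.c')

# '[\n]' is the set { backslash, n }, not a newline:
# it matches the lines 'a\c' and 'anc', but not 'axc'.
backslash_or_n=$(printf 'a\\c\nanc\naxc\n' | grep 'a[\n]c')

printf '%s\n%s\n%s\n' "$dot_in_brackets" "$bare_dot_count" "$backslash_or_n"
```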
This has wandered around, but to answer the main question, you cannot grep for \n.
Because grep reads its input as a stream of lines, the newline is stripped before you get the record(s). You can infer where the newline was and reinsert it if needed (common with sed). Perl has an option to retain the newline in its input, effectively slurping the entire file as a single record.
You can do it with the "-z" option:
"-z, --null-data
Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline."
There doesn't seem to be any way to include a literal newline within a bracket expression, but a "." (outside of a bracket expression) or a literal newline (in a single-quoted string to get it past the shell) will match a newline in the input.
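As a sketch of how -z and -o fit together, assuming a reasonably recent GNU grep in which a negated bracket expression such as '[^]]' also matches a newline under -z: with no NUL bytes in the input, -z makes the whole file one "line", and -o prints just the matched part instead of that whole record. The sample is the three-line comment posted earlier; the variable names are illustrative.

```shell
sample='C[Palkajsdalsk
laskjdasld
dlaksjdsaldjas]'

# -z: records end at NUL, so the newlines are ordinary data inside one record.
# -o: print only the match; each match is followed by NUL, turned into a newline by tr.
match=$(printf '%s\n' "$sample" | grep -oz 'C\[[PA][av][^]]\+\]' | tr '\0' '\n')

printf '%s\n' "$match"
```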
Grep doesn't work with every possible collection of bytes. It only works with certain characters and nonprinting characters. If the file contains binary data you can convert it to text, but the result would be meaningless. If you want to find everything grep will find besides ']', then you would use:
Code:
$ grep -v '\]' file
"All possible bytes" may be asking a lot. The files are SGF files for weiqi games, but the move comments inside them (because of how they are generated) may be in any encoding. I do not have the slightest idea which bytes this rules out.
Searching with -v (which inverts the match)... nice idea. But I cannot use only that. I need to find the start that belongs to the specific ']' found (the "C[" start from the longest regexes in my attempts). Two piped greps? I will think about this idea for some time...
Especially with binary - possibility of lots of nulls.
I know that ordinary binary files often contain lots of zero bytes, but I do not think that is the case here. The files are SGF, as I described in more detail in my previous post.
OK, I looked at the SGF format. There is no binary data used in them. There are nonprinting characters, which are not the same as binary data. I suggest you just tell the community exactly what you want to do, post part of the file inside quote tags, and let everyone look at what you have. Rest assured, grep is not the tool to use for your project.