Linux - Software: This forum is for Software issues.
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
I have a file in iso-latin1 encoding (no doubt about that!) and a shell script that needs to grep lines from that file.
I have read about grep --binary. I have also read about doing grep "\x##". But it does not work for me. Check the test I have built now and a few comments I make here about some steps.
Below are commands and their output copied from my terminal. Lines starting with $ are commands, others are output:
Code:
$ # an iso-latin1 file with a few lines and two chars in this encoding:
$ # ó (0xf3) and ú (0xfa)
$ cat arq
<td align=left nowrap><a href="...">
aaóaa grep this line
</a>
só do not grep this line
número neither this
<br>
$ # the file *is* in iso-latin1 encoding. There is no doubt!
$ # hd and hexdump are the same file, just different output
$ hd arq # it shows an hexdump with a bit better default than command 'hexdump'
00000000 3c 74 64 20 61 6c 69 67 6e 3d 6c 65 66 74 20 6e |<td align=left n|
00000010 6f 77 72 61 70 3e 3c 61 20 68 72 65 66 3d 22 2e |owrap><a href=".|
00000020 2e 2e 22 3e 0a 61 61 f3 61 61 20 67 72 65 70 20 |..">.aa.aa grep |
00000030 74 68 69 73 20 6c 69 6e 65 0a 3c 2f 61 3e 0a 73 |this line.</a>.s|
00000040 f3 20 64 6f 20 6e 6f 74 20 67 72 65 70 20 74 68 |. do not grep th|
00000050 69 73 20 6c 69 6e 65 0a 6e fa 6d 65 72 6f 20 6e |is line.n.mero n|
00000060 65 69 74 68 65 72 20 74 68 69 73 0a 3c 62 72 3e |either this.<br>|
00000070 0a |.|
00000071
$ # The terminal I am using is UTF-8, no doubt on that too.
$ # This should output the wanted line with string "aaóaa"
$ cat arq |grep --binary "aa\xf3aa"
$ # No output! I need to type "ó" in iso-latin1 encoding! Grep doesn't have it? Just with Perl regex?? :-/
$ cat arq |grep --binary "aaóaa" # do not work because terminal is UTF-8? Seems so.
$
$ # The "ó" is not correctly shown for this command because it spits an iso-latin1 char in UTF-8 term
$ cat arq |grep --binary -P "aa\xf3aa"
aa�aa grep this line
$
I have seen a question on Stack Overflow about this, but the accepted answer uses --binary together with --text, which seems nonsense to me.
The \x notation is not present in grep regexes? The -P/--perl-regexp is also used... if a regex can be written without the -P flag, it solves my problem here!
No way but -P?? I am disappointed, if that is true.
which seems to be the key to this - whatever it means.
If I were you I'd start with
Code:
man grep
:-/ You assumed that I did not try anything before making this thread, which is wrong.
Option by option:
Code:
grep -o # Return only matching part. This makes no difference for my test, removed
grep -b # same as --byte-offset (show byte offset of match before it); also removed
grep -U # same as --binary, but much less readable
grep -a # same as --text - nonsense to me, since -U is given
grep -P # as said above, same as --perl-regexp
# What are we left with? My question and my test, right?
Quote:
:-/ You assumed that I did not try anything before making this thread, which is wrong.
We cannot know what you've done/tried, unless you actually tell us what you did. You only said "Check the test I have built"...but don't tell us what that test is, only hint at it. The sample you provided doesn't show the character (at least not on my screen), that you say it should look for, and the grep you posted doesn't have any of the options...
Quote:
Option by option:
Code:
grep -o # Return only matching part. This makes no difference for my test, removed
grep -b # same as --byte-offset (show byte offset of match before it); also removed
grep -U # same as --binary, but much less readable
grep -a # same as --text - nonsense to me, since -U is given
grep -P # as said above, same as --perl-regexp
What are we left with? My question and my test, right?
...you posted here. Read the "Question Guidelines" link in my posting signature. Please post a valid sample of the file you're looking at (please, not a copy/paste of a command-output...the actual line(s) from the file), along with what you're typing in to look for it.
I am sorry, there is a misunderstanding here. I fully showed the test, it is just a few commands. I forgot to mention that the code I put on the first post was several lines copied from my terminal. This was "naturally obvious" to me, I had just used [codes] for terminal lines in a few posts I wrote just before.
I have edited my first post to fix this detail and added a few more comments about the test. Is it better now?
If you want simply to know if the character is present, use grep's -c option.
Ermmm... xxd? :-/ It can do the reverse way, but I need to have the full line where that "ó" appears (and a few more chars, it is a string in iso-latin1). Is that possible? I cannot do that with xxd. Can you?
The full story is:
- I download an HTML page which is encoded in iso-latin1
- I want to process and get some information in a few lines of it
- I tried to use the magic . and the apparently working [^o] (which should match anything but a plain "o")... neither worked the way I imagined (or wanted).
My basic pipe commands in this step of the whole script are:
I assign this pipe to a variable (backquotes) and process it. The difficulty is the latin1 char for the first grep. Isn't there another way but the "Perl highly experimental regex" of "\x##"? No normal regex for that in grep? The manpage says
Quote:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is highly
experimental and grep -P may warn of unimplemented features.
which made me think of something else existing. But I could not find that.
It is really not clear what is in your file, or what you expect to get in your grep results.
Here is my best guess, but let us know if this does not lead to the desired result.
First, what is actually in your file? You have posted a hexdump, but it is not clear whether you want to grep the file, or the hexdump of the file. (Did you hexdump this on a Windows machine or Linux? "hd" is not a utility on my Linux system.)
Reversing the hexdump, I would guess the file contents look like this...
Code:
<td align=left nowrap><a href="...">
aaóaa grep this
</a>
só grep this too
número
<br>
Now, if you want to grep this file, grep can handle the Unicode itself, so something like this will work:
Code:
grep '[óú]' example.bin
aaóaa grep this
só grep this too
número
If you actually want to grep the hexdump of that file you need to make sure that the character encodings as stored on disk are as expected (system dependent!). On my system, the two characters are identified as f3 and fa digraphs in Vim as expected, but the binary unicode representation on disk (hex and byte order) appear as follows:
Code:
0000000: 20 20 3c 74 64 20 61 6c 69 67 6e 3d 6c 65 66 74 <td align=left
0000010: 20 6e 6f 77 72 61 70 3e 3c 61 20 68 72 65 66 3d nowrap><a href=
0000020: 22 2e 2e 2e 22 3e 0a 20 20 61 61 c3 b3 61 61 20 "...">. aa..aa
0000030: 67 72 65 70 20 74 68 69 73 0a 20 20 3c 2f 61 3e grep this. </a>
0000040: 0a 20 20 73 c3 b3 20 67 72 65 70 20 74 68 69 73 . s.. grep this
0000050: 20 74 6f 6f 0a 20 20 6e c3 ba 6d 65 72 6f 0a 20 too. n..mero.
0000060: 20 3c 62 72 3e 0a <br>.
In UTF-8, every character outside ASCII is multi-byte, and the first byte of a two-byte character (which these are) always has its two high bits set (110xxxxx), which gives 'c' as the leading hex digit here, hence c3 b3 and c3 ba. UTF-8 is a byte sequence, so the order does not depend on the system.
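The difference between the two encodings can be checked directly. This is only a sketch, assuming a POSIX printf (octal escapes), od, and an iconv that knows ISO-8859-1; none of these commands come from the posts above.

```shell
# 0xf3 (octal 363) is "ó" in ISO-8859-1: a single byte.
printf '\363' | od -An -tx1

# The same character converted to UTF-8 becomes the two bytes c3 b3.
printf '\363' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1

# Likewise 0xfa (octal 372), "ú", becomes c3 ba.
printf '\372' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

This is why the two hexdumps in this thread differ: one file holds f3/fa, the other c3 b3/c3 ba.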
So, to grep the hex output itself I would do something like...
Code:
xxd -g 1 example.bin |grep 'c3'
0000020: 22 2e 2e 2e 22 3e 0a 20 20 61 61 c3 b3 61 61 20 "...">. aa..aa
0000040: 0a 20 20 73 c3 b3 20 67 72 65 70 20 74 68 69 73 . s.. grep this
0000050: 20 74 6f 6f 0a 20 20 6e c3 ba 6d 65 72 6f 0a 20 too. n..mero.
Now, it has been a while since I worked fully through the twists of Unicode character storage, so take the above as direction, maybe not destination.
But using the above as guide, see if you can describe for us more completely what you are trying to do, the contents of the file you are working on, on what platform (important) and what you expect as a result.
You updated your info as was typing my earlier post, so a quick catch-up...
My concerns with Unicode were that you stated your terminal was UTF-8 and I was not confident that the hexdump was actually the data you were trying to grep. The 'hd' utility for hexdump appears to be for Windows, RedHat up to 7.3 and old SCO machines, so I was concerned that it might be translating between iso-latin1 and/or Windows codepages and/or UTF-8, and/or... just a confidence thing. But I think that is not a concern now.
You are now very clear (I think) that the file is in fact iso-latin1, and that what you want is the actual text lines, not the hex dumped lines.
As syg00 says, grep -P is up to the task. I don't think the binary flag is necessary, but use -a if -P alone doesn't work on your system. (By the way, what is the actual system you are doing this on?).
Here is my own working test case. The file data in iso-latin1 text (with UTF-8 substitutions for this post) and the hexdump are...
Code:
cat iso-latin1.bin
<td align=left nowrap><a href="...">
aaóaa grep this
</a>
só grep this too
número
<br>
xxd -g 1 iso-latin1.bin
0000000: 20 20 3c 74 64 20 61 6c 69 67 6e 3d 6c 65 66 74 <td align=left
0000010: 20 6e 6f 77 72 61 70 3e 3c 61 20 68 72 65 66 3d nowrap><a href=
0000020: 22 2e 2e 2e 22 3e 0a 20 20 61 61 f3 61 61 20 67 "...">. aa.aa g
0000030: 72 65 70 20 74 68 69 73 0a 20 20 3c 2f 61 3e 0a rep this. </a>.
0000040: 20 20 73 f3 20 67 72 65 70 20 74 68 69 73 20 74 s. grep this t
0000050: 6f 6f 0a 20 20 6e fa 6d 65 72 6f 0a 20 20 3c 62 oo. n.mero. <b
0000060: 72 3e 0a r>.
And the grep (without UTF-8 substitutions):
Code:
grep -P '\xf3|\xfa' iso-latin1.bin
aaa grep this
sgrep this too
número
On my UTF-8 locale it appears the characters are changed, as it may on yours, but they are fine (cat used only to match your case)...
Send that down your pipeline and you should be OK.
The only additional caution I would offer is that if you open the file in Vi on a UTF-8 enabled machine, it will want to automatically convert those characters to Unicode by default. To prevent that, open with the -b option. Other utilities in the pipeline "may" want to make them Unicode on a UTF-8 locale, as indicated by your UTF-8 terminal comment.
There have been three replies since my last post. I will try to make all relevant or needed comments in this one, now. Reading these replies in the order they appear.
Quote:
astrogeek: It is really not clear what is in your file, or what you expect to get in your grep results.
I imagined that 'hd' is something common. 'man hd' shows me a page that starts with
Quote:
HEXDUMP(1) BSD General Commands Manual HEXDUMP(1)
NAME
hexdump, hd — ASCII, decimal, hexadecimal, octal dump
I have known and used it for years. My OS now is an old Ubuntu 10.04. I have not installed hd, I guess it exists in all "debianesques" distributions.
Curious to remember (or to find) why I prefer to use hd, I tested both: the output is different! Check it here. I have done these commands:
I have now added a comment about 'hexdump' vs. 'hd' to the first post. They differ only in their default output, but some people do not have hd on their system, which surprises me. It has existed on every Linux where I have tried to use it (for many years!). I have also added the normal file dump to the first post (but be careful: copying it to save in a test file may lead to different byte values for the special chars).
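To avoid the copy/paste problem, the test file can be recreated byte for byte. A minimal sketch, assuming a POSIX printf; the scratch directory variable `dir` is only for illustration, and \363 / \372 are the octal forms of 0xf3 and 0xfa.

```shell
# Hypothetical scratch directory; "arq" is the file name from the first post.
dir=$(mktemp -d)
arq="$dir/arq"

# %s keeps the HTML lines literal; the octal escapes write the raw
# ISO-8859-1 bytes for "ó" (\363) and "ú" (\372).
printf '%s\n' '<td align=left nowrap><a href="...">' > "$arq"
printf 'aa\363aa grep this line\n' >> "$arq"
printf '%s\n' '</a>' >> "$arq"
printf 's\363 do not grep this line\n' >> "$arq"
printf 'n\372mero neither this\n' >> "$arq"
printf '%s\n' '<br>' >> "$arq"

# The hexdump in the first post ends at offset 0x71, i.e. 113 bytes.
wc -c < "$arq"
```

If the byte count matches, hd/hexdump/xxd on this file should show the same f3 and fa bytes as in the first post.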
No way but -P and "\x##" ?? I am disappointed, if that is true.
Quote:
Originally Posted by astrogeek
Now, if you want to grep this file, grep can handle the Unicode itself, so something like this will work:
Code:
grep '[óú]' example.bin
aaóaa grep this
só grep this too
número
No, this is wrong. It does *not* work unless the terminal has the same encoding as the file. I have added another grep command to show this in the first post.
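One way to make the two encodings agree is to convert the stream before grep sees it. A sketch only, assuming iconv is installed and the script itself is saved as UTF-8; the scratch file `f` is hypothetical.

```shell
# Hypothetical scratch file with the two "ó" lines of the test, in ISO-8859-1.
f=$(mktemp)
printf 'aa\363aa grep this line\ns\363 do not grep this line\n' > "$f"

# Convert to UTF-8 first; after that, a pattern typed on a UTF-8
# terminal consists of the same bytes as the text it should match.
iconv -f ISO-8859-1 -t UTF-8 "$f" | grep 'aaóaa'
```

The matched line then comes out in UTF-8, so anything later in the pipeline that expects iso-latin1 would need a conversion back.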
@astrogeek: you grep'ed from the hexdump just the byte for "ó" in iso-latin1. I need to grep a full string, and I want the whole line containing that string passed to the next command in the pipeline. I cannot imagine a way to do this with hd/hexdump/xxd. You ask me to show a more complete situation. My answer: it is not necessary. The complete file would be pointlessly big to put here. The only problem I have with that script is: terminal in UTF-8 (my choice); file in iso-latin1 (server choice); shell script written by me, being built up from test commands in the terminal. And maybe the problem can be reduced to how to use (a form of) "\x##" without the -P grep flag.
Quote:
Originally Posted by syg00
The perl regex support in grep is fine for your needs. All you need is
Code:
cat $arq |grep -aP "\xf3"
Is it the *only* way?
-------------------------------------
@astrogeek for the #10 post: Does not hurt to repeat yet again: terminal in UTF-8, file in iso-latin1. You also did the test of:
This test works because cat does *not* change any byte before putting it to the pipe! See:
Code:
$ cat iso-latin1.file |grep -P '\xf3|\xfa' > out
$ hd out # or hexdump! The important here are the "ó" and "ú" latin1 bytes
00000000 61 61 f3 61 61 20 67 72 65 70 20 74 68 69 73 20 |aa.aa grep this |
00000010 6c 69 6e 65 0a 73 f3 20 64 6f 20 6e 6f 74 20 67 |line.s. do not g|
00000020 72 65 70 20 74 68 69 73 20 6c 69 6e 65 0a 6e fa |rep this line.n.|
00000030 6d 65 72 6f 20 6e 65 69 74 68 65 72 20 74 68 69 |mero neither thi|
00000040 73 0a |s.|
00000042
$ # Remember: term UTF-8, file in iso-latin1! This justifies the "?" chars here instead of [óú] :
$ cat out
aa�aa grep this line
s� do not grep this line
n�mero neither this
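If the only goal is to read such output on a UTF-8 terminal, the bytes can be converted at display time and the file left untouched. A minimal sketch, assuming iconv is available; the scratch file `out` stands in for the "out" file above.

```shell
# Hypothetical stand-in for the "out" file: three lines in ISO-8859-1.
out=$(mktemp)
printf 'aa\363aa grep this line\ns\363 do not grep this line\nn\372mero neither this\n' > "$out"

# Convert only for display; the file on disk keeps its f3/fa bytes.
iconv -f ISO-8859-1 -t UTF-8 "$out"
```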
Finally, regarding astrogeek's post #10: I am used to editing files in different encodings, different linebreaks (Win, *nix, ...) and probably different byte orders (for multibyte chars in Unicode encodings). Vim (not vi!) is my favorite editor, and it deals gracefully with such situations. (:
To end this post, yet another repetition of the main detail of my problem, as it looks now:
Is there a way to pass byte values to grep without that experimental -P flag? There are 3 other matchers to choose! :-/
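One possibility that avoids -P is to let the shell, not the regex, produce the byte. This is a sketch under assumptions (GNU grep, a POSIX printf); as far as I know the basic, extended and fixed-string matchers do not define a \x## escape, so the raw byte is placed in the pattern. The names `f` and `pat` are hypothetical.

```shell
# Hypothetical scratch file with the two "ó" lines of the test, in ISO-8859-1.
f=$(mktemp)
printf 'aa\363aa grep this line\ns\363 do not grep this line\n' > "$f"

# printf turns the octal escape \363 into the raw byte 0xf3, so the
# pattern holds the byte itself and grep needs no escape syntax.
pat=$(printf 'aa\363aa')

# LC_ALL=C makes grep treat each byte as one character, so the input
# is not judged against UTF-8 rules; -a keeps grep from classifying
# the input as binary. No -P is involved.
LC_ALL=C grep -a "$pat" "$f"
```

The matched line is printed with its original iso-latin1 bytes, which is what the rest of a pipeline working on the raw file would expect.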
Quote:
No, this is wrong. It does *not* work unless the terminal has the same encoding as the file. I have added another grep command to show this in the first post.
based on man grep:
Code:
The locale for category LC_foo is specified by examining the three environment variables LC_ALL, LC_foo, LANG, in that order.
The first of these variables that is set specifies the locale.
For example, if LC_ALL is not set, but LC_MESSAGES is set to pt_BR, then the Brazilian Portuguese locale is used for the LC_MESSAGES category.
The C locale is used if none of these environment variables are set, if the locale catalog is not installed, or if grep was not compiled with national language support ( NLS ).
So this is the way to specify the locale, nothing else. So I do not really understand why you need another solution. Furthermore:
About the previous post, please do not put copied man pages in code tags. The lines will not wrap, which makes them hard to read. Could you please edit it?
Use a font like "Courier New", instead:
The locale for category LC_foo is specified by examining the three environment variables LC_ALL, LC_foo, LANG, in that order.
The first of these variables that is set specifies the locale.
For example, if LC_ALL is not set, but LC_MESSAGES is set to pt_BR, then the Brazilian Portuguese locale is used for the LC_MESSAGES category.
The C locale is used if none of these environment variables are set, if the locale catalog is not installed, or if grep was not compiled with national language support ( NLS ).
Now we can read it without any horizontal scrolling. (:
If you know you need a special locale, you can use:
Code:
LC_ALL=my_locale grep pattern file
I do not understand why you are talking about setting the locale and the related variables. Wouldn't LC_MESSAGES affect only the messages grep prints? I think it is not related to the encoding of files and how they are interpreted by grep (and most other programs) or written to its standard output.
I am using a UTF-8 terminal.
I have only the LC_CTYPE=C variable. My $LANG = pt_BR.utf8.
And I need to grep iso-latin1 chars from an iso-latin1 encoded file.
And I want to avoid using -P switch, if that is possible.
And you meant "LC_ALL=my_locale; grep pattern file" instead of "LC_ALL=my_locale grep pattern file" ? But this is out of my problem, I think.
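On the two spellings: they are not equivalent. The sketch below, assuming a POSIX sh, shows what each form does; it runs in a subshell so the calling shell's settings are not changed. For character handling the relevant category is LC_CTYPE rather than LC_MESSAGES, and LC_ALL overrides both.

```shell
(
  unset LC_ALL   # known starting state, only inside this subshell

  # Prefix form: LC_ALL is put into the environment of this one command.
  LC_ALL=C sh -c 'echo "prefix form, child sees: ${LC_ALL:-unset}"'

  # The current shell itself is unchanged afterwards.
  echo "prefix form, shell has: ${LC_ALL:-unset}"

  # Form with ";": an ordinary shell variable that stays set for later
  # commands but is not passed to child processes unless exported.
  LC_ALL=C; sh -c 'echo "semicolon form, child sees: ${LC_ALL:-unset}"'
)
```

So "LC_ALL=my_locale grep pattern file" affects only that one grep, which is usually what is wanted in a script.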