special character not displayed correctly by cat, more
We recently switched from Solaris to SUSE Linux and I am having problems with a special character
(Alt+0167, i.e. §) used as a delimiter in a file. (I am doing this over X11 running under Hummingbird.) cat and more display it as an oval character with a black question mark inside, whereas vi displays it correctly (and allows me to edit it). I tried using cut -d'§' -f1 to cut out the first field, but no go. Same thing with sed. I have tried changing the locale settings and different fonts, also with no luck. Anybody have any ideas about where the difference between vi and cat/more lies, or what I can try to get cat/more to display the character correctly (Xresources or profile maybe)? TIA for suggestions. |
What is your main locale? That character § (0xa7) displays properly in cat and in the command line on my system (using gnome-terminal ... and in xterm).
|
What's the encoding of the file? Linux uses UTF-8 almost exclusively. If the file is in a different encoding, you're likely to have trouble with non-ASCII stuff.
Edit: BTW, I believe cut can only deal with ASCII characters as delimiters. At least I get an error when I try it with anything else. |
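A quick way to settle the encoding question is to look at the raw bytes of the file. A minimal sketch, assuming the data is in a file called file.txt (the name is just an example):

```shell
# Inspect the file's encoding (file.txt is a hypothetical name):
file file.txt              # many versions report the charset, e.g. "ISO-8859 text"
od -An -tx1 file.txt | head
# If the delimiter shows up as the single byte a7, the file is ISO-8859-1;
# if it shows up as the byte pair c2 a7, it is UTF-8.
```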
My locale variables are all set to en_US.UTF-8. I tried some other values, but no luck. Any idea
what makes vi work and the rest of the utilities fail? Colleagues using PuTTY (without xterm) have no problem with this. On Solaris cut can handle the special chars, and I'm sure this holds for Linux too. Even if cut could not, I would still expect cat and more to at least display the character properly. Any ideas how I can figure out why vi works and cat does not? |
As I said before, I'm guessing the problem is with the encoding of the file. Most text editors have features to auto-detect the encoding, but the shell display is UTF-8 and most of the simpler tools are generally UTF-8- or ASCII-only. If your file isn't UTF-8, then it won't display properly in the terminal.
Code:
$ echo "foo§bar" > file.txt
Edit: The cut info page lists this (non-functional) option:
Code:
`-n' Do not split multi-byte characters (no-op for now). |
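If the file does turn out to be ISO-8859-1, converting it to UTF-8 is straightforward with iconv. A sketch (the filenames are hypothetical), using awk for the field extraction since some versions of cut reject multi-byte delimiters:

```shell
# Convert from ISO-8859-1 to UTF-8 (data.txt / data.utf8 are example names):
iconv -f ISO-8859-1 -t UTF-8 data.txt > data.utf8
# cat/more should now display § correctly in a UTF-8 locale, and awk
# accepts the multi-byte delimiter even where cut does not:
awk -F'§' '{print $1}' data.utf8
```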
Looks like you hit the nail on the head, David. I entered the commands you listed and got the exact
situation I described at the beginning. The files are being produced by a Perl script which connects to a MySQL database and selects data. And the MySQL database, and the resulting files, are indeed ISO-8859 in format. I'm still not clear on when the problem with non-ASCII delimiters in cut will occur. I entered the following:
Code:
>echo "aaaa§bbb" > bum
>cat bum
aaaa§bbb
>cut -d'§' -f1 bum
aaaa
>
This was with my locale set to UTF-8. I then changed it to en_US.ISO8859-1, with the same results. So I'm not really sure under what circumstances I would encounter the problem you described, but I am aware of it and know where to look in the future. My thanks to both of you for your responses and the quick resolution of the problem (actually I still have to resolve the problem in the file creation process, but at least I know where it is coming from). |
Glad you got it worked out.
I finally got cut to accept the "§" symbol when I changed both my locale and console encoding setting to ISO-8859-1; otherwise it wouldn't work for me. One thing you should realize is that the ISO-8859 family are 8-bit encodings, which means that every supported character is one byte. But UTF-8 is a variable-width encoding. The one-byte layer is identical to ASCII, but as you go up the Unicode chart, characters and symbols occupy 2, 3, or even 4 bytes. Most symbols like "§" are 2-byte characters in UTF-8. As to why it's working for you, I can't say; I'm only going by the error message I get when I try it. Perhaps your version is newer and has support included, or there's some subtle locale support difference. |
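The one-byte vs. two-byte difference is easy to verify at the byte level; a small sketch using printf octal escapes for the two encodings:

```shell
# § is one byte (a7) in ISO-8859-1:
printf '\247' | od -An -tx1
# but two bytes (c2 a7) in UTF-8:
printf '\302\247' | od -An -tx1
```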