LinuxQuestions.org - [SOLVED] special character not displayed correctly by cat, more


We recently switched from Solaris to SUSE Linux, and I am having problems with a special character
(Alt+0167, or §) used as a delimiter in a file. (I am working in X11 running under
Hummingbird.)

cat and more display it as an oval character with a black question mark inside, whereas vi displays
it correctly (and allows me to edit it). I tried using cut -d'§' -f1 to cut out the first field, but
no go. Same thing with sed. I have tried changing the locale settings and different fonts, also with no
luck. Does anybody know where the difference lies between vi and cat/more, or what I can try to get
cat/more to display the character correctly (Xresources or profile, maybe)? TIA for suggestions.

What is your main locale? That character § (0xa7) displays properly in cat and in the command line on my system (using gnome-terminal ... and in xterm).

What's the encoding of the file? Linux uses UTF-8 almost exclusively. If the file is in a different encoding, you're likely to have trouble with non-ASCII stuff.

Edit: BTW, I believe cut can only deal with ASCII characters as delimiters. At least I get an error when I try it with anything else.

My locale variables are all set to en_US.UTF-8. I tried some other values, but no luck. Any idea
what makes vi work and the rest of the utilities fail? Colleagues using PuTTY (without xterm) have no problem
with this. On Solaris - and I'm sure this holds for Linux - cut can handle the special chars. Even if cut
could not do so, I would expect cat and more to at least display the character properly. Any ideas how I can
figure out why vi works and cat does not?

As I said before, I'm guessing the problem is with the encoding of the file. Most text editors can auto-detect a file's encoding, but the terminal display expects UTF-8, and most of the simpler tools generally handle only UTF-8 or ASCII. If your file isn't UTF-8, it won't display properly in the terminal.

Code:

$ echo "foo§bar" > file.txt
$ file file.txt
file.txt: UTF-8 Unicode text

$ cat file.txt
foo§bar

$ iconv -f UTF-8 -t ISO-8859-1 file.txt > file2.txt
$ file file2.txt
file2.txt: ISO-8859 text

$ cat file2.txt
foo�bar
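
The same experiment can be run in the other direction, which is the direction the original poster needs. The following is only a sketch: it assumes iconv and od are installed, and the scratch directory and file names (workdir, latin1.txt, utf8.txt) are made up for illustration. It writes the bytes with printf octal escapes so the result does not depend on how the terminal is configured.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)

# Write "foo§bar" as ISO-8859-1 bytes: there, § is the single byte 0xA7 (octal 247).
printf 'foo\247bar\n' > "$workdir/latin1.txt"

# Dump the raw bytes as hex; a lone a7 is not a valid UTF-8 sequence,
# which is why a UTF-8 terminal shows a replacement glyph for it.
latin1_hex=$(od -An -tx1 "$workdir/latin1.txt" | tr -d ' \n')
echo "ISO-8859-1 bytes: $latin1_hex"

# Re-encode as UTF-8 so a UTF-8 terminal can render the character.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/latin1.txt" > "$workdir/utf8.txt"

# In UTF-8 the same character becomes the two bytes c2 a7.
utf8_hex=$(od -An -tx1 "$workdir/utf8.txt" | tr -d ' \n')
echo "UTF-8 bytes:      $utf8_hex"
```

Looking at the hex dump rather than at the rendered character takes the terminal and font out of the picture, so it shows what is really in the file.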

And I said that cut can't handle them as delimiters, not that it couldn't process text containing them. Whenever I try to use anything other than ASCII in the -d option, I get an error saying that "the delimiter must be a single character", which suggests that it won't accept multi-byte characters as delimiters.

Edit: The cut info page lists this (non-functional) option:

Code:

`-n' Do not split multi-byte characters (no-op for now).

This tells me that it's not currently able to distinguish between single- and multi-byte characters, but that they intend to include that feature in the future.
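
If your cut rejects a multi-byte delimiter like this, one possible workaround is awk, which accepts a field separator longer than one byte. This is a sketch rather than a drop-in replacement: the sample line and the variable names (line, delim, first, second) are invented for illustration, and the § is written as its UTF-8 bytes with printf so the example does not depend on the terminal encoding.

```shell
# Sample line in UTF-8, where § is the two bytes c2 a7 (octal 302 247).
line=$(printf 'aaaa\302\247bbb')
delim=$(printf '\302\247')

# awk takes a multi-byte field separator, so it can stand in for cut -d here.
first=$(printf '%s\n' "$line" | awk -F "$delim" '{ print $1 }')
second=$(printf '%s\n' "$line" | awk -F "$delim" '{ print $2 }')

echo "$first"    # prints: aaaa
echo "$second"   # prints: bbb
```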

Looks like you hit the nail on the head, David. I entered the commands you listed and I've got the exact
situation I described at the beginning. The files are being produced by a Perl script which connects to
a MySQL database and selects data. And the MySQL database - and the resulting files - are indeed ISO-8859
in format.

I'm still not clear on when non-ASCII characters will cause a problem in cut. I entered the following:

>echo "aaaa§bbb" > bum
>cat bum
aaaa§bbb
>cut -d'§' -f1 bum
aaaa
>

This was with my locale set to UTF-8. I then changed it to en_US.ISO8859-1, same results. So I'm not really
sure under what circumstances I would encounter the problem you described, however I am aware of it and know
where to look in the future.

My thanks to both of you for your responses and a quick resolution of the problem (actually I still have to
resolve the problem in the file creation process, but at least I know where it is coming from).

Glad you got it worked out.

I finally got cut to accept the "§" symbol when I changed both my locale and console encoding setting to ISO-8859-1. Otherwise it wouldn't work for me.
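
The single-byte case can be sketched like this, assuming a cut that treats the delimiter as one byte when run in the C locale (the sample line and variable names are made up for illustration). Here § is the ISO-8859-1 byte 0xA7, written with a printf octal escape.

```shell
# In ISO-8859-1, § is the single byte 0xA7 (octal 247).
delim=$(printf '\247')

# With LC_ALL=C the delimiter is one byte, so cut accepts it.
first=$(printf 'aaaa\247bbb\n' | LC_ALL=C cut -d "$delim" -f1)

echo "$first"    # prints: aaaa
```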

One thing you should realize is that the ISO-8859 family consists of 8-bit encodings, which means that every supported character is one byte. UTF-8, by contrast, is a variable-width encoding. Its one-byte range is identical to ASCII, but as you go up the Unicode chart, characters and symbols occupy 2, 3, or even 4 bytes. Latin-1 symbols like "§" (U+00A7) are 2-byte characters in UTF-8.
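
A quick way to see those byte widths is to count bytes with wc -c. This is just an illustrative sketch; the variable names are invented, and the characters are written as printf octal escapes so the counts do not depend on the terminal's encoding.

```shell
# wc -c counts bytes, not characters; tr strips the padding some wc versions add.
ascii_bytes=$(printf 'a' | wc -c | tr -d ' ')              # "a" in ASCII / UTF-8
section_bytes=$(printf '\302\247' | wc -c | tr -d ' ')     # § (U+00A7) in UTF-8
euro_bytes=$(printf '\342\202\254' | wc -c | tr -d ' ')    # € (U+20AC) in UTF-8
latin1_bytes=$(printf '\247' | wc -c | tr -d ' ')          # § in ISO-8859-1

echo "a in UTF-8:      $ascii_bytes byte(s)"
echo "§ in UTF-8:      $section_bytes byte(s)"
echo "€ in UTF-8:      $euro_bytes byte(s)"
echo "§ in ISO-8859-1: $latin1_bytes byte(s)"
```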

As to why it's working for you, I can't say. I'm only going by the error message I get when I try it. Perhaps your version is newer and has support included, or there's some subtle locale support difference or something.