Old 10-20-2011, 03:54 PM   #1
paisano
LQ Newbie
 
Registered: Oct 2011
Posts: 3

Rep: Reputation: Disabled
special character not displayed correctly by cat, more


We recently switched from Solaris to SUSE Linux and I am having problems with a special character (alt+0167, or §) used as a delimiter in a file. (I am doing this over X11 running under Hummingbird.)

cat and more display it as an oval character with a black question mark inside, whereas vi displays it correctly (and lets me edit it). I tried using cut -d'§' -f1 to cut out the first field, but no go. Same thing with sed. I have also tried changing the locale settings and different fonts, with no luck. Does anybody have an idea where the difference between vi and cat/more lies, or what I can try to get cat/more to display the character correctly (Xresources or profile maybe)? TIA for suggestions.

Last edited by paisano; 10-20-2011 at 03:56 PM. Reason: reformatted lines
 
Old 10-20-2011, 04:30 PM   #2
SecretCode
Member
 
Registered: Apr 2011
Location: UK
Distribution: Kubuntu 11.10
Posts: 562

Rep: Reputation: 102
What is your main locale? That character § (0xa7) displays properly in cat and in the command line on my system (using gnome-terminal ... and in xterm).
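
You can check with the locale command; on my system the output looks something like this (output trimmed, your values will differ):
Code:
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=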
 
Old 10-21-2011, 02:42 AM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
What's the encoding of the file? Linux uses UTF-8 almost exclusively. If the file is in a different encoding, you're likely to have trouble with non-ASCII characters.
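
You can usually tell with the file utility (the filename here is just an example):
Code:
$ file yourfile.txt
yourfile.txt: ISO-8859 text
If it reports ISO-8859 or something else rather than UTF-8, that would explain the display problem.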

Edit: BTW, I believe cut can only deal with ascii characters as delimiters. At least I get an error when I try it with anything else.

Last edited by David the H.; 10-21-2011 at 02:45 AM.
 
Old 10-21-2011, 05:37 PM   #4
paisano
LQ Newbie
 
Registered: Oct 2011
Posts: 3

Original Poster
Rep: Reputation: Disabled
My locale variables are all set to en_US.UTF-8. I tried some other values, but no luck. Any idea what makes vi work and the rest of the utilities fail? Colleagues using PuTTY (without xterm) have no problem with this. On Solaris - and I'm sure this holds for Linux - cut can handle the special chars. Even if cut could not, I would expect cat and more to at least display the character properly. Any ideas how I can figure out why vi works and cat does not?
 
Old 10-22-2011, 10:06 AM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
As I said before, I'm guessing the problem is with the encoding of the file. Most text editors have features to auto-detect the encoding, but the terminal's display encoding is UTF-8, and most of the simpler tools generally handle UTF-8 or ASCII only. If your file isn't UTF-8, it won't display properly in the terminal.

Code:
$ echo "foo§bar" > file.txt

$ file file.txt
file.txt: UTF-8 Unicode text

$ cat file.txt
foo§bar

$ iconv -f UTF-8 -t ISO-8859-1 file.txt > file2.txt

$ file file2.txt
file2.txt: ISO-8859 text

$ cat file2.txt
foo�bar
And I said that cut can't handle them as delimiters, not that it couldn't process text containing them. Whenever I try to use anything other than ascii in the -d option, I get an error saying that "the delimiter must be a single character", which demonstrates that it won't accept multi-byte delimiting characters.

Edit: The cut info page lists this (non-functional) option:
Code:
`-n'   Do not split multi-byte characters (no-op for now).
This tells me that it's not currently able to distinguish between single- and multi-byte characters, but that they intend to include that feature in the future.
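
If you need to split on a multi-byte delimiter in the meantime, awk will usually accept one as a field separator. A quick sketch, assuming both the file and your locale are UTF-8 (file.txt is the example file from above):
Code:
$ awk -F'§' '{print $1}' file.txt
foo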

Last edited by David the H.; 10-22-2011 at 12:09 PM. Reason: 1) as stated 2) rewording for clarity
 
Old 10-25-2011, 02:40 AM   #6
paisano
LQ Newbie
 
Registered: Oct 2011
Posts: 3

Original Poster
Rep: Reputation: Disabled
Looks like you hit the nail on the head, David. I entered the commands you listed and got exactly the situation I described at the beginning. The files are being produced by a Perl script which connects to a MySQL database and selects data, and the MySQL database - and the resulting files - are indeed in ISO-8859 format.

I'm still not clear on when the problem with non-ASCII characters in cut will actually occur, though. I entered the following:

Code:
>echo "aaaa§bbb" > bum
>cat bum
aaaa§bbb
>cut -d'§' -f1 bum
aaaa
>

This was with my locale set to UTF-8. I then changed it to en_US.ISO8859-1 and got the same results. So I'm not really sure under what circumstances I would run into the problem you described, but I am aware of it and know where to look in the future.

My thanks to both of you for your responses and a quick resolution of the problem (actually I still have to fix the problem in the file creation process, but at least I know where it is coming from).
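
In the meantime I can presumably just convert the generated files after the fact with iconv, along the lines shown above (the filenames here are only placeholders):
Code:
$ iconv -f ISO-8859-1 -t UTF-8 exported_file > exported_file.utf8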
 
Old 10-25-2011, 10:58 AM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Glad you got it worked out.

I finally got cut to accept the "§" symbol when I changed both my locale and console encoding setting to ISO-8859-1. Otherwise it wouldn't work for me.

One thing you should realize is that the ISO-8859 encodings are 8-bit encodings, which means every supported character is exactly one byte. UTF-8, however, is a variable-width encoding: the one-byte range is identical to ASCII, but as you go up the Unicode chart, characters and symbols occupy 2, 3, or even 4 bytes. Most symbols like "§" are 2-byte characters in UTF-8.
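
You can see the difference by dumping the bytes. For example (assuming a UTF-8 terminal; hexdump output trimmed):
Code:
$ printf '§' | hexdump -C
00000000  c2 a7                  |..|

$ printf '§' | iconv -f UTF-8 -t ISO-8859-1 | hexdump -C
00000000  a7                     |.|
Two bytes in UTF-8, one byte in ISO-8859-1.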

As to why it's working for you, I can't say. I'm only going by the error message I get when I try it. Perhaps your version is newer and has support included, or there's some subtle locale support difference or something.
 
  

