Hi there,
please allow me to put your post in a different order. The conclusion is at the end. ;-)
Quote:
Originally Posted by k_kush
Yes that much I know it is a web application that exchanges data using HTTP. When you say server side of the application does that mean that the changes have to be done the app code or is it required at the server level?
|
Well, typical web applications consist of some HTML and some program code on the server side; that program code is often written in PHP or ASP.NET, more rarely in Java. It produces HTML output that is sent to the client, so that a normal web browser is all you need to use the application.
When I speak about changes to the server side, I mean changes to this PHP or ASP or Java code, whatever it actually is.
Quote:
Originally Posted by k_kush
This is an application which my company uses. I am not too sure about the protocol.
|
If it's a web application in the usual meaning, it uses HTTP, so that it works with a plain browser.
Quote:
Originally Posted by k_kush
Ok so the server doesn't have to understand the character, but how to make it interpret the characters?
touch désolé
this is the result - d?sol?
|
Let me go a bit further and explain some basics.
[BEGIN: Character encoding basics]
Text can be stored in many different ways; the characters that make up the text can be coded in different ways.
One of the simplest and oldest encodings is ASCII: it uses one byte per character, with the highest bit unused, so it can represent 128 different characters. The first 32 of them are reserved as control characters (like line feed, escape, or end of transmission), and so is the very last one (DEL); that leaves 95 printable characters. These are the 26 letters of the basic English alphabet in upper and lower case, the digits 0..9, the space, and a few essential punctuation marks. That's it. No diacritics, no umlauts, no Greek or Cyrillic letters.
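To make that tangible, here's a quick sketch in Python (the language is just my choice for demonstration):

```python
# Every ASCII character fits in one byte with the top bit clear.
text = "Hello, World!"
data = text.encode("ascii")        # one byte per character
assert len(data) == len(text)
assert all(b < 128 for b in data)  # the top bit is never set

# Characters outside that small set simply have no ASCII code:
try:
    "désolé".encode("ascii")
except UnicodeEncodeError:
    print("no ASCII code for 'é'")
```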
In the 80's, IT vendors began to use the yet unused topmost bit and could then represent 256 different characters with a single byte. They kept the lower half as defined by ASCII, but the additional 128 characters were a mess for many years, because they were assigned differently on almost every computer and in almost every piece of software.
Over time, several specifications came up to standardize the whole set of 256 characters. However, people from different parts of the world, speaking many different languages, each found that a different set of characters was necessary; that's why there are several different 8-bit encodings. Today, the most important one (at least in the Western world) is the ISO-8859-x family, with the -x denoting a number of variants, but the majority of characters is the same across all ISO-8859 encodings.
But still, the ISO-8859-x series didn't contain the full set of characters used even within Europe, let alone Asia; hundreds of other characters were actually needed. As a solution to that Babylonian mix of character sets and encodings, experts all around the world created the Unicode character set, which contains a standardized (and still growing) set of many thousands of characters. Obviously, those can no longer be expressed with a single byte, but the designers came up with a clever compromise: they invented UTF-8 as one of the standard encodings for Unicode.
UTF-8 represents a character with a variable number of bytes. The 128 ASCII characters are stored as single bytes, as they always were. All characters beyond ASCII need 2, 3 or even 4 bytes. The letter 'ä' for example (used in German) is stored as a two byte sequence 0xC3, 0xA4.
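You can see this variable-length behavior directly in Python (again, just my choice of language for illustration):

```python
# ASCII characters still take a single byte in UTF-8 ...
assert "a".encode("utf-8") == b"a"

# ... but 'ä' becomes the two-byte sequence 0xC3 0xA4:
assert "ä".encode("utf-8") == b"\xc3\xa4"

# Characters further up in Unicode need three or even four bytes:
print(len("€".encode("utf-8")))   # 3
print(len("😀".encode("utf-8")))  # 4
```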
However, if a program has to display text containing this character, it has to know that this sequence is supposed to be one UTF-8 character. If it doesn't, and assumes a traditional 8-bit encoding like ISO-8859-1 instead, it won't display "Mädchen" [German for "girl"] as intended, but "MÃ¤dchen": it takes the two-byte sequence as two separate characters. A reader who's familiar with the language can still guess what it means, but it looks like garbage. Like the example in your first post.
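This mismatch is easy to reproduce in Python:

```python
# Store "Mädchen" as UTF-8 bytes ...
data = "Mädchen".encode("utf-8")   # b'M\xc3\xa4dchen'

# ... then read them back with the wrong assumption (ISO-8859-1):
print(data.decode("iso-8859-1"))   # MÃ¤dchen
```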
On the other hand, there are byte sequences that are not valid UTF-8. If a program expects UTF-8 and encounters such an invalid sequence, it displays a replacement character, often rendered as a question mark. That's what happened in your example above.
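The opposite direction, again sketched in Python:

```python
# "désolé" stored in ISO-8859-1 uses the single byte 0xE9 for 'é',
# which is not a valid UTF-8 sequence:
data = "désolé".encode("iso-8859-1")

# A strict UTF-8 decoder rejects it ...
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 sequence")

# ... and a forgiving one substitutes a replacement character:
print(data.decode("utf-8", errors="replace"))  # d�sol�
```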
[END: Character encoding basics]
Quote:
Originally Posted by k_kush
How to identify if I am using the UTF-8 encoding?
|
If that web application is an established standard in your company, you won't want to change anything about it - or maybe you're not even allowed to.
Instead, you have to find out which encoding is supposed to be used with this application - there should be people who know, or documentation that tells you. Then you have to configure your own tools accordingly: probably your browser, probably your text editor. If all programs involved in the process use the same encoding, there's no problem, even if one of them might "forget" to tell the others about it.
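As a starting point for finding out what your own system assumes, a small Python check (the actual values depend on your system's locale settings, so the outputs shown are only examples):

```python
import locale
import sys

# The encoding Python assumes for the system as a whole:
print(locale.getpreferredencoding())  # e.g. UTF-8

# The encoding used for terminal output:
print(sys.stdout.encoding)            # e.g. utf-8
```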
Doc CPU