Hi there,
please allow me to put your post in a different order. The conclusion is at the end. ;-)
Quote:
Originally Posted by k_kush
Yes that much I know it is a web application that exchanges data using HTTP. When you say server side of the application does that mean that the changes have to be done the app code or is it required at the server level?
|
Well, typical web applications consist of some HTML and some program code on the server side; that program code is often written in PHP or ASP.NET, more rarely in Java. It produces HTML output that is sent to the client, so that a normal web browser is all you need to use the application.
When I speak about changes to the server side, I mean changes to this PHP or ASP or Java code, whatever it actually is.
Quote:
Originally Posted by k_kush
This is an application which my company uses. I am not too sure about the protocol.
|
If it's a web application in the usual meaning, it uses HTTP, so that it works with a plain browser.
Quote:
Originally Posted by k_kush
Ok so the server doesn't have to understand the character, but how to make it interpret the characters?
touch désolé
this is the result - d?sol?
|
Let me go a bit further and explain some basics.
[BEGIN: Character encoding basics]
Text can be stored in many different ways; the characters that make up the text can be coded in different ways.
One of the simplest and oldest encodings is ASCII: it uses one byte per character, with the highest bit unused, so it can represent 128 different characters. The first 32 of them are reserved as control characters (like line feed, escape, or end of transmission), and so is the very last one (DEL); that leaves 95 printable characters. These are the 26 letters of the basic English alphabet in upper and lower case, the digits 0..9, the space, and a few essential punctuation marks. That's it. No diacritics, no umlauts, no Greek or Cyrillic letters.
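To make that tangible, here's a quick sketch in Python (the language is just my choice for demonstration):

```python
# Every ASCII character fits in one byte with the top bit clear.
text = "Hello, World!"
data = text.encode("ascii")        # one byte per character
assert len(data) == len(text)
assert all(b < 128 for b in data)  # the top bit is never set

# Characters outside that small set simply have no ASCII code:
try:
    "désolé".encode("ascii")
except UnicodeEncodeError:
    print("no ASCII code for 'é'")
```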
In the 80's, IT vendors began to use the yet unused topmost bit and could then represent 256 different characters with a single byte. They kept the lower half as defined by ASCII, but the additional 128 characters were a mess for many years, because they were assigned differently on almost every computer and in almost every piece of software.
Over time, several specifications came up to standardize the whole set of 256 characters. However, people from different parts of the world, speaking many different languages, each found that a different set of characters was necessary; that's why there are several different 8-bit encodings. Today, the most important one (at least in the Western world) is the ISO-8859-x family, with the -x denoting a number of variants, but the majority of characters is the same across all ISO-8859 encodings.
But still, the ISO-8859-x series didn't contain the full set of characters used even within Europe, let alone Asia; hundreds of other characters were actually needed. As a solution to that Babylonian mix of character sets and encodings, experts all around the world created the Unicode character set, which contains a standardized (and still growing) set of many thousands of characters. Obviously, those can no longer be expressed with a single byte, but the designers came up with a clever compromise: they invented UTF-8 as one of the standard encodings for Unicode.
UTF-8 represents a character with a variable number of bytes. The 128 ASCII characters are stored as single bytes, as they always were. All characters beyond ASCII need 2, 3 or even 4 bytes. The letter 'ä' for example (used in German) is stored as a two byte sequence 0xC3, 0xA4.
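You can see this variable-length behavior directly in Python (again, just my choice of language for illustration):

```python
# ASCII characters still take a single byte in UTF-8 ...
assert "a".encode("utf-8") == b"a"

# ... but 'ä' becomes the two-byte sequence 0xC3 0xA4:
assert "ä".encode("utf-8") == b"\xc3\xa4"

# Characters further up in Unicode need three or even four bytes:
print(len("€".encode("utf-8")))   # 3
print(len("😀".encode("utf-8")))  # 4
```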
However, if a program has to display text containing this character, it has to know that this sequence is supposed to be one UTF-8 character. If it doesn't, and assumes a traditional 8-bit encoding like ISO-8859-1 instead, it won't display "Mädchen" [German for "girl"] as intended, but "MÃ¤dchen": it takes the two-byte sequence as two separate characters. A reader who's familiar with the language can still guess what it means, but it looks like garbage. Like the example in your first post.
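This mismatch is easy to reproduce in Python:

```python
# Store "Mädchen" as UTF-8 bytes ...
data = "Mädchen".encode("utf-8")   # b'M\xc3\xa4dchen'

# ... then read them back with the wrong assumption (ISO-8859-1):
print(data.decode("iso-8859-1"))   # MÃ¤dchen
```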
On the other hand, there are byte sequences that are not valid UTF-8. If a program expects UTF-8 and encounters such an invalid sequence, it displays a replacement character, often rendered as a question mark. That's what happened in your example above.
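The opposite direction, again sketched in Python:

```python
# "désolé" stored in ISO-8859-1 uses the single byte 0xE9 for 'é',
# which is not a valid UTF-8 sequence:
data = "désolé".encode("iso-8859-1")

# A strict UTF-8 decoder rejects it ...
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 sequence")

# ... and a forgiving one substitutes a replacement character:
print(data.decode("utf-8", errors="replace"))  # d�sol�
```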
[END: Character encoding basics]
Quote:
Originally Posted by k_kush
How to identify if I am using the UTF-8 encoding?
|
If that web application is an established standard in your company, you won't want to change anything about it - or maybe you're not even allowed to.
Instead, you have to find out which encoding is supposed to be used with this application - there should be people who know, or documentation that tells you. Then you have to configure your own tools accordingly: probably your browser, probably your text editor. If all programs involved in the process use the same encoding, there's no problem, even if one of them might "forget" to tell the others about it.
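As a starting point for finding out what your own system assumes, a small Python check (the actual values depend on your system's locale settings, so the outputs shown are only examples):

```python
import locale
import sys

# The encoding Python assumes for the system as a whole:
print(locale.getpreferredencoding())  # e.g. UTF-8

# The encoding used for terminal output:
print(sys.stdout.encoding)            # e.g. utf-8
```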
Doc CPU