LinuxQuestions.org


lonesoac0 12-06-2014 12:00 PM

Strange characters in a file
 
1 Attachment(s)
Hello all,

I recently downloaded a dataset from http://www.omdbapi.com/. Some characters in the data show up as a question mark on a black background, as if they cannot be read. This is what I see when I run more FILE_NAME.txt. I have also attached a screenshot.

Ser Olmy 12-06-2014 01:42 PM

The text file was created/saved on a system that uses a different character encoding than the one your computer/application expects.

Only US-ASCII codes are reasonably universal; these include the letters A-Z and a-z, the digits 0-9, basic punctuation and a small selection of special characters like the dollar, hash and percent signs, some very basic mathematical symbols and so on. Other characters are considered "special", and various encoding schemes exist to handle various types of "extended" character sets.
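
To check whether a file actually contains anything beyond US-ASCII, one option is to look for bytes with a value of 128 or higher. A minimal Python sketch; the function name and the sample bytes are made up for illustration and are not taken from the poster's file:

```python
def non_ascii_bytes(data: bytes):
    """Return (offset, byte value) pairs for every byte outside 7-bit ASCII."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]

# Hypothetical sample: "Amélie" saved as ISO-8859-1, where é is the single byte 0xE9.
sample = b"Am\xe9lie"
print(non_ascii_bytes(sample))  # [(2, 233)]
```

For a real file, read it in binary mode first, e.g. open("FILE_NAME.txt", "rb").read(), and pass the result to the function. An empty list means the file is pure ASCII and should display the same everywhere.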

If there's a mismatch between the encoding schemes used by a sender and a recipient of data, any "extended" codes may be interpreted incorrectly. In your case, accented characters aren't displayed properly. This is a very common problem with "pure" text files, since they lack any sort of header that identifies the character set encoding scheme being used.
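
The mismatch can be reproduced in a few lines of Python. This is only a sketch under an assumption: it supposes the sender used ISO-8859-1 and the recipient expects UTF-8, which may or may not be the situation with this particular dataset.

```python
# Sender side: "é" stored as ISO-8859-1 (Latin-1) is the single byte 0xE9.
raw = "Amélie".encode("latin-1")

# Recipient side: a UTF-8 reader cannot make sense of a lone 0xE9 byte.
# errors="replace" substitutes U+FFFD, the replacement character, which many
# terminals draw as a question mark on a dark background.
shown = raw.decode("utf-8", errors="replace")
print(shown)

# The opposite mismatch (UTF-8 data read as Latin-1) gives two wrong
# characters instead of one placeholder.
mojibake = "Amélie".encode("utf-8").decode("latin-1")
print(mojibake)  # AmÃ©lie
```

Either symptom points to the same cause: the bytes are fine, but they are being read with the wrong encoding.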

If you can figure out which encoding scheme was used to create the file, you can convert it to the encoding scheme you're using with the iconv command.
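
On the command line that would look something like iconv -f ISO-8859-1 -t UTF-8 FILE_NAME.txt > converted.txt, where ISO-8859-1 is only a guess at the source encoding and converted.txt is a made-up output name. The same conversion can be sketched in Python; the helper name below is hypothetical:

```python
def convert_encoding(data: bytes, source_encoding: str,
                     target_encoding: str = "utf-8") -> bytes:
    """Decode bytes using the source encoding, then re-encode them."""
    text = data.decode(source_encoding)  # raises UnicodeDecodeError on a wrong guess
    return text.encode(target_encoding)

# Hypothetical input: "Amélie" as ISO-8859-1 bytes.
latin1_bytes = b"Am\xe9lie"
utf8_bytes = convert_encoding(latin1_bytes, "iso-8859-1")
print(utf8_bytes)  # b'Am\xc3\xa9lie'
```

Note that single-byte encodings such as ISO-8859-1 accept every byte value, so a wrong guess will not raise an error; it will just produce the wrong characters. It is worth checking a few known words in the output by eye.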

jpollard 12-06-2014 03:24 PM

As above...

but in addition, this is common when the data originates on a Windows system. Microsoft software tends to generate/use some not-quite-standard character sets. In at least one instance such screwups involved having a parity bit set on the apostrophe character... thus showing up as a ? instead.

In your specific case, it does look a bit more like just a different character font, but it could just be some Windows software with the not-quite-standard characters.
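
One well-known example of a Windows character set behaving this way is Windows-1252 (cp1252), which puts a curly apostrophe at byte 0x92. The Python sketch below shows how that one byte reads under three encodings. It is an illustration only and assumes nothing about what is actually in the poster's file.

```python
# 0x92 is the right single quotation mark (U+2019) in Windows-1252.
raw = b"It\x92s"

as_cp1252 = raw.decode("cp1252")                  # the intended curly apostrophe
as_utf8 = raw.decode("utf-8", errors="replace")   # U+FFFD placeholder instead
as_latin1 = raw.decode("latin-1")                 # an invisible C1 control character

print(repr(as_cp1252), repr(as_utf8), repr(as_latin1))
```

If the placeholders in the file sit where apostrophes or quotation marks would be expected, cp1252 is a reasonable source encoding to try first.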


All times are GMT -5.