battler 04-15-2012 01:42 PM

Windows 7 txt file to Linux conversion problems

I'm having a lot of trouble using a .txt file created by Microsoft Office Word in Linux.

I save the file in Word on Windows as a .txt file and select Unicode (UTF-8) as the encoding, which is also the final output format I need. I then have a conversion program on Ubuntu Linux that needs to run on this file. However, I run into difficulties because the text file shows characters like <C3><AF> when I use cat or Emacs.

I have tried almost everything: saving in different formats, converting with iconv and dos2unix, and checking the Ubuntu character encoding settings. I always end up with the same problem, characters between <>. Can someone give me the winning combination?

example line: Ze werken op de computer waarop ze ge<95>nstalleerd zijn
How it should be: Ze werken op de computer waarop ze geïnstalleerd zijn

headrift 04-15-2012 02:31 PM

It looks like the <95> is extended ASCII: the hex code for the letter you want, in the example case. I'd make sure Emacs (and your shell in general) is running in UTF-8.

Somewhere I have a script that strips diacritical marks off letters, but I'm guessing you want to keep them.

battler 04-15-2012 02:55 PM

Thanks to this chart I discovered that it is UTF-8, only in hex format. I'm still searching for a way to convert this to normal UTF-8.
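
The "UTF-8 in hex format" reading can be checked with a few lines of Python. This is only a sketch: the helper name show_high_bytes is made up for illustration, and its escaping merely imitates how a tool that does not decode UTF-8 displays the bytes.

```python
# The sample word from the first post, encoded as UTF-8.
word = "ge\u00efnstalleerd"  # \u00ef is the letter i with diaeresis
raw = word.encode("utf-8")

def show_high_bytes(data: bytes) -> str:
    """Hypothetical helper: keep ASCII as-is and show every byte >= 0x80
    as <XX>, imitating a viewer that does not decode UTF-8."""
    return "".join(chr(b) if b < 0x80 else f"<{b:02X}>" for b in data)

print(show_high_bytes(raw))   # ge<C3><AF>nstalleerd
print(raw.decode("utf-8"))    # the word with its accented letter restored
```

So the file itself can already be valid UTF-8; the <C3><AF> is just how the two bytes of one character look when nothing decodes them.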

battler 04-15-2012 03:48 PM

I solved it. It had nothing to do with the conversion programs: my locales were wrong. I changed the following file: /etc/default/locale
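
For anyone hitting the same symptom, a quick way to see whether the current environment asks for UTF-8 can be sketched in Python. The two functions below are hypothetical helpers, not part of any tool. They assume the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and that the charset is the part after the dot, as in en_US.UTF-8.

```python
import os

def effective_charset(env):
    """Return the charset implied by the locale variables, or None.

    Hypothetical helper: follows the precedence LC_ALL > LC_CTYPE > LANG
    and takes whatever follows the '.', dropping any '@modifier'.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # the first non-empty variable wins
            if "." not in value:
                return None  # e.g. "C" or "nl_NL": no explicit charset
            return value.split(".", 1)[1].split("@", 1)[0]
    return None

def is_utf8(env):
    """True if the effective locale names UTF-8 (spelled UTF-8 or utf8)."""
    charset = effective_charset(env)
    return charset is not None and charset.lower().replace("-", "") == "utf8"

# Check the environment this script runs in.
print(is_utf8(os.environ))
```

If this prints False, tools such as cat in a terminal or Emacs may show multi-byte characters as raw bytes even though the file is fine.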




John VV 04-15-2012 03:49 PM

Have you looked at "dos2unix" and "unix2dos"?
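
As a rough illustration of what such a conversion does, here is a minimal Python sketch. It is a simplification, not the actual dos2unix implementation: the function name to_unix is made up, and stripping the byte order mark is an added assumption about what Windows tools may prepend to a UTF-8 file.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def to_unix(data: bytes) -> bytes:
    """Simplified stand-in for a DOS-to-Unix text conversion."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]      # drop the byte order mark, if present
    return data.replace(b"\r\n", b"\n")  # CRLF line endings -> LF

# Illustrative input: BOM + the sample word + a Windows line ending.
sample = UTF8_BOM + "ge\u00efnstalleerd\r\n".encode("utf-8")
print(to_unix(sample))
```

Note that this only touches line endings and the BOM. The bytes of accented letters pass through unchanged, so a conversion like this cannot change how they are displayed.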

However, the easiest thing is to
NOT use Microsoft Office on Windows to save a text-only file.

MS Office is known to cause Linux, IBM Unix, and Apple Mac users all kinds of problems,
and it even causes problems for cross-platform programs ON WINDOWS.

On Windows, the best everyday plain-text editor is SciTE.

