Converting Files to UTF-8

timi19 · 09-20-2012, 09:45 AM

Hello Community

I'm currently writing a small script for a task. I want to convert all the files it finds to UTF-8. The problem is that I don't know the original charset. Sometimes it is binary, ASCII, or unknown :s
I've already done some research and found the tool recode, but it doesn't do what I want. I could write a big if/else-if chain, but I thought there must be another way. "file -bi" shows me that the original charset is "text/plain; charset=unknown-8bit". I have about 200 files that I want to convert :s

Here's my code:

Code:

#!/bin/bash
output=`find . -name "language.??.properties"`
text="will be changed to utf-8"
for file in $output
do
    charset=`file -bi "$file"`
    if [[ $charset != *utf-8* ]]
    then
        echo "$file /// [$charset] $text"

        #logic
    fi
done

I hope you can help me

tc_ · 09-20-2012, 10:01 AM

iconv does the job.

Code:

iconv -t utf8 OLDFILE > NEWFILE

timi19 · 09-20-2012, 10:02 AM

Thank you for your answer!
But how can I save to the same file?

tc_ · 09-20-2012, 10:07 AM

Code:

for f in ${files}; do
    # Only replace the original if iconv succeeded.
    iconv -t utf8 "${f}" > /tmp/utf8conversion &&
    mv /tmp/utf8conversion "${f}"
done

However, why not keep backups until you know that everything worked well?
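
The backup idea can be sketched as a small function. This is only a minimal sketch: convert_in_place is a hypothetical helper name, and it assumes the source charset is passed in explicitly rather than guessed.

```shell
#!/bin/bash
# Sketch: convert one file to UTF-8 in place and keep the original as FILE.bak.
# Assumptions: convert_in_place is a hypothetical name; the caller knows the
# source charset and passes it as the first argument.
convert_in_place() {
    local from=$1 file=$2 tmp status=1
    tmp=$(mktemp) || return 1
    # Convert into a temp file first, then back up the original.
    if iconv -f "$from" -t UTF-8 "$file" > "$tmp" && cp -p "$file" "$file.bak"; then
        # Overwrite the contents only, so the file keeps its permissions.
        cat "$tmp" > "$file"
        status=$?
    fi
    rm -f "$tmp"
    return "$status"
}
```

It could be called as: convert_in_place ISO-8859-1 language.de.properties. If iconv fails, the original file is left untouched and no backup is written.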

NevemTeve · 09-20-2012, 10:08 AM

You cannot convert if you don't know the original charset.
If you knew it, you could do something like this:

Code:

iconv -f ISO-8859-1 -t UTF-8 "$NAME" >"$NAME.new" &&
mv "$NAME" "$NAME.old" &&
mv "$NAME.new" "$NAME"

tc_ · 09-20-2012, 10:23 AM

Granted. You're right. However, I think

Code:

iconv -f $( file -b --mime-encoding FILE ) -t utf-8 FILE > FILE.NEW

should work.

NevemTeve · 09-20-2012, 10:30 AM

(Don't bet big sums on its guesses... for any iso-8859-x it will return iso-8859-1, for cp-85x, unknown-8bit)

tc_ · 09-20-2012, 12:00 PM

All right, I'll be quiet.

timi19 · 09-21-2012, 01:25 AM

Thank you for your solutions. But what can I do now? :s
NevemTeve said I have to know the original charset, but I don't know it; it's not always the same, and there are also some files which are unknown-8bit. What should I do with them?

Thank you for your help

NevemTeve · 09-21-2012, 02:25 AM

Basically, if you don't know the encoding, then you don't know the encoding. You might guess by reading the files one-by-one, or you might simply assume something (say iso-8859-1).

timi19 · 09-21-2012, 08:55 AM

Hello

I found out that one of the unknown files is ISO-8859-2, so I tried to convert it with iconv, but it doesn't convert the way I want. I determined the original charset with

Code:

chardet < language.en.properties

The file contains German umlauts like äüö, but I can't see them because of the wrong charset. Notepad++ says that the file is ANSI. And when I try to convert it, the special German characters just turn into question marks.

I will try a hex editor now. Are there other possibilities as well?

Hidden Windshield · 09-21-2012, 11:33 AM

If the file isn't converting properly, then you have the wrong charset. The "chardet" program uses a heuristic analysis algorithm, which is a fancy way of saying that it might be wrong. If Notepad++ says the file is ANSI, and Notepad++ can display the file correctly, then the file is probably in ANSI. iconv understands several different versions of ANSI, so just try them all until you find the right one. You can find a list of charsets by running "iconv -l".

On second thought, Notepad++ has the capability of converting files to UTF-8, so why not just do that? You'll be done with that one file, anyway.
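
The "try them all" approach could look roughly like the sketch below. try_charsets is a hypothetical helper name, and the candidate charsets in the usage line are only examples; use names that "iconv -l" actually lists on your system.

```shell
#!/bin/bash
# Sketch: print the first lines of a file as decoded under each candidate
# charset, so a human can pick the one whose umlauts look right.
# Assumption: try_charsets is a hypothetical name, not an existing tool.
try_charsets() {
    local file=$1 cs out
    shift
    for cs in "$@"; do
        printf '== %s ==\n' "$cs"
        if out=$(iconv -f "$cs" -t UTF-8 "$file" 2>/dev/null); then
            printf '%s\n' "$out" | head -n 5
        else
            echo "(conversion failed)"
        fi
    done
}

# Example call (candidate list is a guess for German text):
# try_charsets language.de.properties ISO-8859-1 ISO-8859-2 WINDOWS-1252 CP850
```

A conversion that succeeds is not proof of the right charset. Most 8-bit charsets accept any byte sequence, so the output still has to be checked by eye.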

NevemTeve · 09-22-2012, 08:35 AM

The exact meaning of "ANSI" is: either ISO-8859-x or something else.

David the H. · 09-23-2012, 09:35 AM

If chardet doesn't give you a "1.0" score, then don't trust it. I've found it to be wrong more often than not. I've found uchardet to give more reliable results, but it's not perfect either. It is faster though, being binary.

In short, there's no perfectly reliable way to script something like this.

BTW, remember also that plain ASCII is considered valid UTF-8.
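
Because of that, one check that does not depend on guessing is whether a file already decodes as UTF-8. A minimal sketch, assuming an iconv that rejects invalid input sequences (glibc's does); is_valid_utf8 is a hypothetical name:

```shell
#!/bin/bash
# Sketch: succeed if the file is already valid UTF-8 (plain ASCII included).
# Assumption: is_valid_utf8 is a hypothetical helper; it relies on iconv
# exiting non-zero when it meets an invalid UTF-8 byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

Files that pass this test can be skipped; only the ones that fail need a source charset to be worked out.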

Now a couple of comments on the OP code:

Code:

output=`find . -name "language.??.properties"`
for file in $output; do

1) $(..) is highly recommended over `..`

2) Don't store lists of things in a single variable. And then, don't read them with for. Use an array, or process it with a while+read loop.

A for loop with simple globbing would also work, if you don't need to be recursive.

Code:

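#!/bin/bash

findpat="language.??.properties"
comppat='utf-8|ascii'

# Read NUL-delimited names from find so any filename is handled safely.
while IFS='' read -r -d '' fname; do

	charset=$( uchardet "$fname" )

	# Lowercase the result first (bash 4+), since uchardet may report names in upper case.
	if [[ ! ${charset,,} =~ $comppat ]]; then

		echo "$fname:$charset"
		echo "Needs to be changed to utf-8"

	fi

done < <( find . -name "$findpat" -print0 )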

Untested, but should at least help you figure out what to do next.
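
For the non-recursive case, the globbing alternative mentioned above could be sketched like this. list_language_files is a hypothetical name, and the function only prints the matches; the detection and conversion logic would go inside the loop.

```shell
#!/bin/bash
# Sketch: collect matching files in one directory with a glob and an array.
# Assumption: list_language_files is a hypothetical helper name.
list_language_files() {
    local dir=${1:-.} fname
    # With nullglob an unmatched pattern expands to nothing instead of itself.
    # Note that this setting stays enabled for the rest of the script.
    shopt -s nullglob
    local files=( "$dir"/language.??.properties )
    for fname in "${files[@]}"; do
        printf '%s\n' "$fname"
    done
}
```

Quoting "${files[@]}" keeps each filename intact even if it contains spaces, which the original unquoted $output loop does not.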