LinuxQuestions.org
Old 09-20-2012, 09:45 AM   #1
timi19
LQ Newbie
 
Registered: Sep 2012
Posts: 4

Rep: Reputation: Disabled
Converting Files to UTF-8


Hello Community

I'm currently writing a small script for a task. I want to convert all the files it finds to UTF-8. The problem is that I don't know the original charset. Sometimes it is binary, ascii or unknown :s
I've already done some research and found the tool recode, but it doesn't do what I want. I could write a big if/else-if chain, but I thought there must be another way. "file -bi" shows me that the original charset is "text/plain; charset=unknown-8bit". I have about 200 files that I want to convert :s

Here's my code:

Code:
#!/bin/bash
 output=`find . -name "language.??.properties"`
 text=" will be changed to utf-8"
 for file in $output
 do
 charset=`file -bi $file`
 if [[ $charset != *utf-8* ]] 
   then 
   echo $file /// [$charset] $text
   
   #logic
 fi
 done
I hope you can help me.
 
Old 09-20-2012, 10:01 AM   #2
tc_
LQ Newbie
 
Registered: Sep 2010
Location: Germany
Distribution: Slackware
Posts: 28

Rep: Reputation: 30
iconv does the job.

Code:
iconv -t utf8 OLDFILE > NEWFILE
 
Old 09-20-2012, 10:02 AM   #3
timi19
LQ Newbie
 
Registered: Sep 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thank you for your answer!
But how can I save to the same file?
 
Old 09-20-2012, 10:07 AM   #4
tc_
LQ Newbie
 
Registered: Sep 2010
Location: Germany
Distribution: Slackware
Posts: 28

Rep: Reputation: 30
Quote:
Originally Posted by timi19
Thank you for your answer!
But how can I save to the same file?
Code:
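# ${files} is split on whitespace, so this assumes file names without spaces
for f in ${files}; do
    # Replace the original only if iconv succeeded
    iconv -t utf8 "${f}" > /tmp/utf8conversion &&
        mv /tmp/utf8conversion "${f}"
done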
However, why not keep backups until you know that everything worked well?
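A minimal sketch of the backup idea, assuming the source charset is already known and passed in by the caller. The function name convert_with_backup and the .orig suffix are only illustrative, not something from this thread.

```bash
#!/bin/bash
# Hypothetical helper: convert one file to UTF-8 in place, keeping a backup.
# Usage: convert_with_backup SOURCE_CHARSET FILE
convert_with_backup() {
    local from=$1 file=$2 tmp status=0
    # Work on a temporary copy so a failed conversion never touches FILE
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    # Convert, then back up the original as FILE.orig, then write the result
    # back with cat so FILE keeps its permissions
    iconv -f "$from" -t UTF-8 "$file" > "$tmp" &&
        cp -p -- "$file" "${file}.orig" &&
        cat -- "$tmp" > "$file" || status=1
    rm -f -- "$tmp"
    return "$status"
}
```

It would be called as, for example, convert_with_backup ISO-8859-1 language.de.properties, and the .orig files can be deleted once the results have been checked.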
 
Old 09-20-2012, 10:08 AM   #5
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
You cannot convert if you don't know the original charset.
If you knew, you could do something like this:

Code:
iconv -f ISO-8859-1 -t UTF-8 "$NAME" >"$NAME.new" &&
mv "$NAME" "$NAME.old" &&
mv "$NAME.new" "$NAME"
 
Old 09-20-2012, 10:23 AM   #6
tc_
LQ Newbie
 
Registered: Sep 2010
Location: Germany
Distribution: Slackware
Posts: 28

Rep: Reputation: 30
Quote:
Originally Posted by NevemTeve
You cannot convert if you don't know the original charset.
If you knew, you could do something like this:

Code:
iconv -f ISO-8859-1 -t UTF-8 "$NAME" >"$NAME.new" &&
mv "$NAME" "$NAME.old" &&
mv "$NAME.new" "$NAME"
Granted. You're right. However, I think
Code:
iconv -f "$( file -b --mime-encoding FILE )" -t utf-8 FILE > FILE.NEW
should work.
 
Old 09-20-2012, 10:30 AM   #7
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
(Don't bet big sums on its guesses... for any iso-8859-x it will return iso-8859-1, for cp-85x, unknown-8bit)
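This caution can be built into a script by refusing to convert when the detector gives a non-answer. A small sketch, assuming the label comes from "file -b --mime-encoding"; the function name is_usable_charset is made up for illustration, and a label that passes this check can still be a wrong guess.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if LABEL is worth passing to iconv -f.
# Usage: is_usable_charset LABEL
is_usable_charset() {
    case $1 in
        ''|binary|unknown-8bit) return 1 ;;   # non-answers reported by file(1)
        *) return 0 ;;
    esac
}

# Example use (not run here):
#   enc=$(file -b --mime-encoding "$f")
#   if is_usable_charset "$enc"; then
#       iconv -f "$enc" -t UTF-8 "$f" > "$f.new"
#   fi
```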
 
Old 09-20-2012, 12:00 PM   #8
tc_
LQ Newbie
 
Registered: Sep 2010
Location: Germany
Distribution: Slackware
Posts: 28

Rep: Reputation: 30
Quote:
Originally Posted by NevemTeve
(Don't bet big sums on its guesses... for any iso-8859-x it will return iso-8859-1, for cp-85x, unknown-8bit)
All right, I'll be quiet.
 
Old 09-21-2012, 01:25 AM   #9
timi19
LQ Newbie
 
Registered: Sep 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thank you for your solutions. But what can I do now? :s
NevemTeve said I have to know the original charset, but I don't know it, or rather it's not always the same, and there are also some files which are unknown-8bit. What should I do with them?

Thank you for your help.
 
Old 09-21-2012, 02:25 AM   #10
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
Basically, if you don't know the encoding, then you don't know the encoding. You might guess by reading the files one-by-one, or you might simply assume something (say iso-8859-1).
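One way to script the "simply assume something" option is to leave files that already decode as UTF-8 alone and treat everything else as the assumed charset. This is a sketch under that assumption only: the function name to_utf8_assuming and the ISO-8859-1 default are illustrative, and a file that is really in some other 8-bit charset will be converted wrongly without any error.

```bash
#!/bin/bash
# Hypothetical helper: print FILE as UTF-8 on stdout.
# Usage: to_utf8_assuming FILE [ASSUMED_CHARSET]
to_utf8_assuming() {
    local file=$1 assumed=${2:-ISO-8859-1}
    # iconv exits non-zero on invalid input, so a UTF-8 to UTF-8 pass
    # works as a validity check
    if iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
        cat -- "$file"                          # already valid UTF-8 (plain ASCII included)
    else
        iconv -f "$assumed" -t UTF-8 "$file"    # the guess; nothing here verifies it
    fi
}
```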
 
Old 09-21-2012, 08:55 AM   #11
timi19
LQ Newbie
 
Registered: Sep 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hello

I found out that one of the unknown files is ISO-8859-2, so I tried to convert it with iconv, but it doesn't convert the way I want. I found out the original charset with
Code:
chardet < language.en.properties
The file contains German umlauts like äüö, but I can't see them because of the wrong charset. Notepad++ says that the file is ANSI. And when I try to convert it, the special German characters just turn into question marks.

I will try a hex editor now. Are there any other possibilities?
 
Old 09-21-2012, 11:33 AM   #12
Hidden Windshield
Member
 
Registered: Jul 2010
Distribution: Fedora
Posts: 68

Rep: Reputation: 27
If the file isn't converting properly, then you have the wrong charset. The "chardet" program uses a heuristic analysis algorithm, which is a fancy way of saying that it might be wrong. If Notepad++ says the file is ANSI, and Notepad++ can display the file correctly, then the file is probably in ANSI. iconv understands several different versions of ANSI, so just try them all until you find the right one. You can find a list of charsets by running "iconv -l".
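The "try them all" step can be semi-automated by printing a sample of the file under each candidate and picking the one that reads correctly. A rough sketch: the default candidate list is my own guess at plausible encodings for German text, and the function name preview_charsets is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: show the first non-ASCII lines of FILE decoded
# under each candidate charset.
# Usage: preview_charsets FILE [CHARSET...]
preview_charsets() {
    local file=$1 cs
    shift
    # Fall back to an assumed candidate list when none is given
    [ "$#" -gt 0 ] || set -- WINDOWS-1252 ISO-8859-1 ISO-8859-15 CP850
    for cs in "$@"; do
        printf '== %s ==\n' "$cs"
        # LC_ALL=C makes grep work on bytes; [^ -~] matches any byte
        # outside printable ASCII; -a stops grep treating the file as binary
        LC_ALL=C grep -a -m 3 '[^ -~]' "$file" | iconv -f "$cs" -t UTF-8 ||
            printf '(not valid as %s)\n' "$cs"
    done
}
```

Running preview_charsets language.en.properties prints a few sample lines per candidate, so the right charset is the one where the umlauts look correct.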

On second thought, Notepad++ has the capability of converting files to UTF-8, so why not just do that? You'll be done with that one file, anyway.
 
Old 09-22-2012, 08:35 AM   #13
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
The exact meaning of "ANSI" is: either ISO-8859-x or something else.
 
Old 09-23-2012, 09:35 AM   #14
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
If chardet doesn't give you a "1.0" score, then don't trust it. I've found it to be wrong more often than not. I've found uchardet to give more reliable results, but it's not perfect either. It is faster though, being binary.

In short, there's no perfectly reliable way to script something like this.

BTW, remember also that simple ascii is considered valid utf-8.


Now a couple of comments on the OP code:
Code:
output=`find . -name "language.??.properties"`
for file in $output; do
1) $(..) is highly recommended over `..`

2) Don't store lists of things in a single variable. And then, don't read them with for. Use an array, or process it with a while+read loop.

A for loop with simple globbing would also work, if you don't need to be recursive.
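For the non-recursive case, a glob-based loop might look like the sketch below. The function name list_language_files is hypothetical; the body of the loop is where the charset check would go.

```bash
#!/bin/bash
# Hypothetical helper: print the language.??.properties files directly
# inside DIR, one per line. The glob expands to one word per file, so
# names containing spaces stay intact.
# Usage: list_language_files [DIR]
list_language_files() {
    local dir=${1:-.} f
    for f in "$dir"/language.??.properties; do
        [ -e "$f" ] || continue    # with no match the pattern is left as-is; skip it
        printf '%s\n' "$f"
    done
}
```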


Code:
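#!/bin/bash

findpat="language.??.properties"
comppat='utf-8|ascii'

# uchardet may print charset names in upper case, so match case-insensitively
shopt -s nocasematch

# Read NUL-delimited names from find so any file name is handled safely
while IFS='' read -r -d '' fname; do

	charset=$( uchardet "$fname" )

	if [[ ! $charset =~ $comppat ]]; then

		echo "$fname:$charset"
		echo "Needs to be changed to utf-8"

	fi

done < <( find . -name "$findpat" -print0 )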
Untested, but should at least help you figure out what to do next.
 