LinuxQuestions.org

crisostomo_enrico 03-25-2008 04:47 PM

Converting UTF-16 files to another encoding (such as UTF-8)
 
Hi.

I received a large batch (more than 1,700) of scripts generated by Microsoft SQL Server Enterprise Manager, and I need to work on them. I think they are UTF-16 files (the encoding that Windows 2000 and later use internally for text), and on Solaris the file command just reports them as data.
Quote:

bash-3.2$ file dbo.tTransactionIncidents.TAB
dbo.tTransactionIncidents.TAB: data
That means I cannot grep or sed through them unless I re-encode them first. In vim I can :set fileencoding=utf-8 and then write the file, and that works, but there are far too many files to do by hand. I need a way to script the conversion, and I'm not aware of any tool or command (not even vim) that can do it in batch.

Do you have any suggestions?
Thanks a lot,
Enrico.
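
Before converting anything, it can help to confirm the UTF-16 guess. A minimal sketch, assuming the files start with a byte order mark (FF FE for little-endian, FE FF for big-endian); the function name has_utf16_bom is made up for this example, and a file without a BOM could still be UTF-16:

```shell
# Succeed if the first two bytes of the file look like a UTF-16 byte order mark.
# has_utf16_bom is a hypothetical helper name, not a standard command.
has_utf16_bom() {
  # Dump the first two bytes as hex and strip all whitespace, e.g. "fffe".
  bom=$(od -An -tx1 -N2 "$1" | tr -d '[:space:]')
  case "$bom" in
    fffe|feff) return 0 ;;  # little-endian or big-endian BOM
    *)         return 1 ;;  # no BOM found
  esac
}
```

For example: has_utf16_bom dbo.tTransactionIncidents.TAB && echo "looks like UTF-16"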

bulliver 03-25-2008 06:04 PM

Code:

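#!/usr/bin/ruby

require 'iconv'
ic = Iconv.new("ASCII", "UTF-16LE") # replace 'ASCII' with 'UTF-8' if you prefer

ARGV.each do |file|
  # Read the whole file as raw bytes. Splitting UTF-16 text on the newline
  # byte would cut the two-byte newline in half and misalign every later line.
  data = File.open(file, "rb") { |f| f.read }
  # Convert in one pass and write the result next to the original.
  File.open("#{file}.out", "wb") do |out_file|
    out_file.write(ic.iconv(data))
  end
end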

Note: This is untested. It re-encodes every input file to ASCII and writes the result to "original_name.out".
You will need to use shell globbing or find/xargs to pass it all of your file names.

HTH

Edit:

You can skip the middleman. Ruby's iconv is just a wrapper around the iconv C library, which also comes with a command-line utility of the same name. Have a look at 'man iconv'.

jlliagre 03-25-2008 06:20 PM

Or simpler:
Code:

iconv -f UTF-16 -t UTF-8 file
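
That command writes the converted text to standard output, so for many files it needs a loop and a redirect. A minimal sketch, assuming the scripts all end in .TAB and sit in one directory; the function name convert_utf16_dir and the .utf8 output suffix are made up for this example:

```shell
# Convert every *.TAB file in a directory from UTF-16 to UTF-8.
# convert_utf16_dir and the ".utf8" suffix are hypothetical choices.
convert_utf16_dir() {
  dir=${1:-.}                      # default to the current directory
  for f in "$dir"/*.TAB; do
    [ -f "$f" ] || continue        # skip if the pattern matched nothing
    if iconv -f UTF-16 -t UTF-8 "$f" > "$f.utf8"; then
      echo "converted: $f"
    else
      echo "failed: $f" >&2
      rm -f "$f.utf8"              # do not leave a partial output file behind
    fi
  done
}
```

Call it as: convert_utf16_dir /path/to/scripts
The originals are left untouched, so the results can be checked with grep before anything is replaced.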

crisostomo_enrico 03-25-2008 06:30 PM

Thank you very much, to both of you, it works!

Enrico.

