{Converting the encoding} How to avoid using a file name more than once?

Roben · 08-08-2014, 01:06 AM

This command changes the encoding of a file:

Code:

iconv -f Windows-1252 -t utf-8  abc.srt > OUTPUT.srt

Using `file -bi abc.srt | sed 's/.*=//'` in place of the hard-coded Windows-1252 makes the command more general:

Code:

iconv -f `file -bi abc.srt | sed 's/.*=//'` -t utf-8  abc.srt > OUTPUT.srt

But the filename is used twice!
1. How can I reduce that to once?
2. How can I overwrite the input file with the output?
3. `file -bi` only guesses the text encoding, so it can be wrong. Is there a better replacement?

sag47 · 08-08-2014, 01:19 AM

It's not really clear what you're trying to do. Also, I almost missed what you were trying to do in the command substitution. You should really use $() instead of backticks (`) for command substitution.

What are you trying to do? I'm guessing you're trying to detect the encoding of the current file and convert it to utf-8. I'd rather avoid guessing. Be explicit and please provide the sample contents of abc.srt.

EDIT:

I think I see what you're trying to do. Create a small function instead and add it to one of your RC files (e.g. ~/.bashrc).

Code:

function toutf8() { 
  iconv -f "$(file -bi "$1" | sed 's/^.*=//')" -t "utf-8" "$1"
}

Then you can execute it on the command line like so...

Code:

toutf8 abc.srt > OUTPUT.srt
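For question 2 (overwriting the input file), redirecting straight back to the same name would truncate the file before iconv reads it. A minimal sketch of one way around that, assuming a temporary file is acceptable; the function name iconv_inplace and its argument order are hypothetical, not part of iconv:

```shell
# Hypothetical helper: convert FILE from encoding FROM to UTF-8 in place.
# Usage: iconv_inplace FROM FILE
iconv_inplace() {
  local from="$1" src="$2" tmp
  tmp="$(mktemp)" || return 1
  # Write to a temporary file first; touch the original only on success.
  if iconv -f "$from" -t utf-8 "$src" > "$tmp"; then
    # cat (rather than mv) keeps the original file's permissions and inode.
    cat "$tmp" > "$src"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

For example, `iconv_inplace Windows-1252 abc.srt` would rewrite abc.srt itself. If iconv fails, the original is left untouched.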

NevemTeve · 08-08-2014, 02:48 AM

@OP: you can store the file name in a variable with no problem. The real problem is that the encoding cannot be detected programmatically. Full stop. You can check whether the file is valid UTF-8, but even if it is, it could still be ISO-8859-x (with no telling which x between 1 and 16).
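The UTF-8 validity check mentioned here can be sketched with iconv itself, assuming an iconv that exits with a non-zero status on an invalid input sequence (GNU iconv does). The function name is_valid_utf8 is made up for illustration:

```shell
# Hypothetical helper: succeed only if FILE decodes cleanly as UTF-8.
is_valid_utf8() {
  # Converting UTF-8 to UTF-8 changes nothing, but iconv still has to
  # decode every byte, so it fails on the first invalid sequence.
  iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}
```

A call such as `is_valid_utf8 abc.srt && echo "already UTF-8"` only tells you the bytes form valid UTF-8. As noted, it does not prove that UTF-8 is what the author intended.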

pan64 · 08-08-2014, 03:46 AM

You probably need variables and bash parameter substitution, but I'm not really sure what you want to achieve.

tize · 08-14-2014, 10:19 AM

Quote:

Originally Posted by sag47

Code:

function toutf8() { 
  iconv -f "$(file -bi "$1" | sed 's/^.*=//')" -t "utf-8" "$1"
}

Also:

Code:

function utf8er() {
  # Strip everything up to "charset=" with parameter expansion instead of sed.
  local x; x=$(file -bi "$1"); iconv -f "${x##*charset=}" -t utf-8 "$1"
}

However, "file" cannot reliably recognize the encoding; it just guesses!
If you want this to be general-purpose code, you need to do more work so that it stops with an error message, instead of converting, when the encoding cannot be recognized.
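A rough sketch of that idea, assuming that `file -bi` reports `charset=binary` or `charset=unknown-8bit` when it gives up. The names charset_is_usable and utf8er_strict are hypothetical, and a charset that passes this check is still only a guess:

```shell
# Hypothetical helper: reject charset names that mean "file gave up".
charset_is_usable() {
  case "$1" in
    ''|binary|unknown-8bit) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical variant of utf8er that refuses to convert on an unusable guess.
utf8er_strict() {
  local x charset
  x="$(file -bi "$1")" || return 1
  charset="${x##*charset=}"
  if ! charset_is_usable "$charset"; then
    echo "utf8er_strict: cannot determine encoding of $1 (got '$charset')" >&2
    return 1
  fi
  iconv -f "$charset" -t utf-8 "$1"
}
```

Used as `utf8er_strict abc.srt > OUTPUT.srt`, it prints an error to stderr and returns a non-zero status instead of producing a garbled output file.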

firstfire · 08-14-2014, 02:51 PM

Hi.

Guessing an encoding is a special task that requires special tools. To convert Russian text in an unknown encoding to Unicode I usually use konwert:

Code:

konwert any/ru-utf8 inputfile > outfile

There are other tools too, e.g. ICU. Besides the encoding, ICU also reports a confidence value for the detected encoding.

None of these tools is perfect.

Roben · 08-14-2014, 03:28 PM

Quote:

Originally Posted by firstfire

None of these tools is perfect.

In your experience, which one gives more accurate results: file, ICU or konwert?

Unfortunately, konwert only supports these languages: cs (Czech), de (German), el (Greek), eo (Esperanto), es (Spanish), fr (French), he (Hebrew), it (Italian), pl (Polish), pt (Portuguese), ru (Russian), and sv (Swedish).

firstfire · 08-15-2014, 04:53 AM

Hi.

konwert is a command-line tool and it works great for me. I just tried file -bi and it gives the wrong encoding on my test input. I don't know of any command-line tool based on ICU (which is a library), so you would probably have to write one yourself. But my experience using it in Node.js was generally positive. At least you know when ICU has failed to find the encoding.

I just found another tool in the Ubuntu repos called uchardet. It works (detects the encoding) fine on my test input.
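As a rough illustration of how uchardet could replace `file -bi` in the earlier function, here is a sketch. It assumes uchardet is installed, that `uchardet FILE` prints a single encoding name that iconv understands, and that a failed detection shows up as an empty result or a string containing "unknown" (this may differ between versions). The function name toutf8_uchardet is hypothetical:

```shell
# Hypothetical helper: detect the encoding with uchardet, then convert to UTF-8.
toutf8_uchardet() {
  local enc
  enc="$(uchardet "$1")" || return 1
  case "$enc" in
    # Assumption: an empty or "unknown" result means detection failed.
    ''|*unknown*)
      echo "toutf8_uchardet: could not detect encoding of $1" >&2
      return 1
      ;;
  esac
  iconv -f "$enc" -t utf-8 "$1"
}
```

With this, `toutf8_uchardet abc.srt > OUTPUT.srt` names the input file only once on the command line. The result is still only as good as the detector's guess.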