LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 08-08-2014, 01:06 AM   #1
Roben
LQ Newbie
 
Registered: May 2013
Posts: 6

Rep: Reputation: Disabled
{Converting the encoding} How to avoid using the file name more than once?


This command changes the encoding of a file:
Code:
iconv -f Windows-1252 -t utf-8  abc.srt > OUTPUT.srt
Using `file -bi abc.srt | sed 's/.*=//'` instead of Windows-1252 makes the command more general:
Code:
iconv -f `file -bi abc.srt | sed 's/.*=//'` -t utf-8  abc.srt > OUTPUT.srt
but the file name is used twice!
1. How can I reduce that to once?
2. How can I overwrite the input file with the output?
3. `file -bi` only guesses the text encoding, so it can be wrong. Is there a better replacement?
 
Old 08-08-2014, 01:19 AM   #2
sag47
Senior Member
 
Registered: Sep 2009
Location: Orange County, CA
Distribution: Kubuntu x64, Raspbian, CentOS
Posts: 1,860
Blog Entries: 36

Rep: Reputation: 458
It's not really clear what you're trying to do. Also, I almost missed what you were trying to do in the command substitution. You should really use $() instead of backticks (`) for command substitution.

What are you trying to do? I'm guessing you're trying to detect the encoding of the current file and convert it to utf-8. I'd rather avoid guessing. Be explicit and please provide the sample contents of abc.srt.

EDIT:

I think I see what you're trying to do. Create a small function instead and add it to one of your RC files (e.g. ~/.bashrc).

Code:
function toutf8() { 
  iconv -f "$(file -bi "$1" | sed 's/^.*=//')" -t "utf-8" "$1"
}
Then you can execute it on the command line like so...

Code:
toutf8 abc.srt > OUTPUT.srt
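
For question 2 (overwriting the input file), redirecting straight back to abc.srt would truncate the file before iconv gets to read it. A minimal sketch of one common workaround is to write to a temporary file and replace the original only if the conversion succeeded. The name toutf8_inplace is made up for illustration, and the charset reported by file is still only a guess:

```shell
# Hypothetical helper, not from the thread: convert "$1" to UTF-8 in place.
toutf8_inplace() {
  local src="$1"
  local from tmp
  # Same detection as toutf8 above; remember that file only guesses.
  from=$(file -bi "$src" | sed 's/^.*=//') || return 1
  [ -n "$from" ] || return 1
  # Create the temporary file next to the source so mv stays on one filesystem.
  tmp=$(mktemp "$src.XXXXXX") || return 1
  if iconv -f "$from" -t utf-8 "$src" > "$tmp"; then
    mv -- "$tmp" "$src"
  else
    # Conversion failed: leave the original untouched.
    rm -f -- "$tmp"
    return 1
  fi
}
```

Note that the replaced file takes on the temporary file's permissions, so you may want to restore them afterwards.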

Last edited by sag47; 08-08-2014 at 01:30 AM.
 
Old 08-08-2014, 02:48 AM   #3
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 3,762

Rep: Reputation: 1229
@OP: you can store the file name in a variable with no problem. The real problem is that the encoding cannot be detected programmatically. Full stop. You can check whether the file is valid UTF-8, but even if it is, it could still be ISO-8859-x (with no telling which x between 1 and 16).
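
The validity check mentioned here can be done with iconv itself: converting from UTF-8 to UTF-8 fails on the first invalid byte sequence. A minimal sketch follows; the function name is_valid_utf8 is made up for illustration. As the post says, success only means the bytes are well-formed UTF-8, not that the file was written as UTF-8 (pure ASCII text, for instance, passes the check and is equally valid ISO-8859-x):

```shell
# Hypothetical helper: succeed if "$1" is well-formed UTF-8.
is_valid_utf8() {
  # iconv exits non-zero when it meets an invalid input sequence;
  # the converted text and any error message are discarded.
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Possible use:
#   if is_valid_utf8 abc.srt; then echo "well-formed UTF-8"; fi
```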
 
Old 08-08-2014, 03:46 AM   #4
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 12,605

Rep: Reputation: 3939
You probably need variables and bash parameter expansion, but I'm not really sure what you want to achieve.
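
A minimal sketch of what storing the name in a variable and using parameter expansion could look like. The file name is written only once; the output name abc.utf8.srt and the variable names are purely illustrative, and the sample output of file -bi is assumed rather than taken from a real run. The script prints the resulting command instead of running it:

```shell
f=abc.srt                      # the only place the file name is written

# ${f%.srt} strips the shortest trailing match of ".srt".
out="${f%.srt}.utf8.srt"

# ${info##*charset=} strips the longest leading match of "*charset=",
# which does the same job as the sed call in the original command.
info='text/plain; charset=iso-8859-1'   # assumed sample of: file -bi "$f"
charset="${info##*charset=}"

# Show the command that would be run.
echo "iconv -f $charset -t utf-8 $f > $out"
```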
 
Old 08-14-2014, 10:19 AM   #5
tize
LQ Newbie
 
Registered: Aug 2014
Posts: 2

Rep: Reputation: Disabled
Quote:
Originally Posted by sag47
Code:
function toutf8() { 
  iconv -f "$(file -bi "$1" | sed 's/^.*=//')" -t "utf-8" "$1"
}
Also:
Code:
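function utf8er() {
  # Without -t, iconv converts to the current locale's encoding, so name the target explicitly.
  local x
  x=$(file -bi "$1"); iconv -f "${x##*charset=}" -t utf-8 "$1"
}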
However, "file" cannot recognize the encoding; it just guesses!
If you want this to be general-purpose code, you need to work on it some more so that it stops converting, with an error message, when it cannot recognize the encoding.
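
One way to add that check, as a rough sketch: refuse to convert when file reports something that is not a real encoding name. This assumes file prints binary or unknown-8bit as the charset in that case, which may differ between versions and systems; the name utf8er_strict is invented here:

```shell
# Hypothetical stricter variant of utf8er: fail instead of converting blindly.
utf8er_strict() {
  local x charset
  x=$(file -bi "$1") || return 1
  # Only trust the output if it actually contains a charset field.
  case "$x" in
    *charset=*) charset="${x##*charset=}" ;;
    *) charset="" ;;
  esac
  # Assumed markers for "no usable guess"; adjust for your version of file.
  case "$charset" in
    binary|unknown-8bit|"")
      echo "utf8er_strict: cannot recognize encoding of $1" >&2
      return 1
      ;;
  esac
  iconv -f "$charset" -t utf-8 "$1"
}
```

Even when this succeeds, the charset is still a guess, so it is worth looking at the result before deleting the original.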

Last edited by tize; 08-14-2014 at 11:51 AM.
 
Old 08-14-2014, 02:51 PM   #6
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 427
Hi.

Guessing the encoding is a special task requiring special tools. To convert Russian text in an unknown encoding to Unicode, I usually use konwert:

Code:
konwert any/ru-utf8 inputfile > outfile
There are other tools too, e.g. ICU. Besides the encoding, ICU also gives you a confidence value for the detected encoding.

None of these tools is perfect.
 
Old 08-14-2014, 03:28 PM   #7
Roben
LQ Newbie
 
Registered: May 2013
Posts: 6

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by firstfire
None of these tools is perfect.
In your experience, which one gives more accurate results: file, ICU, or konwert?

Unfortunately, konwert supports only these languages: cs (Czech), de (German), el (Greek), eo (Esperanto), es (Spanish), fr (French), he (Hebrew), it (Italian), pl (Polish), pt (Portuguese), ru (Russian), and sv (Swedish).

Last edited by Roben; 08-14-2014 at 03:44 PM.
 
Old 08-15-2014, 04:53 AM   #8
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 427
Hi.

konwert is a command-line tool and it works great for me. I just tried file -bi and it gives the wrong encoding on my test input. I don't know of any command-line tool based on ICU (which is a library), so you would probably have to write one yourself. But my experience using it from Node.js was generally positive. At least you know when ICU has failed to find the encoding.

I just found another tool in the Ubuntu repos called uchardet. It works (detects the encoding) fine on my test input.
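
Since no detector is always right, a cheap sanity check is to print what each installed tool guesses and compare the answers. A small sketch, assuming uchardet takes a file name and prints a single charset name, and using the --mime-encoding option of file; the function name guess_encodings is made up:

```shell
# Hypothetical helper: print each available detector's guess for "$1".
guess_encodings() {
  # Skip any tool that is not installed.
  if command -v file > /dev/null 2>&1; then
    printf 'file:     %s\n' "$(file -b --mime-encoding "$1")"
  fi
  if command -v uchardet > /dev/null 2>&1; then
    printf 'uchardet: %s\n' "$(uchardet "$1")"
  fi
}

# Possible use:
#   guess_encodings abc.srt
```

If the tools disagree, that is a good hint to inspect the file by hand before converting it.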
 
  



Tags
encoding, subtitle

