LinuxQuestions.org
Old 10-10-2009, 04:24 AM   #1
wakatana
Member
 
Registered: Jul 2009
Location: Slovakia
Posts: 133

Rep: Reputation: 16
character encoding


Hi, I need to find out what encoding a file uses. I tried:
Code:
file --mime-encoding index.html 
index.html: text/html
and
Code:
file --mime-type index.html 
index.html: text/html
also
Code:
enca index.html 
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.
What am I doing wrong?
 
Old 10-10-2009, 04:53 AM   #2
lutusp
Member
 
Registered: Sep 2009
Distribution: Fedora
Posts: 835

Rep: Reputation: 102Reputation: 102
You aren't telling us what you expected to happen, or what your goal is. One problem is that many file encodings don't unambiguously reveal themselves. For example, a UTF-8 encoded file can pass as an 8-bit "extended ASCII" encoding such as ISO-8859-1, because its bytes are valid 8-bit characters there too, even though they then decode to different text. In that case, you have to tell a specific application how to interpret the file; you cannot rely on the system to tell you what it is.

A file with nothing but 7-bit characters (high bits all clear) is likely to be ASCII, but even this isn't guaranteed. Files with 8-bit characters are very likely to cause confusion unless you tell the system how to interpret them.
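The point above can be sketched with iconv, which can check whether a file decodes as UTF-8 but cannot say which 8-bit encoding was intended. This is only an illustration, not a detection method: the helper names `is_valid_utf8` and `to_utf8` and the choice of ISO-8859-1 and ISO-8859-2 are assumptions made for the example, and it presumes an iconv build that supports those encodings.

```shell
#!/bin/sh
# Succeeds (exit status 0) only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Convert a file to UTF-8, given the encoding you believe it is in.
# Usage: to_utf8 SOURCE_ENCODING FILE
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# Demo on throwaway files (octal escapes keep printf portable).
tmpdir=$(mktemp -d)
printf '\305\276\n' > "$tmpdir/utf8.txt"      # "ž" encoded as UTF-8 (bytes C5 BE)
printf '\276\n'     > "$tmpdir/onebyte.txt"   # the single byte BE

for f in "$tmpdir/utf8.txt" "$tmpdir/onebyte.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not valid UTF-8"
    fi
done

# The same byte BE decodes without error under two different 8-bit
# encodings; only the label you choose decides what character it becomes.
to_utf8 ISO-8859-1 "$tmpdir/onebyte.txt" | od -An -tx1
to_utf8 ISO-8859-2 "$tmpdir/onebyte.txt" | od -An -tx1

rm -rf "$tmpdir"
```

A failed UTF-8 check tells you the file is something else, but a successful conversion from an 8-bit encoding proves nothing, which is why the source encoding has to be supplied by you.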
 
Old 10-10-2009, 09:05 AM   #3
wakatana
Member
 
Registered: Jul 2009
Location: Slovakia
Posts: 133

Original Poster
Rep: Reputation: 16
Thank you for the reply. I had no idea that encodings are so confusing and that one encoding could pass as another.
My goal is to find out what encoding a file uses so I can process it further. At school we got an exercise to download articles from web sites, parse them, and store them in a database.
I expect that the encoding will differ between sites, so I will have to convert between them.
For downloading I will use wget, and for parsing awk and sed. This brings me to another question...
I read somewhere that regular expressions work with the ASCII table, so when I type
Code:
grep "[a-z][a-z]*" file_name
it uses the values from ASCII decimal 97 (a) to decimal 122 (z), right?
But suppose I have a file containing diacritics, say ordinary Slovak characters:
Code:
marek@cepi:~$ cat diakritika 
ťľľščťž
ŤĽĽŠČŤŽ

marek@cepi:~$ grep -o "[a-z][a-z]*" diakritika 
ťľľščť


Why does this regexp match characters with diacritics? And why does it match only the lower-case ones, and not "ž"? This seems strange to me. A friend told me it could have something to do with $LANG. My $LANG is:
Code:
marek@cepi:~$ echo $LANG
en_US.UTF-8
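A possible explanation, offered as a sketch rather than a definitive answer: outside the C locale, a range such as [a-z] is not tied to ASCII codes 97 to 122. With en_US.UTF-8 it follows the locale's collation order, so which accented letters fall inside it depends on the C library. Forcing the C locale restricts the range to plain ASCII, and the [[:lower:]] class asks for lowercase letters as the current locale defines them. The function names below are made up for the example.

```shell
#!/bin/sh
# ASCII-only runs of a-z: in the C locale a range is compared by byte value.
ascii_lower_runs() {
    LC_ALL=C grep -o '[a-z][a-z]*'
}

# Runs of lowercase letters as defined by the current locale
# (in a UTF-8 locale this is expected to include letters such as ť or ž).
locale_lower_runs() {
    grep -o '[[:lower:]][[:lower:]]*'
}

printf 'abc DEF xyz\n' | ascii_lower_runs
printf 'abc DEF xyz\n' | locale_lower_runs
```

Feeding the diakritika file to both functions should show the difference: the first is expected to ignore the accented letters, the second to pick up the lowercase line in full, assuming the file and the locale are both UTF-8.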
I would also like to ask: if I want to uppercase a file containing diacritics, I type:
Code:
marek@cepi:~$ cat diakritika | tr "[:lower:]" "[:upper:]"
ťľľščťž
ŤĽĽŠČŤŽ
Why does it not change lower case to upper case?
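A likely cause, stated with some caution: GNU tr works on single bytes and does not understand multibyte characters, so a UTF-8 letter such as ť (two bytes) passes through unchanged. One workaround is awk's toupper(). The sketch below assumes awk is gawk and the locale is UTF-8; the function name `to_upper` is made up for the example.

```shell
#!/bin/sh
# Uppercase each line with awk's toupper().
# Assumption: awk is gawk and the locale is UTF-8; other awk
# implementations may uppercase only the ASCII letters.
to_upper() {
    awk '{ print toupper($0) }' "$@"
}

printf 'abc\n' | to_upper
```

It can be used as `to_upper diakritika` or at the end of a pipeline in place of the tr command.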
Thanks a lot for any reply.
PS: I hope the characters display properly.
 
Old 10-13-2009, 11:13 AM   #4
wakatana
Member
 
Registered: Jul 2009
Location: Slovakia
Posts: 133

Original Poster
Rep: Reputation: 16
Nobody?
 
  

