LinuxQuestions.org


qdinar 09-03-2010 02:20 AM

How to search files for text in a one-byte encoding? (an encoding that's not Unicode)
 
How can I search files for text that is in a one-byte encoding? In Ubuntu, GNOME's Places > Search for Files only finds UTF-8 text.

I know one way: install Wine and Total Commander, then search with that. Are there better ways?

I have also asked this at https://answers.launchpad.net/ubuntu/+question/123912 and at http://ubuntuforums.org/showthread.php?t=1564911 and in Freenode channels.

kbp 09-03-2010 11:15 PM

UTF-8 includes 1-byte encoded characters, doesn't it?

ref: http://en.wikipedia.org/wiki/UTF-8

qdinar 09-04-2010 12:49 AM

UTF-8 encodes only the basic Latin letters, digits, and some other marks such as punctuation as single bytes. There are 128 of them (the ASCII range). One-byte encodings keep most of those and add roughly 128 more single-byte characters, which are non-Latin letters such as Cyrillic, or Latin letters with diacritics. ;)
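
The byte counts described above can be checked with a short Python sketch. The sample characters and the choice of cp1251 (Windows Cyrillic) as the one-byte encoding are assumptions for illustration, not something named in this thread.

```python
# Compare how many bytes a character needs in UTF-8 versus a one-byte
# encoding. cp1251 is an assumed example of a one-byte encoding.

def byte_lengths(ch, one_byte_encoding="cp1251"):
    """Return (utf8_length, one_byte_length) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode(one_byte_encoding))

# A Latin letter, a punctuation mark, and a Cyrillic letter.
for ch in ["a", "?", "ж"]:
    utf8_len, single_len = byte_lengths(ch)
    print(repr(ch), "utf-8 bytes:", utf8_len, "cp1251 bytes:", single_len)
```

The ASCII characters take one byte in both encodings, while the Cyrillic letter takes two bytes in UTF-8 and one in cp1251.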

qdinar 07-18-2011 02:23 AM

Ubuntu's search tool cannot find text in one-byte encodings, because it tries to read the bytes as UTF-8 and fails. It can only read the Latin letters and digits (ASCII?) that are encoded identically in one-byte encodings and in UTF-8. The other (additional) 128 characters of one-byte encoded text it reads as errors or, by accident, as a random Unicode character, which in many cases is a Chinese character.
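
One way around this, sketched here in Python, is to skip decoding entirely: encode the search term into the one-byte encoding and look for those raw bytes in each file. The function names `is_valid_utf8` and `find_in_files`, and the default of cp1251, are hypothetical choices for this sketch. You need to know (or guess) the encoding the files actually use.

```python
import os

def is_valid_utf8(data):
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def find_in_files(root, text, encoding="cp1251"):
    """Yield paths under root whose raw bytes contain `text` encoded
    in `encoding`. The match is case-sensitive, and each file is read
    fully into memory, so this suits modest file sizes only."""
    needle = text.encode(encoding)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    if needle in fh.read():
                        yield path
            except OSError:
                continue  # skip files that cannot be opened or read
```

For example, `list(find_in_files("/some/dir", "привет"))` would list the files that contain that word stored as cp1251 bytes. The same bytes fail `is_valid_utf8`, which is the failure described above.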

David the H. 07-18-2011 03:44 AM

Run the file through iconv (writing to a new file) to change the encoding to UTF-8, then search that. There's a tool called chardet that can detect the likely encoding of the file.

Many of the major text editors can also autodetect the encoding, and have the ability to save the text back in a different encoding.

UTF-8 uses the same encoding as ASCII for the first 128 characters, so an ASCII-encoded file is also valid UTF-8. But characters beyond ASCII take multiple bytes.
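
The conversion step suggested above (with iconv, something like `iconv -f CP1251 -t UTF-8 old.txt > new.txt`) can also be sketched in Python. The function name `convert_to_utf8` and the cp1251 default are assumptions for illustration. The real source encoding has to come from you or from a detector such as chardet.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1251"):
    """Decode src_path from src_encoding and write it to dst_path as UTF-8.

    newline="" keeps the original line endings untouched. A byte that
    is not defined in src_encoding raises UnicodeDecodeError.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

After the conversion, tools that only understand UTF-8 can search the new file. Any ASCII-only parts come through byte for byte unchanged, as described above.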


All times are GMT -5.