Sort, Uniq and Merge Different Encoding Files

NeonFlash · 07-22-2012, 10:10 AM

Hi,

I have multiple files which I am trying to merge into one file.

Each of these files has a different encoding: some are UTF-8, others are us-ascii, binary, or unknown-8bit.

When I try this:

Code:

cat file1.txt file2.txt | sort | uniq >> output.txt

I get the following error:

Code:

sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `monic\341\r' and `seedling\r'.

After setting the LC_ALL environment variable and retrying the above command, it works.

However, now the problem is, output.txt = file1.txt + file2.txt

The duplicate entries are not removed!!

I know that there are entries common between file1.txt and file2.txt

My question is: do the different encodings of the files impose some kind of restriction on the sort and uniq commands?

I thought of converting both files to UTF-8 encoding as follows:

Code:

iconv -f US-ASCII -t UTF-8 file1.txt >> file_utf8.txt

and surprisingly, even after this the encoding remains us-ascii:

Code:

file -bi file_utf8.txt
text/x-c++; charset=us-ascii

Now, I try converting the second file's encoding to UTF-8. At present it has unknown-8bit encoding:

Code:

iconv -f unknown-8bit -t UTF-8 file2.txt >> file2_utf8.txt
iconv: conversion from unknown-8bit unsupported

Yes, I understand that iconv does not have support for unknown-8bit encoding format.

So, to summarize:

1. Why are the duplicate entries not removed after merging the files using sort and uniq?

Possible Answer: Based on my understanding, the two files have different encodings, and that breaks the functionality of the sort command.

And if I suppress the error message by setting LC_ALL to "C", then the duplicate entries are not removed though the command is executed.

2. Why are the new encoding changes not reflected when I convert the file from US-ASCII to UTF-8 (as shown above in the example of file1.txt)?

I am not sure why exactly it happens.

3. How do I convert a file from unknown-8bit encoding to UTF-8?

iconv does not support unknown-8bit encoding and I don't know the encoding of the file either.

Thanks.

antegallya · 07-22-2012, 07:08 PM

Hello,

I'll answer your questions one by one.
1. You're right, a difference in encoding breaks string comparison functions. To understand why, think about how characters are represented internally in different encodings. Take the character "é". In latin1 it is the single byte E9, but in UTF-8 it is the two-byte sequence C3 A9. A byte-by-byte comparison of that character in the two encodings therefore concludes that the characters are different, even though they actually represent the same character.
The sort command detects that your input contains bytes that are not valid in the encoding of your locale, so it issues that legitimate warning.
Setting your locale to 'C' then forces the sort command to compare the lines byte by byte, which reports inequalities where there might be equalities.

2. Your conversion actually works. But US-ASCII characters keep the same representation in UTF-8, so the file is not changed.
Moreover, the encoding of a plain-text file is not stored anywhere. It is detected by reading bytes from the file and guessing which encoding they belong to.
Since all your source characters are US-ASCII characters and remain the same bytes after conversion to UTF-8, it is natural that the guessing algorithm still sees the file as a US-ASCII encoded file.

3. You can't. "unknown-8bit" means exactly what it says: the encoding is *unknown*, so no automatic conversion is possible. You have to investigate your file to find out which encoding (or mix of encodings) it actually uses.
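
To make points 1 and 2 concrete, here is a minimal sketch that prints the bytes involved. It assumes a POSIX shell with od and iconv installed; the variable names are only for illustration.

```shell
# "é" encoded as Latin-1 (ISO-8859-1) is the single byte E9 (octal 351).
latin1_hex=$(printf '\351' | od -An -tx1 | tr -d ' \n')

# The same character converted to UTF-8 becomes the two bytes C3 A9.
utf8_hex=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

# Pure ASCII text is byte-for-byte identical before and after conversion,
# which is why "file -bi" still reports us-ascii for the converted file.
ascii_before=$(printf 'starwars\n' | od -An -tx1 | tr -d ' \n')
ascii_after=$(printf 'starwars\n' | iconv -f US-ASCII -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "latin1: $latin1_hex  utf8: $utf8_hex"
if [ "$ascii_before" = "$ascii_after" ]; then
    echo "ASCII text unchanged by iconv"
fi
```

The same iconv call with -f ISO-8859-1 would only be appropriate for an "unknown-8bit" file if you have found out that the file really is Latin-1. That is an assumption to verify by inspecting the file, not something iconv can work out for you.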

NeonFlash · 07-23-2012, 12:06 AM

Thanks antegallya

That was helpful. I looked up the encoding of different characters in the Unicode table (Unicode code points and UTF-8 hex representation). Now I understand it better.

So, the reason why the duplicates were not removed by the sort and uniq commands was that the two files had different encodings. A byte-by-byte comparison did not find any equalities, even though the characters were the same, because they were represented in different encodings.

And since US-ASCII characters have the same single-byte representation in UTF-8, there is no change in the encoding of the file after applying iconv.

For the unknown-8bit, I have an idea:

Let's say, file1.txt has the unknown-8bit encoding:

I can do this:

Code:

cat file1.txt | grep ^"starwars"$ | od -t x2

and then observe the hexdump. This would tell me how the characters when read from this file are interpreted by the system.

In my case, I get the output as:

Code:

0000000 7473 7261 6177 7372 000a
0000011

Comparing these values with the ASCII table shows that they are the one-byte hex representations of the characters.

So, I am not sure why the encoding type is not detected by iconv, or is it because of some specific lines in the file which have characters stored in a different unknown encoding?

NeonFlash · 07-23-2012, 12:11 AM

And I tried the same command on the other file, file2.txt (with us-ascii encoding) and it shows that they are stored in the same representation:

Code:

cat file2.txt | grep ^"starwars"$ | od -t x2

output:

Code:

0000000 7473 7261 6177 7372 000a
0000011

So both words are stored in exactly the same way. If sort and uniq perform a byte-by-byte comparison, why are they not able to detect that the lines are equal and remove the duplicates?

antegallya · 07-23-2012, 06:48 AM

Well, I suspect that your file2 uses a mix of encodings and a mix of file formats, or that file1 and file2 don't use the same file format.

There are several types of line break; the most common are the Unix and the DOS ones. The Unix one ends a line with a line feed only (\n), while the DOS one uses a carriage return followed by a line feed (\r\n). If both are used for the same word, e.g. starwars is in file1 and also in file2 but file1 uses the Unix style and file2 uses the DOS style, then there will be the following two byte sequences:

Code:

starwars\n
starwars\r\n

which will be seen as two different lines by uniq.
You can use the following to convert your files to the Unix style:

Code:

sed 's/\r$//' file
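
As a rough illustration of the line-ending problem, the sketch below builds two throwaway sample files (hypothetical, created in a temporary directory) and counts the unique lines before and after removing carriage returns. It uses tr -d '\r', which deletes every carriage return in the input, as a portable alternative to the sed command.

```shell
# Hypothetical sample files: the same word with Unix (LF) and DOS (CRLF) endings.
tmpdir=$(mktemp -d)
printf 'starwars\n'   > "$tmpdir/unix.txt"
printf 'starwars\r\n' > "$tmpdir/dos.txt"

# od -c prints one character per byte, so the trailing \r becomes visible.
od -c "$tmpdir/dos.txt"

# Without stripping, both lines survive because their bytes differ.
before=$(cat "$tmpdir/unix.txt" "$tmpdir/dos.txt" | LC_ALL=C sort | uniq | wc -l | tr -d ' ')

# After deleting the carriage returns, the two lines compare equal.
after=$(cat "$tmpdir/unix.txt" "$tmpdir/dos.txt" | tr -d '\r' | LC_ALL=C sort | uniq | wc -l | tr -d ' ')

echo "unique lines before: $before, after: $after"
rm -rf "$tmpdir"
```

Running od -c on a few lines of your real files is a quick way to check whether they contain \r at all.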

If you want, you can attach your files so that I can have a look at them.

BTW, you can use "sort -u", which does the same job as "sort | uniq".
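
Putting the pieces together, a merge could look roughly like the sketch below. merge_unique is a hypothetical helper name, and the approach assumes that, apart from the line endings, duplicate lines are stored as the same bytes in every file (for example, all files already converted to UTF-8).

```shell
# merge_unique (hypothetical helper): concatenate the given files, delete
# carriage returns, then sort bytewise and drop duplicate lines.
merge_unique() {
    cat "$@" | tr -d '\r' | LC_ALL=C sort -u
}

# Usage with the file names from this thread. ">" overwrites output.txt,
# so rerunning the command does not append to earlier results as ">>" would.
# merge_unique file1.txt file2.txt > output.txt
```

Lines that contain the same character in two different encodings would still be kept as separate entries, so any encoding conversion has to happen before this step.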