Hi,
I have multiple files which I am trying to merge into one file.
The files have different encodings: some are UTF-8, and others are reported as us-ascii, binary, or unknown-8bit.
When I try this:
Code:
cat file1.txt file2.txt | sort | uniq >> output.txt
I get the following error:
Code:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `monic\341\r' and `seedling\r'.
After setting the LC_ALL environment variable to 'C' and retrying the above command, it runs without the error.
However, the problem now is that output.txt is just file1.txt plus file2.txt.
The duplicate entries are not removed!
I know that there are entries common to file1.txt and file2.txt.
My question is: do the different encodings of the files impose some kind of restriction on the sort and uniq commands?
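One thing I have not ruled out: the error message above shows a `\r` at the end of both compared strings, so at least some lines end in CRLF. With LC_ALL=C, sort and uniq compare raw bytes, so two lines that look the same but differ in line ending (or in how an accented character is encoded) count as different. Below is a minimal sketch of a byte-wise merge that strips carriage returns first. The function name merge_unique is made up for illustration, and it assumes a CR never appears in the middle of a line.

```shell
# Hypothetical helper (name is my own): merge files byte-wise and drop duplicates.
# ASSUMPTION: duplicates differ only by CR characters; tr removes every CR,
# not just the one at the end of a line.
merge_unique() {
    # LC_ALL=C makes sort compare raw bytes, so invalid multibyte
    # sequences cannot trigger the "string comparison failed" error.
    # sort -u keeps one copy of each distinct line.
    cat "$@" | tr -d '\r' | LC_ALL=C sort -u
}

# Usage: merge_unique file1.txt file2.txt > output.txt
```

Writing with `>` instead of `>>` also avoids appending to an output.txt left over from an earlier run.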
I thought of converting both files to UTF-8 as follows:
Code:
iconv -f US-ASCII -t UTF-8 file1.txt >> file_utf8.txt
Surprisingly, even after this, the reported encoding is still us-ascii:
Code:
file -bi file_utf8.txt
text/x-c++; charset=us-ascii
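My guess for this one, which I'd like confirmed: every US-ASCII byte sequence is already valid UTF-8, so the conversion has nothing to change, and file still reports us-ascii because the file contains no byte above 0x7F. A minimal sketch to check that the bytes are unchanged, assuming iconv and cmp are available (same_after_conversion is a name I made up):

```shell
# Hypothetical check (name is my own): does converting a file from
# US-ASCII to UTF-8 change any bytes?
same_after_conversion() {
    # iconv writes the converted text to stdout; cmp -s silently compares
    # that stream ("-" means stdin) with the original file.
    # Exit status 0 means the two are byte-for-byte identical.
    iconv -f US-ASCII -t UTF-8 "$1" | cmp -s - "$1"
}

# Usage: same_after_conversion file1.txt && echo "no bytes changed"
```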
Next, I tried converting the second file to UTF-8. At present its encoding is reported as unknown-8bit:
Code:
iconv -f unknown-8bit -t UTF-8 file2.txt >> file2_utf8.txt
iconv: conversion from unknown-8bit unsupported
I understand that iconv does not support unknown-8bit as a source encoding.
So, to summarize:
1. Why are the duplicate entries not removed after merging the files using sort and uniq?
Possible answer: As I understand it, the two files have different encodings, and this breaks the sort command.
If I suppress the error message by setting LC_ALL to "C", the command runs, but the duplicate entries are not removed.
2. Why is the encoding change not reflected when I convert file1.txt from US-ASCII to UTF-8 (as shown above)?
I am not sure why exactly it happens.
3. How do I convert a file from unknown-8bit encoding to UTF-8?
iconv does not accept unknown-8bit as an encoding name, and I don't know the actual encoding of the file either.
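A possible way forward, which is a guess and not something I know to be right: the byte `\341` in the sort error above is á in ISO-8859-1 (and in Windows-1252), so file2.txt may be in one of those single-byte encodings. A minimal sketch that converts under a guessed source encoding (to_utf8_guess is a name I made up, and the ISO-8859-1 default is only an assumption):

```shell
# Hypothetical helper (name is my own): convert a file to UTF-8 using a
# guessed source encoding.
# ASSUMPTION: ISO-8859-1 is only a guess based on the \341 byte in the
# sort error; pass a different encoding name as the second argument.
to_utf8_guess() {
    iconv -f "${2:-ISO-8859-1}" -t UTF-8 "$1"
}

# Usage: to_utf8_guess file2.txt > file2_utf8.txt
#        to_utf8_guess file2.txt WINDOWS-1252 > file2_utf8.txt
```

Since every byte is valid in ISO-8859-1, this conversion succeeds even when the guess is wrong, so the output needs a visual check for accented characters that look out of place.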
Thanks.