LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 07-22-2012, 10:10 AM   #1
NeonFlash
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Rep: Reputation: Disabled
Sort, Uniq and Merge Different Encoding Files


Hi,

I have multiple files which I am trying to merge into one file.

Each of these files has a different encoding: some of them are UTF-8, others are reported as us-ascii, binary or unknown-8bit.

when I try this:

Code:
cat file1.txt file2.txt | sort | uniq >> output.txt
I get the following error:

Code:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `monic\341\r' and `seedling\r'.
After setting the LC_ALL environment variable and retrying the above command, it works.

However, now the problem is, output.txt = file1.txt + file2.txt

The duplicate entries are not removed!!

I know that there are entries common between file1.txt and file2.txt

My question is, do the different encodings of the files impose some kind of restriction on the sort and uniq commands?

I thought of converting both the files to UTF-8 encoding as following:

Code:
iconv -f US-ASCII -t UTF-8 file1.txt >> file_utf8.txt
and surprisingly, even after this the encoding remains us-ascii:

Code:
file -bi file_utf8.txt
text/x-c++; charset=us-ascii
Now, I try converting the second file's encoding to UTF-8. At present it has unknown-8bit encoding:

Code:
iconv -f unknown-8bit -t UTF-8 file2.txt >> file2_utf8.txt
iconv: conversion from unknown-8bit unsupported
Yes, I understand that iconv does not have support for unknown-8bit encoding format.

So, to summarize:

1. Why are the duplicate entries not removed after merging the files using sort and uniq?

Possible Answer: Based on my understanding, both the files have a different encoding and it breaks the functionality of the sort command.

And if I suppress the error message by setting LC_ALL to "C", then the duplicate entries are not removed though the command is executed.

2. Why are the new encoding changes not reflected when I convert the file from US-ASCII to UTF-8 (as shown above in the example of file1.txt)?

I am not sure why exactly it happens.

3. How do I convert a file from unknown-8bit encoding to UTF-8?

iconv does not support unknown-8bit encoding and I don't know the encoding of the file either.

Thanks.
 
Old 07-22-2012, 07:08 PM   #2
antegallya
Member
 
Registered: Jun 2008
Location: Belgium
Distribution: Debian
Posts: 109

Rep: Reputation: 42
Hello,

I'll answer your questions one by one.
1. You're right, a difference in encoding breaks string comparison. To understand why, think about how characters are represented internally in different encodings. Take the character "é". In latin1 it is the single byte E9, but in UTF-8 it is the two-byte sequence C3 A9. So a byte-by-byte comparison of that character in the two encodings concludes that they are different, even though they represent the same character.
The sort command detects that your input contains bytes that are not valid in the encoding of your locale, so it issues that legitimate error message.
Setting your locale to 'C' then forces sort to compare the lines byte by byte, which reports inequalities where there might be equalities.

2. Your conversion actually works. But US-ASCII characters keep the same representation in UTF-8, so the file is not changed.
Moreover, the encoding of a plain-text file is not stored anywhere. It is detected by reading bytes from the file and guessing which encoding they belong to.
Since all your source characters are US-ASCII and are still the same bytes after conversion to UTF-8, it is natural that the guessing algorithm still sees the file as US-ASCII.

3. You can't, at least not automatically. "unknown-8bit" means exactly what it says: the encoding is *unknown*, so no automatic conversion is possible. You have to investigate your file to find out which bytes make it look like an "unknown" encoding.
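
One way to start that investigation is to list the lines that contain bytes outside printable ASCII, then try a candidate encoding on them and look at the result. This is only a sketch: the sample file is made up, and ISO-8859-1 is a guess for illustration (the byte \341 from the sort error message would be "á" in that encoding), not something the thread confirms.

```shell
# Hypothetical sample resembling the lines quoted in the sort error message.
tmp=$(mktemp -d)
printf 'monic\341\r\nseedling\r\n' > "$tmp/sample.txt"

# Print the lines that contain anything other than printable ASCII or whitespace.
# LC_ALL=C makes grep work on single bytes; -a treats the input as text.
LC_ALL=C grep -a -n -v '^[[:print:][:space:]]*$' "$tmp/sample.txt"
suspect_lines=$(LC_ALL=C grep -a -c -v '^[[:print:][:space:]]*$' "$tmp/sample.txt")
echo "lines with non-ASCII bytes: $suspect_lines"

# Try the guessed source encoding and dump the resulting bytes for inspection.
converted_hex=$(iconv -f ISO-8859-1 -t UTF-8 "$tmp/sample.txt" | od -An -tx1 | tr -d ' \n')
echo "$converted_hex"
rm -rf "$tmp"
```

If the converted text reads sensibly, the guess was probably right; if it shows odd symbols, another candidate encoding has to be tried.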

Last edited by antegallya; 07-22-2012 at 07:10 PM.
 
2 members found this post helpful.
Old 07-23-2012, 12:06 AM   #3
NeonFlash
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Thanks antegallya

That was helpful. I looked up the encoding of different characters in the Unicode table (Unicode Code Points and UTF-8 hex representation). Now, I understand it better

So, the reason why the duplicates were not removed by sort and uniq was that the two files had different encodings. In a byte-by-byte comparison, no equalities were found even though the characters were the same, because they were represented in different encodings.

OK, and since US-ASCII has the same (single-byte) representation in UTF-8, there is no change in the encoding of the file after applying iconv.
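
The effect can be reproduced on a small scale. The sketch below uses a made-up word ("café") and made-up file names; it writes the same word once in Latin-1 and once in UTF-8, and counts the unique lines before and after converting the Latin-1 copy.

```shell
tmp=$(mktemp -d)
printf 'caf\351\n'     > "$tmp/latin1.txt"   # é as the Latin-1 byte E9
printf 'caf\303\251\n' > "$tmp/utf8.txt"     # é as the UTF-8 bytes C3 A9

# Show the raw bytes of both files.
od -An -tx1 "$tmp/latin1.txt"
od -An -tx1 "$tmp/utf8.txt"

# A byte-wise sort keeps both lines because their bytes differ.
raw_count=$(cat "$tmp/latin1.txt" "$tmp/utf8.txt" | LC_ALL=C sort -u | wc -l | tr -d ' ')

# Once the Latin-1 file is converted to UTF-8, the two lines collapse into one.
fixed_count=$( { iconv -f ISO-8859-1 -t UTF-8 "$tmp/latin1.txt"; cat "$tmp/utf8.txt"; } \
  | LC_ALL=C sort -u | wc -l | tr -d ' ')
echo "unique lines before: $raw_count, after: $fixed_count"
rm -rf "$tmp"
```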
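
A quick check of this, with hypothetical file names: convert a pure ASCII file with iconv and compare the result with the input byte for byte. Note that the sketch uses ">" rather than the ">>" from the first post, so that an existing output file is overwritten instead of appended to.

```shell
tmp=$(mktemp -d)
printf 'starwars\nseedling\n' > "$tmp/ascii.txt"

# Convert from US-ASCII to UTF-8.
iconv -f US-ASCII -t UTF-8 "$tmp/ascii.txt" > "$tmp/utf8.txt"

# cmp -s is silent and only reports through its exit status.
if cmp -s "$tmp/ascii.txt" "$tmp/utf8.txt"; then same=yes; else same=no; fi
echo "byte-for-byte identical: $same"
rm -rf "$tmp"
```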

For the unknown-8bit, I have an idea:

Let's say, file1.txt has the unknown-8bit encoding:

I can do this:

Code:
cat file1.txt | grep ^"starwars"$ | od -t x2
and then observe the hexdump. This would tell me how the characters when read from this file are interpreted by the system.

In my case, I get the output as:

Code:
0000000 7473 7261 6177 7372 000a
0000011
Comparing these values with the ASCII table, I can see that they are the one-byte hex representations of the characters.

So, I am not sure why the encoding type is not detected by iconv, or is it because of some specific lines in the file which have characters stored in a different unknown encoding?
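
A side note on reading that hexdump: "od -t x2" groups the input into two-byte words and prints each word in the machine's byte order, which is why "st" shows up as 7473 on a little-endian system. A one-byte dump is easier to compare with an ASCII table. A small sketch, using the same word as above:

```shell
# One byte per column, in file order.
printf 'starwars\n' | od -An -tx1

# Character view; control characters such as \n and \r are shown as escapes.
printf 'starwars\n' | od -c

# The same bytes as one compact hex string.
bytes_hex=$(printf 'starwars\n' | od -An -tx1 | tr -d ' \n')
echo "$bytes_hex"
```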
 
Old 07-23-2012, 12:11 AM   #4
NeonFlash
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
And I tried the same command on the other file, file2.txt (with us-ascii encoding) and it shows that they are stored in the same representation:

Code:
cat file2.txt | grep ^"starwars"$ | od -t x2
output:

Code:
0000000 7473 7261 6177 7372 000a
0000011
Now, both words are stored in exactly the same way. So if sort and uniq perform a byte-by-byte comparison, why are they not able to detect that the words are equal and remove the duplicates?
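
One thing worth checking here: the pattern ^"starwars"$ can only match a line that ends right after the final "s", so a copy stored as "starwars\r" would never reach od in the test above. Identical hexdumps therefore do not rule out lines with a trailing carriage return elsewhere in the files. The sketch below uses a made-up file with DOS line endings to show the difference:

```shell
tmp=$(mktemp -d)
printf 'starwars\r\nseedling\r\n' > "$tmp/dos.txt"

# The anchored pattern finds nothing, because the line is really "starwars\r".
anchored=$(grep -c '^starwars$' "$tmp/dos.txt" || true)

# An unanchored search does find the line, and od -c makes the \r visible.
loose=$(grep -c 'starwars' "$tmp/dos.txt")
grep 'starwars' "$tmp/dos.txt" | od -c

# Count the lines that end in a carriage return.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr\$" "$tmp/dos.txt" || true)
echo "anchored: $anchored, unanchored: $loose, lines ending in CR: $crlf_lines"
rm -rf "$tmp"
```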
 
Old 07-23-2012, 06:48 AM   #5
antegallya
Member
 
Registered: Jun 2008
Location: Belgium
Distribution: Debian
Posts: 109

Rep: Reputation: 42
Well, I suspect that either your file2 uses a mix of encodings and a mix of file formats, or file1 and file2 don't use the same file format.

There are multiple types of line breaks; the most common ones are the unix and the dos styles. The unix style uses only a line feed \n to end a line, while the dos style uses a carriage return followed by a line feed, \r\n. If both are used for the same word, e.g. starwars is in file1 and also in file2, but file1 uses the unix style and file2 uses the dos style, then there will be the following two byte sequences
Code:
starwars\n
starwars\r\n
which will be seen as two different words by uniq.
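You can use the following to convert your files to the unix style (this relies on GNU sed understanding \r; the result goes to standard output, so redirect it to a new file):
Code:
sed 's/\r$//' file
If you want, you can attach your files so that I can take a look at them.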

BTW, you can use "sort -u" that will do the same job as "sort | uniq".
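
Putting the two suggestions together, a minimal sketch with made-up file names. It swaps the sed command for "tr -d '\r'", which also works where sed does not understand \r, but removes every carriage return rather than only those at the end of a line:

```shell
tmp=$(mktemp -d)
printf 'starwars\nseedling\n'     > "$tmp/unix.txt"   # unix line endings
printf 'starwars\r\nseedling\r\n' > "$tmp/dos.txt"    # dos line endings

# Without normalising the line endings, every word survives twice.
before=$(cat "$tmp/unix.txt" "$tmp/dos.txt" | LC_ALL=C sort -u | wc -l | tr -d ' ')

# Strip the carriage returns first, then sort and drop duplicates.
after=$(cat "$tmp/unix.txt" "$tmp/dos.txt" | tr -d '\r' | LC_ALL=C sort -u | wc -l | tr -d ' ')
echo "unique lines before: $before, after: $after"
rm -rf "$tmp"
```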
 
  


All times are GMT -5. The time now is 04:15 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration