LinuxQuestions.org


Jem7v! 02-05-2010 03:36 AM

Problems using awk/sed/sort with a ucs-2le encoded file
 
Hello

I'm having lots of fun and games trying to use (g)awk (and sed, sort) with a file encoded in ucs-2le.
The overall command looks like this:
sed '1d' ./ucs-2le_file.txt | sort -t '' -n -k 2,2 -k 3,3 -k 5,5 -T . | awk -F"" -f aggregate.awk > new_file.txt

Ideally I would like the new file to be created with the same ucs-2le encoding but I don't think it is.

To give some background:
Previously, a large file encoded in ucs-2le was FTP'd to the server and then loaded into an Oracle table using SQL*Loader with a UTF16 character set (a parameter in the control file).
To improve performance, I'm trying to remove and aggregate data within the file so that the SQL*Loader step and the subsequent SQL have less data to process.
I'm therefore keeping the existing process but adding a step that creates a smaller file (using awk/sort/sed as above), and then using the same SQL*Loader control file to load the new file with the reduced dataset.
Unfortunately, after the new file has been created, the SQL*Loader step fails because it can no longer recognise the delimiter and end-of-line characters. (The control file specifies these characters as hex values, and when the new file is viewed in vim there are additional "^@" characters in between every 'normal' ASCII character I would expect to see.) This has led me to believe it's a character set/encoding problem. I've experimented with modifying the locale, but to no avail.
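
The "^@" characters that vim shows are NUL bytes. A minimal sketch of where they come from, assuming iconv and od are available and using a made-up '|' delimiter (the real delimiter is not shown in the command above): in ucs-2le every ASCII character is stored as its ASCII byte followed by a 00 byte, so a byte-oriented tool looking for a one-byte delimiter or newline sees the 00 bytes as extra characters.

```shell
# Encode the ASCII text "A|B" plus a newline as UCS-2LE and dump the raw bytes.
# The '|' delimiter is only an illustrative assumption.
hex=$(printf 'A|B\n' | iconv -f UTF-8 -t UCS-2LE | od -An -tx1 | tr -d ' \n')

# Each character occupies two bytes: the ASCII value, then 00.
echo "$hex"
```

Here 41, 7c, 42 and 0a are 'A', '|', 'B' and the newline, each followed by the 00 byte that vim displays as "^@".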

So, the question is: can awk/sort/sed support multi-byte character sets? (I've checked vim and it does.) If so, what do I need to do to allow this? If not, can someone suggest an alternative approach?

Server details (i.e. uname -a):
Linux migdev 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Many thanks for your help.

jschiwal 02-05-2010 03:41 AM

Since you mentioned vim, here is an article on converting from ucs2le to utf8 which gawk and sed should be able to handle:
http://krzysztofcierpisz.blogspot.co...icode-ucs.html

David the H. 02-05-2010 04:42 AM

In more general terms, iconv is the usual app to convert between character encodings.

Code:

iconv -f UCS-2LE -t UTF-8 input_file
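
As a rough illustration of the conversion in both directions, here is a small sketch. The file names and sample content are made up for the example, and it assumes an iconv that accepts the UCS-2LE and UTF-8 names. Note that iconv writes to standard output, so the result has to be redirected to keep it.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Build a small UCS-2LE sample (hypothetical content) to stand in for the real file.
printf 'id|qty\n1|10\n2|20\n' | iconv -f UTF-8 -t UCS-2LE > "$workdir/sample_ucs2le.txt"

# Decode to UTF-8; iconv prints to stdout, so redirect it to a file.
iconv -f UCS-2LE -t UTF-8 "$workdir/sample_ucs2le.txt" > "$workdir/sample_utf8.txt"

# Encode again and check that the round trip is byte-for-byte identical.
iconv -f UTF-8 -t UCS-2LE "$workdir/sample_utf8.txt" > "$workdir/roundtrip_ucs2le.txt"
if cmp -s "$workdir/sample_ucs2le.txt" "$workdir/roundtrip_ucs2le.txt"; then
    roundtrip=same
else
    roundtrip=different
fi
echo "round trip: $roundtrip"
```

If the round trip compares as identical on a sample of the real data, the conversion itself is unlikely to be what breaks the load.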

Jem7v! 02-05-2010 07:03 AM

Quote:

Originally Posted by jschiwal (Post 3853548)
Since you mentioned vim, here is an article on converting from ucs2le to utf8 which gawk and sed should be able to handle:
http://krzysztofcierpisz.blogspot.co...icode-ucs.html

Thanks for the response. I had already seen that web page and tried it, but unfortunately it didn't work (SQL*Loader fails).
Basically I did the following steps:
Open the ucs-2le file in vim
Save as utf-8
Run awk to generate new file
Open new file as utf-8 in vim
Save as ucs-2le
SQL*Load the new file.
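
The manual steps above could also be scripted with iconv in place of vim, so the whole conversion runs as one pipeline. This is only a sketch under stated assumptions: the '|' delimiter, the sample data, and the inline awk program (a stand-in for aggregate.awk that sums column 2 per key) are all hypothetical, since the real delimiter and script are not shown in this thread.

```shell
workdir=$(mktemp -d)

# Hypothetical input: a header row plus '|'-delimited records, stored as UCS-2LE.
printf 'key|qty\nb|5\na|1\nb|7\na|2\n' |
    iconv -f UTF-8 -t UCS-2LE > "$workdir/in_ucs2le.txt"

# Decode to UTF-8, drop the header, sort, aggregate, then encode back to UCS-2LE.
# The awk program is a placeholder for aggregate.awk: it sums column 2 per key.
iconv -f UCS-2LE -t UTF-8 "$workdir/in_ucs2le.txt" |
    sed '1d' |
    LC_ALL=C sort -t '|' -k 1,1 |
    awk -F'|' '
        $1 != key { if (NR > 1) print key "|" sum; key = $1; sum = 0 }
        { sum += $2 }
        END { if (NR > 0) print key "|" sum }
    ' |
    iconv -f UTF-8 -t UCS-2LE > "$workdir/out_ucs2le.txt"

# Decode the result only to inspect it; the file itself stays UCS-2LE.
result=$(iconv -f UCS-2LE -t UTF-8 "$workdir/out_ucs2le.txt")
echo "$result"
```

Doing the decode and encode in the same pipeline means sed, sort and awk only ever see UTF-8 text, and the output file is written in the encoding the control file expects.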

So, in general awk and sed do not support ucs2le or utf-16 encodings? Therefore, to use them, an explicit conversion is required?

Thanks again.


All times are GMT -5.