Problems using awk/sed/sort with a ucs-2le encoded file
I'm having lots of fun and games trying to use (g)awk (and sed, sort) with a file encoded in ucs-2le.
The overall command looks like this:
sed '1d' ./ucs-2le_file.txt | sort -t '¤' -n -k 2,2 -k 3,3 -k 5,5 -T . | awk -F"¤" -f aggregate.awk > new_file.txt
Ideally I would like the new file to be created with the same ucs-2le encoding but I don't think it is.
To give some background:
Previously a large file encoded in ucs-2le was FTP'd to the server and then loaded into an Oracle table using SQL*Loader with a UTF16 character set (a parameter in the control file).
To improve performance I'm trying to remove and aggregate data within the file so the SQL*Load and the subsequent SQL has less data to play with.
I'm therefore trying to use the existing process but adding an additional step to create a smaller file (using awk/sort/sed as above) and use the same SQL*loader control file to load the new file with the reduced dataset.
Unfortunately, after the new file has been created the SQL*Loader part fails because it can no longer recognise the delimiter and end-of-line characters (the control file specifies the character hex values). When the new file is viewed in vim there are additional "^@" characters in between every 'normal' ASCII character I would expect to see. This has led me to believe it's a character set/encoding problem. I've experimented with modifying the locale but to no avail.
So, the question is can awk/sort/sed support multi-byte character sets? (I've checked vim and it does). If so what do I need to do to allow this. If not can someone suggest an alternative approach.
Server details (i.e. uname -a):
Linux migdev 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
Many thanks for your help.
Since you mentioned vim, here is an article on converting from ucs2le to utf8 which gawk and sed should be able to handle:
In more general terms, iconv is the usual app to convert between character encodings.
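A minimal sketch of the iconv route, assuming the file really is UCS-2LE and that the text tools can cope with UTF-8. The file names and the inline awk aggregation are made up for illustration; they are stand-ins for the real data file and aggregate.awk.

```shell
# Hypothetical sample: a header line plus three ¤-delimited records, stored as UCS-2LE.
printf 'id¤key¤amount\n1¤A¤10\n2¤B¤5\n3¤A¤7\n' | iconv -f UTF-8 -t UCS-2LE > sample_ucs2le.txt

# Decode to UTF-8, run the usual tools, then re-encode for the loader.
iconv -f UCS-2LE -t UTF-8 sample_ucs2le.txt \
  | sed '1d' \
  | awk -F'¤' '{ sum[$2] += $3 } END { for (k in sum) print k "¤" sum[k] }' \
  | sort \
  | iconv -f UTF-8 -t UCS-2LE > result_ucs2le.txt

# Decode once more, only to inspect the result on a UTF-8 terminal.
iconv -f UCS-2LE -t UTF-8 result_ucs2le.txt
```

Doing the conversion at both ends of the pipe keeps everything in one command, so there is no manual open/save step in an editor.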
I tried it, but unfortunately it didn't work (the SQL*Loader step fails).
Basically I did the following steps:
Open the ucs-2le file in vim
Save as utf-8
Run awk to generate new file
Open new file as utf-8 in vim
Save as ucs-2le
SQL*Load the new file.
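If the load fails after steps like these, it may be worth comparing the raw bytes of the original file and the rebuilt one, since the control file matches on hex values. The sketch below uses two made-up stand-in files (one with a byte order mark and CRLF line ends, one without) purely to show what to look for; whether the real files differ in this way is an assumption to check, not a diagnosis.

```shell
# Hypothetical stand-ins: the "original" has a BOM and CRLF line ends, the "rebuilt" file has neither.
{ printf '\377\376'; printf 'A¤1\r\n' | iconv -f UTF-8 -t UCS-2LE; } > original_check.txt
printf 'A¤1\n' | iconv -f UTF-8 -t UCS-2LE > rebuilt_check.txt

# First two bytes: "ff fe" is a little-endian byte order mark.
head -c 2 original_check.txt | od -An -tx1
head -c 2 rebuilt_check.txt  | od -An -tx1

# Last four bytes: "0d 00 0a 00" is CRLF, a bare "0a 00" is LF.
tail -c 4 original_check.txt | od -An -tx1
tail -c 4 rebuilt_check.txt  | od -An -tx1
```

Running the same head/tail/od commands on the real files shows whether the delimiter and terminator bytes still match what the control file specifies.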
So, in general awk and sed do not support ucs2le or utf-16 encodings? Therefore to use them an explicit conversion is required?
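As far as I know that is right: these tools work on bytes or on the locale's multibyte encoding (UTF-8, for example), and glibc does not offer UCS-2 or UTF-16 as a locale character set, so an explicit conversion is the usual approach. A quick way to see why the raw file confuses them is to dump a short made-up sample as bytes:

```shell
# Encode a tiny sample as UCS-2LE and show the bytes in hex.
printf 'a¤b\n' | iconv -f UTF-8 -t UCS-2LE | od -An -tx1
# Bytes produced: 61 00 a4 00 62 00 0a 00
```

Every character takes two bytes, so each ASCII character is followed by a 00 byte (the "^@" that vim shows), and the 0a that the tools treat as end of line is itself followed by a 00 that ends up at the start of the next line.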