LinuxQuestions.org
Old 02-05-2010, 02:36 AM   #1
Jem7v!
Problems using awk/sed/sort with a ucs-2le encoded file


Hello

I'm having lots of fun and games trying to use (g)awk (and sed, sort) with a file encoded in ucs-2le.
The overall command looks like this:
sed '1d' ./ucs-2le_file.txt | sort -t '' -n -k 2,2 -k 3,3 -k 5,5 -T . | awk -F"" -f aggregate.awk > new_file.txt

Ideally I would like the new file to be created with the same ucs-2le encoding but I don't think it is.

To give some background:
Previously, a large file encoded in ucs-2le was FTP'd to the server and then loaded into an Oracle table using SQL*Loader with a UTF16 character set (a parameter in the control file).
To improve performance I'm trying to remove and aggregate data within the file so that the SQL*Loader run and the subsequent SQL have less data to process.
I'm therefore trying to keep the existing process but add a step that creates a smaller file (using awk/sort/sed as above), and then load the new file with the same SQL*Loader control file.
Unfortunately, after the new file has been created the SQL*Loader step fails because it can no longer recognise the delimiter and end-of-line characters. (The control file specifies the characters as hex values, and when the new file is viewed in vim there are additional "^@" characters in between every 'normal' ASCII character I would expect to see.) This has led me to believe it's a character set/encoding problem. I've experimented with modifying the locale, but to no avail.

So, the question is: can awk/sort/sed support multi-byte character sets? (I've checked vim and it does.) If so, what do I need to do to enable this? If not, can someone suggest an alternative approach?

Server details (i.e. uname -a):
Linux migdev 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Many thanks for your help.
 
Old 02-05-2010, 02:41 AM   #2
jschiwal
Since you mentioned vim, here is an article on converting from ucs2le to utf8 which gawk and sed should be able to handle:
http://krzysztofcierpisz.blogspot.co...icode-ucs.html
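
As a side note on the "^@" characters mentioned in the first post: in UCS-2LE every ASCII character is stored as two bytes, the character itself followed by a NUL byte, and vim displays a NUL as ^@ when it reads the file as a single-byte encoding. A small sketch of this, assuming an illustrative string "a|b" (the pipe is only a stand-in, not the real delimiter of the file):

```shell
# Encode the ASCII text "a|b" plus a newline as UCS-2LE and dump the raw bytes as hex.
# Each character becomes its ASCII byte followed by 00.
hex=$(printf 'a|b\n' | iconv -f UTF-8 -t UCS-2LE | od -An -tx1 | tr -d ' \n')
echo "$hex"
```

Byte-oriented tools see those 00 bytes as extra characters, so a single-byte delimiter or newline no longer lines up with the two-byte units that the control file describes.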
 
Old 02-05-2010, 03:42 AM   #3
David the H.
In more general terms, iconv is the usual app to convert between character encodings.

Code:
iconv -f UCS-2LE -t UTF-8 input_file
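
A rough sketch of how the whole pipeline might be wrapped in iconv, converting to UTF-8 on the way in and back to UCS-2LE on the way out. The tab delimiter, the sample rows and the awk one-liner are assumptions for illustration only; they stand in for the real delimiter and for aggregate.awk, neither of which is shown in the thread.

```shell
# Sketch only: delimiter (tab), sample data and the awk program are hypothetical.
tmp=$(mktemp -d)

# Build a small UCS-2LE sample: one header line plus three tab-separated rows.
printf 'id\tkey\tval\nx\t2\t10\ny\t1\t5\nz\t1\t7\n' \
  | iconv -f UTF-8 -t UCS-2LE > "$tmp/ucs-2le_file.txt"

# Convert to UTF-8, run sed/sort/awk on it, then convert the result back to UCS-2LE.
iconv -f UCS-2LE -t UTF-8 "$tmp/ucs-2le_file.txt" \
  | sed '1d' \
  | sort -t "$(printf '\t')" -n -k 2,2 \
  | awk -F '\t' -v OFS='\t' '{ sum[$2] += $3 } END { for (k in sum) print k, sum[k] }' \
  | sort -n \
  | iconv -f UTF-8 -t UCS-2LE > "$tmp/new_file.txt"

# Read the new file back as text to check the aggregated rows.
result=$(iconv -f UCS-2LE -t UTF-8 "$tmp/new_file.txt")
printf '%s\n' "$result"
```

With the sample data this sums the third column per value of the second column. If the real file uses CRLF line endings or starts with a byte order mark, those bytes pass through iconv as ordinary characters, so it is worth checking them separately.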
 
Old 02-05-2010, 06:03 AM   #4
Jem7v!
Original Poster
Quote:
Originally Posted by jschiwal View Post
Since you mentioned vim, here is an article on converting from ucs2le to utf8 which gawk and sed should be able to handle:
http://krzysztofcierpisz.blogspot.co...icode-ucs.html
Thanks for the response. I had already seen that web page and tried it, but unfortunately it didn't work (the SQL*Loader step still fails).
Basically I did the following steps:
Open the ucs-2le file in vim
Save as utf-8
Run awk to generate new file
Open new file as utf-8 in vim
Save as ucs-2le
SQL*Load the new file.

So, in general, awk and sed do not support the ucs2le or utf-16 encodings? And to use them, an explicit conversion is therefore required?

Thanks again.
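
When a converted file is rejected like this, one way to narrow it down is to compare the first raw bytes of the original file and the regenerated one, since a byte order mark or a different line terminator would not be visible in an editor. A minimal sketch, assuming two hypothetical sample files (no_bom.txt and with_bom.txt) that differ only by a leading FF FE byte order mark; whether a mark is actually involved in this case is not known from the thread:

```shell
tmp=$(mktemp -d)

# Two hypothetical samples with the same text; the second has a leading BOM (FF FE).
printf 'a\n' | iconv -f UTF-8 -t UCS-2LE > "$tmp/no_bom.txt"
{ printf '\377\376'; cat "$tmp/no_bom.txt"; } > "$tmp/with_bom.txt"

# Print up to the first 8 bytes of a file as one hex string.
first_bytes() { od -An -tx1 -N 8 "$1" | tr -d ' \n'; }

no_bom=$(first_bytes "$tmp/no_bom.txt")
with_bom=$(first_bytes "$tmp/with_bom.txt")
echo "no BOM:   $no_bom"
echo "with BOM: $with_bom"
```

Running the same od command on the original upload and on the new file should show directly whether their leading bytes and line endings match what the control file expects.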
 
  

