LinuxQuestions.org
Old 08-27-2017, 09:13 PM   #1
HardenedCriminal
Member
 
Registered: May 2015
Posts: 104

Rep: Reputation: Disabled
sed & cat UTF-8 files


I have used this on ANSI files and it all works great, but now I am putting together 3 files which are UTF-8 encoded.

What am I missing?
Do I have to tell sed or cat up front that they are handling UTF-8 files on my box?

#!/bin/bash
sed 's/ZZZZZZ/G1/g' _headerGnums.txt > G1_headerGnums && cat G1_headerGnums G1.txt footer.txt > G1.htm

thanks to all you experts in advance.
 
Old 08-27-2017, 11:07 PM   #2
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,294
Blog Entries: 3

Rep: Reputation: 3719
Quote:
Originally Posted by HardenedCriminal
Do I have to tell sed or cat upfront they are using UTF-8 files on my box.
sed should be informed at least. What is the output of the following:

Code:
locale
It should be something like this with a lot of variables having UTF-8:

Code:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
However, sed should work with substitutions regardless. What kind of error are you actually seeing?
 
Old 08-28-2017, 04:54 AM   #3
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,780

Rep: Reputation: 1198
Character recognition depends on the locale, so setting it generally makes sense for sed.
Usually you simply switch to an installed UTF-8 locale with
Code:
LC_ALL="en_US.UTF-8"
List installed (available) locales with
Code:
locale -a

Last edited by MadeInGermany; 08-28-2017 at 04:55 AM.
 
Old 08-28-2017, 09:14 AM   #4
HardenedCriminal
Member
 
Registered: May 2015
Posts: 104

Original Poster
Rep: Reputation: Disabled
10:04:46 $$$$/1:~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=


But LC_ALL is blank.
Where do I fix that?
This is CentOS 6.

I tried putting LC_ALL="en_US.UTF-8"
at the top of the script, with no luck.
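
One possible reason the assignment had no visible effect: a plain `LC_ALL=...` line sets a shell variable, and child processes such as sed only see it once it is exported. The sketch below shows the difference. It assumes en_US.UTF-8 is installed (check with `locale -a`), and the variable names `before` and `after` are made up for this example. Since LANG is already en_US.UTF-8 here, the locale may well not be the cause of the problem at all.

```shell
#!/bin/bash
# Sketch: a plain assignment is not inherited by child processes
# unless the variable is exported.
# Assumption: en_US.UTF-8 is an installed locale (see: locale -a).
unset LC_ALL

LC_ALL="en_US.UTF-8"    # shell variable only, children do not inherit it
before=$(sh -c 'printf "%s" "${LC_ALL:-unset}"')

export LC_ALL           # now part of the environment of every child
after=$(sh -c 'printf "%s" "${LC_ALL:-unset}"')

printf 'before export: %s\nafter export:  %s\n' "$before" "$after"
```

To set the locale for a single command only, the assignment can also be written as a prefix on the same line, for example `LC_ALL=en_US.UTF-8 sed 's/ZZZZZZ/G1/g' _headerGnums.txt`.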
 
Old 08-28-2017, 09:26 AM   #5
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,294
Blog Entries: 3

Rep: Reputation: 3719
Quote:
Originally Posted by HardenedCriminal
I tried: LC_ALL="en_US.UTF-8"
at the top of the script no luck.
Can you describe in a little more detail the symptoms that indicate a UTF-8 problem?
 
Old 08-28-2017, 11:12 AM   #6
HardenedCriminal
Member
 
Registered: May 2015
Posts: 104

Original Poster
Rep: Reputation: Disabled
The HTML file that the sed/cat line produces is full of diamonds with question marks inside. If I combine the 3 files on a Windows computer, the file shows the English and Greek as it should.

Linux cat:
���G1 ###<�br>� ###<�br>al'-fah ###<�br>Part of Speech: {<�a href='../RMAC/N-LI.htm'>N-LI<�/a>} {Pr} ###<�br>MLV/Definition: A> Alpha*, [3] ###<�br>Supplement: (First Greek Alphabet letter.) ###<�br>Etymology: {Heb?} ###<�br>All Compounds: mostly in G4-G895 ###<�br> ###<�br>Greek Concordance: [3] <�a href="Rev_1-8.htm">Rev_1-8<�/a>, <�a href="Rev_21-6.htm">Rev_21-6<�/a>, <�a href="Rev_22-13.htm">Rev_22-13<�/a> ###<�br> ###<�br>KJV: Alpha 4 TR: 4 ###<�br>TDNT: 1:1,* ###<�br> ###<�br> 

Winblows:

÷G1 ###
Α ###
al'-fah ###
Part of Speech: {N-LI} {Pr} ###
MLV/Definition: A> Alpha*, [3] ###
Supplement: (First Greek Alphabet letter.) ###
Etymology: {Heb?} ###
All Compounds: mostly in G4-G895 ###
###
Greek Concordance: [3] Rev_1-8, Rev_21-6, Rev_22-13 ###
###
KJV: Alpha 4 TR: 4 ###
TDNT: 1:1,* ###
###

(Ignore the ###; they are replaced in the next step of the script.)

Last edited by HardenedCriminal; 08-28-2017 at 11:13 AM.
 
Old 08-28-2017, 12:23 PM   #7
DavidMcCann
LQ Veteran
 
Registered: Jul 2006
Location: London
Distribution: PCLinuxOS, Debian
Posts: 6,137

Rep: Reputation: 2314
I too am using CentOS 6. My locale is like yours, save for being en_GB. LC_ALL being blank is not a fault.

I've just created two plain text files, one Latin and one Greek, and successfully merged them with cat. I can't test sed, as I haven't the foggiest idea what it does or how to use it! Exactly how were your files created? The mark-up suggests it wasn't in a text editor. Were they created in Windows? That would create a problem which could be solved by converting the files:
iconv -f UTF16 -t UTF8 oldfile --output newfile
 
Old 08-28-2017, 12:53 PM   #8
HardenedCriminal
Member
 
Registered: May 2015
Posts: 104

Original Poster
Rep: Reputation: Disabled
Is there any way to run iconv on the whole directory of .txt files first?

Or even a way to do the conversion in place in my command line:
sed 's/ZZZZZZ/G1/g' _headerGnums.txt > G1_headerGnums && cat G1_headerGnums G1.txt footer.txt > G1.htm

I googled and got a whole lot of the same kind of commands that give errors not results.

G1.txt is the problem child.
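
A minimal sketch of doing the conversion inside the original command line, so that only the problem file passes through iconv while the page is assembled. It assumes G1.txt is UTF-16 with a byte-order mark and that the header and footer are already UTF-8. The function name `build_page` is made up for this example.

```shell
#!/bin/bash
# Sketch: assemble <id>.htm, converting only <id>.txt from UTF-16 on the fly.
# Assumptions: <id>.txt is UTF-16 with a BOM; _headerGnums.txt and
# footer.txt are already UTF-8. build_page is a hypothetical helper name.
build_page() {
    id=$1
    sed "s/ZZZZZZ/${id}/g" _headerGnums.txt > "${id}_headerGnums" &&
    {
        cat "${id}_headerGnums"
        iconv -f UTF-16 -t UTF-8 "${id}.txt"
        cat footer.txt
    } > "${id}.htm"
}
```

Calling `build_page G1` in the directory that holds the three files would write G1.htm, and the original G1.txt is left untouched.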
 
Old 08-28-2017, 01:02 PM   #9
HardenedCriminal
Member
 
Registered: May 2015
Posts: 104

Original Poster
Rep: Reputation: Disabled
Great, you discovered the problem. Wonderful news.

Even though the files say they are UTF-8, apparently one of them is UTF-16.

Is there any way to run iconv on a whole directory of files?
 
Old 08-28-2017, 01:04 PM   #10
HardenedCriminal
Member
 
Registered: May 2015
Posts: 104

Original Poster
Rep: Reputation: Disabled
I keep finding things like this:

find -name "*.txt" -exec iconv --from-code=UTF-16 --to-code=UTF-8 '{}' -o '{}' \;

and error:
iconv: incomplete character or shift sequence at end of buffer
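
Two things in that command look risky, though neither is confirmed as the cause of the error: the output file is the same as the input file, and every .txt file is treated as UTF-16 even if it is already UTF-8. A hedged sketch of a safer per-file conversion writes to a temporary file first and only replaces the original when iconv succeeds. The function name `convert_in_place` is made up for this example.

```shell
#!/bin/bash
# Sketch: convert one UTF-16 file to UTF-8 without reading and writing
# the same file at once. convert_in_place is a hypothetical helper name.
convert_in_place() {
    f=$1
    tmp=$(mktemp "${f}.XXXXXX") || return 1
    if iconv -f UTF-16 -t UTF-8 "$f" > "$tmp"; then
        cat "$tmp" > "$f"    # overwrite the contents, keep the file's permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"         # conversion failed, original is left untouched
        return 1
    fi
}
```

It could be applied with a loop such as `for f in ./*.txt; do convert_in_place "$f"; done`, but only to files known to be UTF-16, since iconv may turn a file that is already UTF-8 into garbage instead of reporting an error.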
 
Old 08-28-2017, 01:16 PM   #11
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,294
Blog Entries: 3

Rep: Reputation: 3719
What is the nature of the problem file? The following should give some useful output:

Code:
file G1.txt
See "man file"
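
If the output of file is not conclusive, the first two bytes of the file can be inspected directly. Files saved as "Unicode" on Windows usually start with a UTF-16 byte-order mark, FF FE for little-endian or FE FF for big-endian. This is a minimal sketch, and the function name `has_utf16_bom` is made up for this example. A UTF-16 file without a BOM would not be detected by it.

```shell
#!/bin/bash
# Sketch: report whether a file starts with a UTF-16 byte-order mark.
# has_utf16_bom is a hypothetical helper name.
has_utf16_bom() {
    # Dump the first two bytes as hex and strip od's padding
    bom=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$bom" = "fffe" ] || [ "$bom" = "feff" ]
}
```

For example, `has_utf16_bom G1.txt && echo "G1.txt looks like UTF-16"`.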
 
Old 08-28-2017, 01:22 PM   #12
HardenedCriminal
Member
 
Registered: May 2015
Posts: 104

Original Poster
Rep: Reputation: Disabled
I found a neat little fixer script and it does the trick. Thanks to all for your help.
Now how do I mark this solved?

================================
#!/bin/bash
# found here: https://unix.stackexchange.com/quest...nverted-output
# conversor.sh
# Author.....: dede.exe
# E-mail.....: dede.exe@gmail.com
# Description: Convert all matching files to another encoding
# It's not a safe way to do it...
# Just a desperate script to save my life...
# Use it as a last resort...

# file --mime-encoding reports "utf-8", so use the same spelling here
to_format="utf-8"
file_pattern="*.txt"

files=`find . -name "${file_pattern}"`

echo "==================== CONVERTING ===================="

# Try to convert all files in the tree (file names must not contain whitespace)
for file_name in ${files}
do
    # Get the current encoding as reported by file, e.g. utf-16le
    file_format=`file "$file_name" --mime-encoding | cut -d":" -f2 | sed -e 's/ //g'`

    if [ "$file_format" != "$to_format" ]; then

        # Temporary name next to the original file
        file_tmp="${file_name}.tmp"

        # Rename the file to a temporary file
        mv "$file_name" "$file_tmp"

        # Create a new file in the new encoding
        iconv -f "$file_format" -t "$to_format" "$file_tmp" > "$file_name"

        # Remove the temporary file
        rm "$file_tmp"

        echo "File Name...: $file_name"
        echo "From Format.: $file_format"
        echo "To Format...: $to_format"
        echo "---------------------------------------------------"

    fi
done
=======================================
 
Old 08-28-2017, 01:29 PM   #13
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,294
Blog Entries: 3

Rep: Reputation: 3719
There's a "Thread Tools" link at the top of the posts. That will lead you to an option to mark the thread as solved.

About cut and sed, if you have the two together you might as well use awk:

Code:
file  --mime-encoding "${file_name}" | awk '{print $2}' FS=':[ :]+'
As you can see awk can even use a pattern for field separator.

Also, backticks can get hard to read and have other disadvantages. Using $( ... ) is considered a better syntax.
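
Both suggestions can be combined in one line. The sketch below uses a fixed sample string in place of real output from file --mime-encoding so that it runs anywhere. The variable names `sample` and `encoding` are made up for this example.

```shell
#!/bin/bash
# Sketch: $( ... ) command substitution plus awk with a regex field separator.
# sample stands in for the output of: file --mime-encoding ./G1.txt
sample='./G1.txt: utf-16le'

# Split on a colon followed by one or more spaces and keep the second field
encoding=$(printf '%s\n' "$sample" | awk -F ':[ ]+' '{print $2}')

printf 'detected encoding: %s\n' "$encoding"
```

With the real command, file also has a brief mode that omits the file name, so `encoding=$(file -b --mime-encoding "$file_name")` may make the awk step unnecessary.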

Last edited by Turbocapitalist; 08-28-2017 at 01:40 PM.
 
  

