[SOLVED] sed & cat UTF-8 files

HardenedCriminal · 08-27-2017, 09:13 PM

I have used this on ANSI files and it all works great, but now I am putting together 3 files which are UTF-8 encoded.

What am I missing?
Do I have to tell sed or cat up front that they are working with UTF-8 files on my box?

#!/bin/bash
sed 's/ZZZZZZ/G1/g' _headerGnums.txt > G1_headerGnums && cat G1_headerGnums G1.txt footer.txt > G1.htm

Thanks to all you experts in advance.

Turbocapitalist · 08-27-2017, 11:07 PM

Quote:

Originally Posted by HardenedCriminal

Do I have to tell sed or cat up front that they are working with UTF-8 files on my box?

sed should be informed at least. What is the output of the following:

Code:

locale

It should be something like this with a lot of variables having UTF-8:

Code:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

However, sed should work with substitutions regardless. What kind of error are you actually seeing?

MadeInGermany · 08-28-2017, 04:54 AM

Character recognition depends on the locale, so for sed it generally makes sense to set one.
Usually you simply switch to an installed UTF-8 locale with

Code:

LC_ALL="en_US.UTF-8"

List installed (available) locales with

Code:

locale -a
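
As a minimal sketch of the locale advice above: an assignment written in front of a command is passed to that one command only, so sed can be run under a UTF-8 locale without changing anything else. The temporary directory and the one-line stand-in header are made up for illustration; en_US.UTF-8 is the locale shown in this thread and should be checked against the output of locale -a.

```shell
#!/bin/bash
# Sketch: run sed under an explicit UTF-8 locale for a single command.
# The work directory and header content are stand-ins, not the real files.
workdir=$(mktemp -d)
printf '<title>ZZZZZZ</title>\n' > "$workdir/_headerGnums.txt"

# VAR=value in front of a command places VAR in that command's environment only.
LC_ALL=en_US.UTF-8 sed 's/ZZZZZZ/G1/g' "$workdir/_headerGnums.txt" > "$workdir/G1_headerGnums"

cat "$workdir/G1_headerGnums"
```

Inside a script, a bare LC_ALL="en_US.UTF-8" line only reaches child programs such as sed if the variable is exported (export LC_ALL) or was already part of the environment.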

HardenedCriminal · 08-28-2017, 09:14 AM

10:04:46 $$$$/1:~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

But LC_ALL is blank.
Where do I fix that?
This is CentOS 6.

I tried: LC_ALL="en_US.UTF-8"
at the top of the script no luck.

Turbocapitalist · 08-28-2017, 09:26 AM

Quote:

Originally Posted by HardenedCriminal

I tried: LC_ALL="en_US.UTF-8"
at the top of the script no luck.

Can you describe the symptoms that indicate a UTF-8 problem a little?

HardenedCriminal · 08-28-2017, 11:12 AM

The HTML file that the sed/cat line makes is full of diamonds with question marks inside. If I combine the 3 files on a Windows computer, then the file shows the English and Greek as it should.

Linux cat:
��G1 ###<�br>� ###<�br>al'-fah ###<�br>Part of Speech: {<�a href='../RMAC/N-LI.htm'>N-LI<�/a>} {Pr} ###<�br>MLV/Definition: A> Alpha*, [3] ###<�br>Supplement: (First Greek Alphabet letter.) ###<�br>Etymology: {Heb?} ###<�br>All Compounds: mostly in G4-G895 ###<�br> ###<�br>Greek Concordance: [3] <�a href="Rev_1-8.htm">Rev_1-8<�/a>, <�a href="Rev_21-6.htm">Rev_21-6<�/a>, <�a href="Rev_22-13.htm">Rev_22-13<�/a> ###<�br> ###<�br>KJV: Alpha 4 TR: 4 ###<�br>TDNT: 1:1,* ###<�br> ###<�br>

Winblows:

÷G1 ###
Α ###
al'-fah ###
Part of Speech: {N-LI} {Pr} ###
MLV/Definition: A> Alpha*, [3] ###
Supplement: (First Greek Alphabet letter.) ###
Etymology: {Heb?} ###
All Compounds: mostly in G4-G895 ###
###
Greek Concordance: [3] Rev_1-8, Rev_21-6, Rev_22-13 ###
###
KJV: Alpha 4 TR: 4 ###
TDNT: 1:1,* ###
###

(Ignore the ###; they are to be replaced in the next step of the script.)

DavidMcCann · 08-28-2017, 12:23 PM

I too am using CentOS 6. My locale is like yours, save for being en_GB. LC_ALL being blank is not a fault.

I've just created two plain text files, one Latin and one Greek, and successfully merged them with cat. I can't test sed, as I haven't the foggiest idea what it does or how to use it! Exactly how were your files created? The mark-up suggests it wasn't in a text editor. Were they created in Windows? That would create a problem which could be solved by converting the files:
iconv -f UTF16 -t UTF8 oldfile --output newfile
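
A rough sketch of that conversion on a made-up sample, in case it helps to see the round trip. The sample text and temporary directory are assumptions for illustration; the byte-order-mark check with head and od is just one way to peek at the start of a file and is not something used elsewhere in this thread.

```shell
#!/bin/bash
# Sketch: build a UTF-16 sample file, look at its first two bytes, convert it to UTF-8.
workdir=$(mktemp -d)

# "G1 " followed by a Greek capital alpha (UTF-8 bytes CE 91), stored as UTF-16
# to stand in for a file saved on Windows.
printf 'G1 \316\221\n' | iconv -f UTF-8 -t UTF-16 > "$workdir/oldfile"

# A UTF-16 file written with a byte-order mark starts with ff fe or fe ff.
bom=$(head -c 2 "$workdir/oldfile" | od -An -tx1 | tr -d ' \n')
echo "first two bytes: $bom"

# Convert to UTF-8, writing to a different file than the input.
iconv -f UTF-16 -t UTF-8 "$workdir/oldfile" > "$workdir/newfile"
cat "$workdir/newfile"
```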

HardenedCriminal · 08-28-2017, 12:53 PM

Is there any way to run iconv on the whole directory of .txt files first?

Or even a way to do it in place in my command line:
sed 's/ZZZZZZ/G1/g' _headerGnums.txt > G1_headerGnums && cat G1_headerGnums G1.txt footer.txt > G1.htm

I googled and got a whole lot of the same kind of commands that give errors, not results.

G1.txt is the problem child.

HardenedCriminal · 08-28-2017, 01:02 PM

Great, you discovered the problem. Wonderful news.

Even though the files say they are UTF-8, apparently one of them is UTF-16.

Is there any way to run iconv on a whole directory of files?

HardenedCriminal · 08-28-2017, 01:04 PM

I keep finding things like this:

find -name "*.txt" -exec iconv --from-code=UTF-16 --to-code=UTF-8 '{}' -o '{}' \;

and get this error:
iconv: incomplete character or shift sequence at end of buffer
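
The thread does not confirm why that command fails. One possible cause (an assumption) is that it treats every .txt file as UTF-16 even when some are already UTF-8; reading and writing the same file in one iconv call is also risky. A hedged sketch of a more cautious loop follows: it converts only files that start with a UTF-16 byte-order mark and writes to a temporary file before replacing the original. The sample files and directory are made up for illustration.

```shell
#!/bin/bash
# Sketch: convert only the .txt files that start with a UTF-16 byte-order mark.
workdir=$(mktemp -d)
printf 'G1 \316\221\n' | iconv -f UTF-8 -t UTF-16 > "$workdir/G1.txt"   # UTF-16 sample
printf 'footer\n' > "$workdir/footer.txt"                               # already UTF-8

for f in "$workdir"/*.txt; do
    # Read the first two bytes as hex, e.g. "fffe".
    bom=$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')
    case "$bom" in
        fffe|feff)
            # Convert into a temporary file; replace the original only on success.
            if iconv -f UTF-16 -t UTF-8 "$f" > "$f.tmp"; then
                mv "$f.tmp" "$f"
            else
                rm -f "$f.tmp"
            fi
            ;;
    esac
done

cat "$workdir/G1.txt" "$workdir/footer.txt"
```

Files without a byte-order mark are left untouched, so a UTF-16 file saved without one would be missed by this check.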

Turbocapitalist · 08-28-2017, 01:16 PM

What is the nature of the problem file? The following should give some useful output:

Code:

file G1.txt

See "man file"

HardenedCriminal · 08-28-2017, 01:22 PM

I found a neat little fixer script and it does the trick. Thanks to all for your help.
Now how do I mark this solved?

================================
#!/bin/bash
# found here: https://unix.stackexchange.com/quest...nverted-output
#conversor.sh
#Author.....: dede.exe
#E-mail.....: dede.exe@gmail.com
#Description: Convert all files to another format
# It's not a safe way to do it...
# Just a desperate script to save my life...
# Use it as a last resort...

to_format="utf-8"    # spelled the way "file --mime-encoding" reports it
file_pattern="*.txt"

files=`find . -name "${file_pattern}"`

echo "==================== CONVERTING ===================="

#Try to convert all files in the structure
#(the unquoted list splits on whitespace, so file names must not contain spaces)
for file_name in ${files}
do
    #Get file format
    file_format=`file "$file_name" --mime-encoding | cut -d":" -f2 | sed -e 's/ //g'`

    if [ "$file_format" != "$to_format" ]; then

        #Temporary name next to the original file
        file_tmp="${file_name}.tmp"

        #Rename the file to a temporary file
        mv "$file_name" "$file_tmp"

        #Create a new file with a new format.
        iconv -f "$file_format" -t "$to_format" "$file_tmp" > "$file_name"

        #Remove the temporary file
        rm "$file_tmp"

        echo "File Name...: $file_name"
        echo "From Format.: $file_format"
        echo "To Format...: $to_format"
        echo "---------------------------------------------------"

    fi
done
=======================================

Turbocapitalist · 08-28-2017, 01:29 PM

There's a "Thread Tools" link at the top of the posts. That will lead you to an option to mark the thread as solved.

About cut and sed, if you have the two together you might as well use awk:

Code:

file --mime-encoding "${file_name}" | awk '{print $2}' FS=':[ :]+'

As you can see, awk can even use a pattern as the field separator.

Also, backticks can get hard to read and have other disadvantages. Using $( ... ) is considered a better syntax.
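
A small sketch combining both suggestions, for anyone who wants to try them without a test file at hand. The sample line is a hypothetical example of what file --mime-encoding might print for G1.txt; it is fed through printf so the snippet does not depend on any real file.

```shell
#!/bin/bash
# Sketch: $( ... ) command substitution plus awk with a pattern as field separator.
# sample_line is a made-up stand-in for the output of: file --mime-encoding G1.txt
sample_line='G1.txt: utf-16le'

# The separator is a colon followed by one or more spaces or colons,
# so the second field is the encoding name with no leading blank.
file_format=$(printf '%s\n' "$sample_line" | awk '{print $2}' FS=':[ :]+')

echo "$file_format"
```

Unlike backticks, $( ... ) can be nested without extra escaping, which is part of why it tends to be easier to read.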