[SOLVED] Shell script - Problem getting output into a file with correct encoding

tercel · 06-08-2023, 02:33 PM

Hi all,
I want to write a short shell script that fetches water shortages from a web page and puts them in an HTML file. My code works, but with some problems:
1. My language is Turkish, and the page I download with curl is also in Turkish. Fetching the file with curl causes NO encoding problem.
2. My default terminal locale is "en_US.UTF-8". I put "export LANG=tr_TR.UTF-8" in my script to change the shell's language (I checked and reconfigured the locales with "sudo dpkg-reconfigure locales"; tr_TR.UTF-8 is available, but it is not the default).
3. I extract the shortages for my district and put them into a file called temp3.html.
4. The encoding of temp3.html seems wrong: "file --mime temp3.html" reports "utf-8", but the Turkish characters are turned into strange characters. I opened the file in the Kate editor, which uses Turkish encoding by default, with no success.
5. I tried to convert temp3.html to UTF-8 with the iconv -f ... command, but that did not work, since file reports that it is already UTF-8.
How can I get temp3.html to contain the correct Turkish characters? Any help is appreciated, thanks.

My code is below:

Code:

#!/bin/sh

export LC_ALL=tr_TR.UTF-8
export LC_CTYPE=tr_TR.UTF-8
export LANG=tr_TR.UTF-8 # without quotes
export LANGUAGE=tr

FILE=temp.html

ilce="NCAN" # district name - the script could not match "SİNCAN", so I had to write it as "NCAN"!

if [ -f "$FILE" ] 
  then
    #echo "$FILE exists."
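    fullContent="$(pup 'div.box-content' < "$FILE")"
    # $fullContent is left unquoted on purpose: word splitting joins the lines with single spaces, which the sed pattern relies on
    fullContent=$(echo $fullContent | sed 's/<div class="box-content"> <h4 class="heading-primary">//g')
    content="$(pup 'strong text{}' < "$FILE" | grep -i "$ilce")"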
    
    if [ -n "$content" ]  
      then
        echo $ilce " > su kesintisi var - THERE IS a water shortage !!! >>"
        echo '' > temp3.html # clear the file content
        delimiter="</div>"
        while [ -n "$fullContent" ] 
          do
            # ${var%%pattern} removes the longest match of pattern from the end, so this keeps the text before the first delimiter;
            # ${var#pattern} below removes the shortest match from the start, dropping that block and its delimiter
            delimited="${fullContent%%"$delimiter"*}"  
            delimited=$(echo "$delimited" | sed 's/<\/div>//g')
            fullContent=${fullContent#*"$delimiter"}
            # check if delimited contains ilce, if so write that block ! 
            case "$delimited" in
              *"$ilce"*)
                echo "$delimited" # write to terminal 
                echo "$delimited" >> temp3.html # write to file too.
                ;;
            esac
        done
        kate temp3.html  # show the result file
      else
        echo $ilce " > su kesintisi yok - NO! water shortage >>"
    fi
  else 
    echo "$FILE does not exist."
    curl -v https://www.aski.gov.tr/tr/Kesinti.aspx > temp.html
fi

NevemTeve · 06-09-2023, 12:37 AM

As a start, you should examine the input file with a hex-viewer to find out its encoding. An option is `Midnight Commander` F3=View then F4=Hex.
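
The same check can be done from the command line. A minimal sketch, assuming the POSIX od tool is available; sample.txt is a hypothetical stand-in for the downloaded page and holds the word "Arıza" encoded as UTF-8:

```shell
# Write "Arıza" with the dotless i given as its two UTF-8 bytes (C4 B1, octal 304 261).
printf 'Ar\304\261za\n' > sample.txt

# Dump every byte as two hex digits; a correctly encoded file shows "c4 b1" between "72" and "7a".
od -An -tx1 sample.txt
```

If the two bytes of one character show up as four bytes instead, the text has been encoded twice.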

tercel · 06-09-2023, 04:02 AM

Quote:

Originally Posted by NevemTeve

As a start, you should examine the input file with a hex-viewer to find out its encoding. An option is `Midnight Commander` F3=View then F4=Hex.

I used gitview (one of the gnuit tools), but I could not work out the encoding, because the Turkish characters were not even shown in the hex view. For instance, "ğ,ş,ç" are not shown there. When I run

file --mime temp.html

it reports

utf-8

as the encoding. When I run

locale charmap

it gives:

"UTF-8"
as the result. I used kate to view the result file and got some strange characters such as:

"ArÄ±za Tarihi:"

There, "Ä±" should be "ı" (the Turkish dotless i). I tried online tools such as https://dencode.com/ to see which encoding gives the correct result, but none of them did.
I am stuck at this point.

pan64 · 06-09-2023, 04:29 AM

That is why you need to use a hex viewer to check the real content of the file. Then you will know whether the file is correct and kate was wrong, or not.

tercel · 06-09-2023, 04:41 AM

In the hex dump, I found "84c2b1" for "Ä±" (also checked with https://www.binaryhexconverter.com/h...text-converter). A small "ı" should be "C4 B1" and a capital "I" should be "49" in hexadecimal (I used https://www.rapidtables.com/convert/...ii-to-hex.html).

So, how can I find out what the encoding is and how to convert it to the correct one? Do I need to write another shell script to convert all of them?

tercel · 06-09-2023, 04:56 AM

OK, I found the solution by reading https://www.linuxquestions.org/quest...ii-4175521054/ :

Code:

iconv -f utf-8 -t windows-1254 temp3.html -o temp4.html

solved my problem. Thanks for the guidance.
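
A note on why this works, as a sketch rather than a diagnosis of the tool chain: the bytes C3 84 C2 B1 are what results when the UTF-8 bytes of "ı" (C4 B1) are read as a single-byte Windows code page and then encoded as UTF-8 a second time. Such a file is still valid UTF-8, which would explain why file reports utf-8. Converting "to windows-1254" undoes the second encoding, so the output is in fact UTF-8 again. The file names below are hypothetical:

```shell
# The doubly encoded form of "ı": C3 84 C2 B1 (octal 303 204 302 261).
printf '\303\204\302\261' > mojibake.txt

# Undo one layer of encoding; the result is the original two UTF-8 bytes.
iconv -f utf-8 -t windows-1254 mojibake.txt > repaired.txt

# Show the repaired bytes in hex.
od -An -tx1 repaired.txt
```

The last command should print c4 b1, the UTF-8 encoding of "ı".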

tercel · 06-21-2023, 07:34 AM

Hi all. Although I partially solved my problem, today I debugged my code carefully and noticed that the encoding problem comes from the command-line program pup itself.
Before pup runs there is no encoding problem; after it runs, the problem appears. Can anybody enlighten me about the relationship between pup and encoding? Thanks.

pan64 · 06-21-2023, 08:06 AM

do you mean this? https://github.com/ericchiang/pup. It is abandoned.
I guess it is a problem with the locale, but you might need to replace this tool with something better.

tercel · 06-22-2023, 12:13 PM

Quote:

Originally Posted by pan64

do you mean this? https://github.com/ericchiang/pup. It is abandoned.
I guess it is a problem with the locale, but you might need to replace this tool with something better.

Oh my god! Yes, that is the tool I used. Thanks. I used "sed", "cut" and "grep" to do the same, but that is not so easy for beginners.

Do you know any alternative for it?

NevemTeve · 06-22-2023, 12:39 PM

First you should find out what the actual problem is. That requires a hexviewer and some manual work.

teckk · 06-22-2023, 04:57 PM

Code:

echo -e "I used kate to view result file and get some strange characters such as \
Sonuç dosyasını görüntülemek ve bazı garip karakterler elde \
etmek için kate kullandım" > myfile.txt

file myfile.txt
myfile.txt: Unicode text, UTF-8 text

cat myfile.txt | fold -sw 70
I used kate to view result file and get some strange characters such 
as Sonuç dosyasını görüntülemek ve bazı garip karakterler elde 
etmek için kate kullandım

Code:

curl -L https://www.hurriyet.com.tr/ -o myfile.html

file myfile.html
myfile.html: HTML document, Unicode text, UTF-8 text, with very long lines (65518), with no line terminators

cat myfile.html | grep -o '<p>.*</p>' | fold -sw 70
<p>Türkiye'den ve Dünya’dan son dakika haberleri, köşe 
yazıları, magazinden siyasete, spordan seyahate bütün konuların 
tek adresi Hurriyet.com.tr haber içerikleri izin alınmadan, kaynak 
gösterilerek dahi iktibas edilemez. Kanuna aykırı ve izinsiz 
olarak kopyalanamaz, başka yerde yayınlanamaz.</p>

What is the URL that you are trying to scrape? When scraping, python is what first comes to mind.

teckk · 06-22-2023, 05:01 PM

Maybe I should translate that for the forum

Code:

tr: Türkiye'den ve Dünya’dan son dakika haberleri, köşe yazıları, magazinden siyasete, spordan seyahate bütün konuların tek adresi Hurriyet.com.tr haber içerikleri izin alınmadan, kaynak gösterilerek dahi iktibas edilemez. Kanuna aykırı ve izinsiz olarak kopyalanamaz, başka yerde yayınlanamaz.
en: Hurriyet.com.tr, the only address for all topics from Turkey and the World, breaking news, articles, magazines to politics, sports to travel, news content cannot be quoted without permission, even by citing the source.

pan64 · 06-23-2023, 12:50 AM

Quote:

Originally Posted by tercel

Oh my god! Yes, that is the tool I used. Thanks. I used "sed", "cut" and "grep" to do the same, but that is not so easy for beginners.

Do you know any alternative for it?

Yes, nowadays python is the recommended way; you just (obviously) need to learn it.

tercel · 06-23-2023, 07:12 AM

Quote:

What is the URL that you are trying to scrape? When scraping, python is what first comes to mind.

The URL is in the code in my original post (second line from the bottom):

https://www.aski.gov.tr/tr/Kesinti.aspx

I tried code similar to yours to fetch the "div" tag that contains the water shortages:

Code:

grep -o "<div class=\"box-content\">.*</div>" < temp.html

It gave a blank line, although it shouldn't have!
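
One possible explanation, assuming the div in temp.html spans several lines (the real page layout is not shown here): grep matches line by line, so a pattern that needs the opening and the closing tag on the same line finds nothing. A small sketch with a hypothetical two-line fragment in multi.html:

```shell
# Hypothetical fragment: the opening tag and the closing tag sit on different lines.
printf '<div class="box-content">\n<strong>SINCAN</strong></div>\n' > multi.html

# Count matching lines; no single line contains both tags, so the count is 0.
# "|| true" keeps the script going, because grep exits non-zero when nothing matches.
matches=$(grep -c '<div class="box-content">.*</div>' multi.html || true)
echo "lines matched: $matches"

# Joining the lines first lets the same pattern match.
joined=$(tr '\n' ' ' < multi.html | grep -o '<div class="box-content">.*</div>')
echo "$joined"
```

This only works around the line-by-line matching; nested or repeated div tags still need a real HTML parser.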

I have no problem with learning Python, but I think "learn a new language" is not the answer to every situation. Such suggestions remind me of a joke I heard: "I need to subtract 3^47 from 2^48 in shell, how can I do it?" "You need to learn Python!" I am learning shell scripting now.

pan64 · 06-23-2023, 07:26 AM

Obviously you are right, there is no need to learn a new language if you can solve it without one. Unfortunately, parsing HTML is not easy, so it is better to use a proper parser (which is already available, free, and works well). The usual shell tools (grep, awk, sed) are not real parsers, but perl, python and java (for example) have them.