LinuxQuestions.org
Old 06-08-2023, 02:33 PM   #1
tercel
Member
 
Registered: Aug 2022
Posts: 54
Blog Entries: 23

Rep: Reputation: 0
Shell script - Problem getting and outputting to file with correct encoding


Hi all,
I want to write my own short shell script that gets water shortages from a web page and puts them in an HTML file. My code works, but with some problems:
1. My language is Turkish, and the page I download with curl is also in Turkish. I first fetched the file with curl and had NO encoding problem at that stage.
2. My default terminal locale is "en_US.UTF-8". I put "export LANG=tr_TR.UTF-8" in my script to change the shell's language. (I checked and reconfigured the locales with "sudo dpkg-reconfigure locales"; tr_TR.UTF-8 is available, but it is not the default.)
3. I extract the shortages for my district and put them into a file called temp3.html.
4. The encoding of temp3.html is not correct(?): "file --mime temp3.html" reports "utf-8", but the Turkish characters are converted to strange characters. I opened the file in the Kate editor, which uses Turkish encoding by default, but with no success.
5. I tried to convert temp3.html to UTF-8 (using the iconv -f ... command), but that did not work, since file reports that it is already UTF-8.
How can I get a temp3.html with correct Turkish characters? Any help is appreciated, thanks.

My code is below:

Code:
#!/bin/sh

export LC_ALL=tr_TR.UTF-8
export LC_CTYPE=tr_TR.UTF-8
export LANG=tr_TR.UTF-8 # without quotes
export LANGUAGE=tr

FILE=temp.html

ilce="NCAN" # district name - since SİNCAN could not be read by the script, I had to write it as "NCAN"!

if [ -f "$FILE" ] 
  then
    #echo "$FILE exists."
    fullContent="$(cat "temp.html" | pup 'div.box-content')"
    fullContent=$(echo $fullContent | sed 's/<div class="box-content"> <h4 class="heading-primary">//g') 
    content="$(cat "temp.html" | pup 'strong text{}' | grep -i $ilce)"
    
    if [ -n "$content" ]  
      then
        echo $ilce " > su kesintisi var - THERE IS a water shortage !!! >>"
        echo '' > temp3.html # clear the file content
        delimiter="</div>"
        while [ -n "$fullContent" ] 
          do
            # ${string%%substring}  --> Deletes longest match of $substring from back of $string., similar done below, so removes delimiters ...
            delimited="${fullContent%%"$delimiter"*}"  
            delimited=$(echo "$delimited" | sed 's/<\/div>//g')
            fullContent=${fullContent#*"$delimiter"}
            # check if delimited contains ilce, if so write that block ! 
            case "$delimited" in
              *"$ilce"*)
                echo "$delimited" # write to terminal 
                echo "$delimited" >> temp3.html # write to file too.
                ;;
            esac
        done
        kate temp3.html  # show the result file
      else
        echo $ilce " > su kesintisi yok - NO! water shortage >>"
    fi
  else 
    echo "$FILE does not exist."
    curl -v https://www.aski.gov.tr/tr/Kesinti.aspx > temp.html
fi
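A note on the delimiter loop above: it relies on two POSIX parameter expansions, `${var%%pattern}` to take everything before the first delimiter and `${var#pattern}` to drop everything up to and including it. The following is only a minimal sketch of that technique on made-up sample data (the district names and the `matches` variable are assumptions for illustration, not part of the original script). It also adds a guard for the case where no delimiter is left, because `${content#*"$delimiter"}` returns the string unchanged in that case and the loop could otherwise run forever.

```shell
#!/bin/sh
# Made-up sample: three blocks separated by </div>, two of them mention SINCAN.
content='A SINCAN</div>B CANKAYA</div>C SINCAN</div>'
delimiter='</div>'
matches=''

while [ -n "$content" ]; do
    # Everything before the first delimiter (longest suffix match removed).
    block=${content%%"$delimiter"*}

    # Drop the block and its delimiter; stop if no delimiter remains.
    case "$content" in
        *"$delimiter"*) content=${content#*"$delimiter"} ;;
        *)              content='' ;;
    esac

    # Keep only the blocks that mention the district.
    case "$block" in
        *SINCAN*) matches="$matches[$block]" ;;
    esac
done

echo "$matches"
```

Quoting `"$delimiter"` inside the expansion makes the shell treat it as literal text rather than as a glob pattern.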
 
Old 06-09-2023, 12:37 AM   #2
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,866
Blog Entries: 1

Rep: Reputation: 1869
As a start, you should examine the input file with a hex-viewer to find out its encoding. An option is `Midnight Commander` F3=View then F4=Hex.
 
Old 06-09-2023, 04:02 AM   #3
tercel
Member
 
Registered: Aug 2022
Posts: 54

Original Poster
Blog Entries: 23

Rep: Reputation: 0
Quote:
Originally Posted by NevemTeve View Post
As a start, you should examine the input file with a hex-viewer to find out its encoding. An option is `Midnight Commander` F3=View then F4=Hex.
I used gitview (one of the gnuit tools), but I could not work out the encoding, because the Turkish characters were not even shown in the hex view. For instance, "ğ, ş, ç" do not appear there. When I ran

file --mime temp.html

it reported

utf-8

as the encoding. When I ran

locale charmap

it gave

"UTF-8"
as the result. I used Kate to view the result file and got some strange characters, such as:

"Arıza Tarihi:"

There, "ı" should be "I" (capital ı in Turkish). I tried online tools such as https://dencode.com/ to see which encoding gives the correct result, but none of them did.
I am stuck at this point.

Last edited by tercel; 06-09-2023 at 04:10 AM. Reason: add some extra info.
 
Old 06-09-2023, 04:29 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,863

Rep: Reputation: 7311
That's why you need to use a hex viewer to check the real content of the file. Then you will know whether the file is correct and Kate is wrong (or not).
 
Old 06-09-2023, 04:41 AM   #5
tercel
Member
 
Registered: Aug 2022
Posts: 54

Original Poster
Blog Entries: 23

Rep: Reputation: 0
In the hex dump, I found "84c2b1" for "ı" (also checked with https://www.binaryhexconverter.com/h...text-converter). Small "ı" should be "C4 B1" and capital "I" should be "20 49" in hexadecimal (I used https://www.rapidtables.com/convert/...ii-to-hex.html).

So, how can I find out what the encoding is and how to convert it to the correct one? Do I need to write another shell script to convert all of them?
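One possible reading of those bytes, offered as an assumption rather than a confirmed diagnosis: "84c2b1" looks like the tail of C3 84 C2 B1, which is what you get when the correct UTF-8 bytes for "ı" (C4 B1) are misread as a single-byte Turkish code page and then encoded to UTF-8 a second time. The sketch below reproduces that with `iconv` and dumps the bytes with `od`; the variable names are made up for illustration.

```shell
#!/bin/sh
# "ı" (U+0131) is the two bytes C4 B1 in UTF-8; \304\261 are those bytes in octal.
correct=$(printf '\304\261' | od -An -tx1 | tr -d ' \n')

# Simulate the damage: pretend the UTF-8 bytes are windows-1254 text
# and convert them to UTF-8 a second time.
broken=$(printf '\304\261' | iconv -f windows-1254 -t utf-8 | od -An -tx1 | tr -d ' \n')

echo "correct: $correct"
echo "broken:  $broken"
```

If the second line ends in 84c2b1, the file is most likely valid UTF-8 that has been encoded twice, which would also explain why `file` still reports utf-8.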
 
Old 06-09-2023, 04:56 AM   #6
tercel
Member
 
Registered: Aug 2022
Posts: 54

Original Poster
Blog Entries: 23

Rep: Reputation: 0
OK, I found the solution by reading https://www.linuxquestions.org/quest...ii-4175521054/ :

Code:
iconv -f utf-8 -t windows-1254 temp3.html -o temp4.html
solved my problem. Thanks for the guidance.
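Why this command helps, under the assumption that the file was double-encoded: converting from UTF-8 to windows-1254 undoes one layer of encoding, so the output bytes are the original UTF-8 again, even though iconv treats them as windows-1254. A minimal sketch on a single character (the `repaired` and `valid` variables are illustrative only):

```shell
#!/bin/sh
# C3 84 C2 B1 is the double-encoded form of "ı".
# Converting it "to windows-1254" strips one encoding layer and
# leaves the plain UTF-8 bytes C4 B1.
repaired=$(printf '\303\204\302\261' | iconv -f utf-8 -t windows-1254 | od -An -tx1 | tr -d ' \n')
echo "repaired bytes: $repaired"

# The result should itself be valid UTF-8; iconv exits non-zero if it is not.
if printf '\303\204\302\261' | iconv -f utf-8 -t windows-1254 | iconv -f utf-8 -t utf-8 >/dev/null; then
    valid=yes
else
    valid=no
fi
echo "valid UTF-8: $valid"
```

If that assumption holds, temp4.html should be opened as UTF-8 afterwards, not as windows-1254.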
 
Old 06-21-2023, 07:34 AM   #7
tercel
Member
 
Registered: Aug 2022
Posts: 54

Original Poster
Blog Entries: 23

Rep: Reputation: 0
Hi all. Although I partially solved my problem, today I debugged my code carefully and noticed that the encoding problem is caused by the command-line program pup itself.
Before pup is used there is no encoding problem; after the pup command runs, the encoding problem appears. Can anybody enlighten me about the relationship between pup and encoding? Thanks.
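One way to confirm which stage introduces the damage is to look for a known character in the output of each stage. The following is only a sketch: `check_stage` is a hypothetical helper, and it assumes the damage is the double encoding of "ı" discussed earlier in the thread.

```shell
#!/bin/sh
dotless_i=$(printf '\304\261')          # correct UTF-8 bytes for "ı"
mojibake=$(printf '\303\204\302\261')   # the same character encoded twice

# Hypothetical helper: reads one pipeline stage on stdin and reports
# which form of the sample character it contains.
check_stage() {
    data=$(cat)
    case "$data" in
        *"$mojibake"*)  echo "double-encoded" ;;
        *"$dotless_i"*) echo "ok" ;;
        *)              echo "no sample character found" ;;
    esac
}

# Made-up inputs standing in for the text before and after the suspect tool.
before=$(printf 'Ar\304\261za Tarihi' | check_stage)
after=$(printf 'Ar\303\204\302\261za Tarihi' | check_stage)

echo "before: $before"
echo "after:  $after"
```

On the real data this could be run as `check_stage < temp.html` and then `pup 'div.box-content' < temp.html | check_stage`, to see which stage first reports "double-encoded".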
 
Old 06-21-2023, 08:06 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,863

Rep: Reputation: 7311
Do you mean this? https://github.com/ericchiang/pup. It is abandoned.
I guess it is a problem with the locale, but you might need to replace this tool with something better.
 
Old 06-22-2023, 12:13 PM   #9
tercel
Member
 
Registered: Aug 2022
Posts: 54

Original Poster
Blog Entries: 23

Rep: Reputation: 0
Quote:
Originally Posted by pan64 View Post
do you mean this? https://github.com/ericchiang/pup. It is abandoned.
I guess it is a problem with the locale, but you might need to replace this tool with something better.
Oh my god! Yes, that is the tool I used. Thanks. I have used "sed", "cut" and "grep" to do the same, but that is not so easy for beginners. Do you know of any alternative to it?
 
Old 06-22-2023, 12:39 PM   #10
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,866
Blog Entries: 1

Rep: Reputation: 1869
First you should find out what the actual problem is. That requires a hexviewer and some manual work.
 
Old 06-22-2023, 04:57 PM   #11
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827
Code:
echo -e "I used kate to view result file and get some strange characters such as \
Sonuç dosyasını görüntülemek ve bazı garip karakterler elde \
etmek için kate kullandım" > myfile.txt

file myfile.txt
myfile.txt: Unicode text, UTF-8 text

cat myfile.txt | fold -sw 70
I used kate to view result file and get some strange characters such 
as Sonuç dosyasını görüntülemek ve bazı garip karakterler elde 
etmek için kate kullandım

Code:
curl -L https://www.hurriyet.com.tr/ -o myfile.html

file myfile.html
myfile.html: HTML document, Unicode text, UTF-8 text, with very long lines (65518), with no line terminators

cat myfile.html | grep -o '<p>.*</p>' | fold -sw 70
<p>Türkiye'den ve Dünya’dan son dakika haberleri, köşe 
yazıları, magazinden siyasete, spordan seyahate bütün konuların 
tek adresi Hurriyet.com.tr haber içerikleri izin alınmadan, kaynak 
gösterilerek dahi iktibas edilemez. Kanuna aykırı ve izinsiz 
olarak kopyalanamaz, başka yerde yayınlanamaz.</p>
What is the URL that you are trying to scrape? When scraping, Python is what first comes to mind.
 
Old 06-22-2023, 05:01 PM   #12
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827
Maybe I should translate that for the forum
Code:
tr: Türkiye'den ve Dünya’dan son dakika haberleri, köşe yazıları, magazinden siyasete, spordan seyahate bütün konuların tek adresi Hurriyet.com.tr haber içerikleri izin alınmadan, kaynak gösterilerek dahi iktibas edilemez. Kanuna aykırı ve izinsiz olarak kopyalanamaz, başka yerde yayınlanamaz.
en: Breaking news from Turkey and the world, columns, and every topic from celebrity news to politics and from sports to travel at a single address. Hurriyet.com.tr news content may not be quoted without permission, even with attribution. It may not be copied or published elsewhere unlawfully and without permission.
 
Old 06-23-2023, 12:50 AM   #13
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,863

Rep: Reputation: 7311
Quote:
Originally Posted by tercel View Post
Oh my god! Yes, that is the tool I used. Thanks. I have used "sed", "cut" and "grep" to do the same, but that is not so easy for beginners. Do you know of any alternative to it?
Yes, nowadays Python is the recommended way; you just [obviously] need to learn it.
 
Old 06-23-2023, 07:12 AM   #14
tercel
Member
 
Registered: Aug 2022
Posts: 54

Original Poster
Blog Entries: 23

Rep: Reputation: 0
Quote:
What is the URL that you are trying to scrape? When scraping, Python is what first comes to mind.
The URL, which is in the code in my original post (second line from the bottom), is:

https://www.aski.gov.tr/tr/Kesinti.aspx


I tried code similar to what you wrote to fetch the "div" tag that contains the water shortages:
Code:
grep -o "<div class=\"box-content\">.*</div>" < temp.html
It should not have given a blank line, but it did!
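A likely explanation, assuming the element spans several lines in temp.html: grep matches one line at a time, so `.*` can never reach a `</div>` that sits on a later line. The sketch below uses a made-up sample instead of the real page and joins the lines with `tr` before matching; the `html`, `per_line` and `flattened` names are illustrative only.

```shell
#!/bin/sh
# Made-up sample standing in for temp.html: the div spans several lines.
html='<body>
<div class="box-content">
<strong>SINCAN</strong>
</div>
</body>'

# Line by line: no single line holds both the opening and the closing tag.
# grep exits non-zero when nothing matches, hence the "|| true".
per_line=$(printf '%s\n' "$html" | grep -o '<div class="box-content">.*</div>' || true)

# Flattened: join the lines first so the pattern can span the whole element.
flattened=$(printf '%s\n' "$html" | tr '\n' ' ' | grep -o '<div class="box-content">.*</div>' || true)

echo "per line:  [$per_line]"
echo "flattened: [$flattened]"
```

Note that `.*` is greedy and runs to the last `</div>` in the flattened text, so on a page with nested or repeated divs this still returns more than one element.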

I have no problem with learning Python, but I think "learning a new language" is not an option for every situation. Every such suggestion reminds me of the joke I heard: "I need to extract 3^47 from 2^48 in shell, how can I do it?" >> "You need to learn Python!" I am learning shell scripting now.

Last edited by tercel; 06-23-2023 at 07:13 AM. Reason: grammer mistake?
 
Old 06-23-2023, 07:26 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,863

Rep: Reputation: 7311
Quote:
Originally Posted by tercel View Post
The URL, which is in the code in my original post (second line from the bottom), is:

https://www.aski.gov.tr/tr/Kesinti.aspx


I tried code similar to what you wrote to fetch the "div" tag that contains the water shortages:
Code:
grep -o "<div class=\"box-content\">.*</div>" < temp.html
It should not have given a blank line, but it did!

I have no problem with learning Python, but I think "learning a new language" is not an option for every situation. Every such suggestion reminds me of the joke I heard: "I need to extract 3^47 from 2^48 in shell, how can I do it?" >> "You need to learn Python!" I am learning shell scripting now.
Obviously you are right: there is no need to learn a new language if you can solve it without one. Unfortunately, parsing HTML is not easy, so it is better to use a professional parser (which is already available, free, and works well). There is no real parser among the shell tools (grep, awk, sed), but there are parsers in Perl, Python and Java, for example.
 
Tags
curl, encoding, language, shell scripting


