[SOLVED] Shell script - Problem getting and output to file with correct encoding
Hi to all,
I want to write a short shell script that gets water shortages from a web page and puts them in an HTML file. My code works, but with some problems:
1. My language is Turkish, and the page I download with curl is also in Turkish. The file fetched by curl has no encoding problem.
2. My default terminal locale is "en_US.UTF-8". I put the command "export LANG=tr_TR.UTF-8" in my script to change the shell's language. (I checked and reconfigured the locales with "sudo dpkg-reconfigure locales"; tr_TR.UTF-8 is available, but it is not the default.)
3. I extract the shortages for my district and put them into a file called temp3.html.
4. The encoding of temp3.html seems to be wrong: "file --mime temp3.html" reports "utf-8", but the Turkish characters are converted to strange characters. I opened the file in the Kate editor, which uses the Turkish encoding by default, but with no success.
5. I tried to convert temp3.html to UTF-8 with the iconv -f ... command, but that did not work, since file reports that it is already utf-8.
How can I get my result file temp3.html to contain correct Turkish characters? Any help is appreciated, thanks.
I have posted my code below:
Code:
#!/bin/sh
export LC_ALL=tr_TR.UTF-8
export LC_CTYPE=tr_TR.UTF-8
export LANG=tr_TR.UTF-8 # without quotes
export LANGUAGE=tr
FILE=temp.html
ilce="NCAN" # district name - since SİNCAN could not be read by the script, I had to write it as "NCAN"!
if [ -f "$FILE" ]
then
    #echo "$FILE exists."
    fullContent="$(pup 'div.box-content' < "$FILE")"
    # $fullContent is left unquoted on purpose: echo then joins the lines with single spaces, which the sed pattern relies on
    fullContent=$(echo $fullContent | sed 's/<div class="box-content"> <h4 class="heading-primary">//g')
    content="$(pup 'strong text{}' < "$FILE" | grep -i "$ilce")"
    if [ -n "$content" ]
    then
        echo "$ilce  > su kesintisi var - THERE IS a water shortage !!! >>"
        : > temp3.html # truncate the file (echo '' would leave a blank first line)
        delimiter="</div>"
        while [ -n "$fullContent" ]
        do
            # ${string%%substring} deletes the longest match of substring from the back of string,
            # so this keeps everything before the first delimiter
            delimited="${fullContent%%"$delimiter"*}"
            delimited=$(echo "$delimited" | sed 's/<\/div>//g')
            # ${string#substring} deletes the shortest match from the front: drop the block just read.
            # If no delimiter is left, clear the string so the loop cannot run forever.
            case "$fullContent" in
                *"$delimiter"*) fullContent=${fullContent#*"$delimiter"} ;;
                *) fullContent="" ;;
            esac
            # check if delimited contains ilce, if so write that block !
            case "$delimited" in
                *"$ilce"*)
                    echo "$delimited" # write to terminal
                    echo "$delimited" >> temp3.html # write to file too.
                    ;;
            esac
        done
        kate temp3.html # show the result file
    else
        echo "$ilce  > su kesintisi yok - NO! water shortage >>"
    fi
else
    echo "$FILE does not exist."
    curl -v https://www.aski.gov.tr/tr/Kesinti.aspx > "$FILE"
fi
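The comment in the script says that SİNCAN could not be matched, so the fragment "NCAN" was used instead. One possible cause (an assumption, not confirmed in this thread) is locale-dependent case folding of "İ" under grep -i. A minimal sketch that sidesteps this by comparing raw bytes; the helper name match_district is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the lines on stdin that contain the exact
# bytes of $1. LC_ALL=C makes grep compare bytes, and -F disables regex
# interpretation, so no locale-dependent case folding is involved.
match_district() {
    LC_ALL=C grep -F -- "$1"
}

# \304\260 is the UTF-8 encoding of "İ" (U+0130), written as octal
# escapes so the example does not depend on the editor's encoding.
district=$(printf 'S\304\260NCAN')
printf 'S\304\260NCAN\nETIMESGUT\n' | match_district "$district"
```

In the script above this could replace grep -i "$ilce", at the cost of the match becoming case-sensitive.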
As a start, you should examine the input file with a hex-viewer to find out its encoding. An option is `Midnight Commander` F3=View then F4=Hex.
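If no hex viewer is at hand, od can show the same information. A minimal sketch; the helper name hex_bytes is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print stdin as space-separated hex bytes.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# "ı" (U+0131) is the two bytes c4 b1 in UTF-8;
# in ISO-8859-9 (Turkish Latin-5) it would be the single byte fd.
printf '\304\261' | hex_bytes    # prints: c4 b1
```

Running hex_bytes < temp.html | cut -c1-96 on the downloaded file shows its first bytes in the same form, so a multi-byte sequence such as c4 b1 can be told apart from a single-byte encoding.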
I used gitview (one of the gnuit tools), but I could not work out the encoding, because the Turkish characters were not even shown in the hex view. For instance, "ğ, ş, ç" are not shown there. When I run
file --mime temp.html
it reports
utf-8
as the encoding. When I run
locale charmap
it gives:
"UTF-8"
as a result. I used Kate to view the result file and got some strange characters such as:
"Arıza Tarihi:"
There, "ı" should be "I" (capital ı in Turkish). I tried online tools such as https://dencode.com/ to see which encoding gives the correct result, but none of them does.
I am stuck at this point.
Last edited by tercel; 06-09-2023 at 04:10 AM.
Reason: add some extra info.
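A report of "utf-8" from file only means that the bytes are well-formed UTF-8. It cannot tell whether some tool decoded the text with the wrong charset and then wrote it out as UTF-8 again (double encoding), which would also explain why a further iconv to UTF-8 changes nothing. Whether that is what happened to temp3.html is an assumption. The sketch below only demonstrates the effect, using ISO-8859-1 as the assumed wrong charset; the function names are made up:

```shell
#!/bin/sh
# Succeeds only if stdin is well-formed UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Simulate a tool that misreads UTF-8 input as ISO-8859-1 and writes UTF-8.
double_encode() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Undo one layer of that mistake.
undo_double_encode() {
    iconv -f UTF-8 -t ISO-8859-1
}

original=$(printf 'Ar\304\261za')    # "Arıza" as UTF-8 bytes
broken=$(printf '%s' "$original" | double_encode)

printf '%s' "$broken" | is_valid_utf8 && echo "broken text is still valid UTF-8"
repaired=$(printf '%s' "$broken" | undo_double_encode)
[ "$repaired" = "$original" ] && echo "round trip restored the original bytes"
```

If the real file was damaged this way, undo_double_encode < temp3.html would produce readable text; if iconv reports an error instead, a different charset or a different kind of damage is involved.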
Hi to all. Although I have partially solved my problem, today I debugged my code carefully and noticed that the encoding problem is caused by the command-line program pup itself.
Before pup runs there is no encoding problem; the problem arises only after the pup command. Can anybody enlighten me about the relationship between pup and encoding? Thanks.
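A check like this can be repeated for each stage of a pipeline by testing whether the bytes of a known word survive it. A minimal sketch; survives_stage is a made-up helper, and the iconv call is only a stand-in for a misbehaving stage, not a reproduction of what pup does:

```shell
#!/bin/sh
# Hypothetical helper: feed stdin through the given filter command and
# report whether the exact bytes of the needle are still in its output.
survives_stage() {
    needle=$1
    shift
    if "$@" | LC_ALL=C grep -qF -- "$needle"; then
        echo "kept"
    else
        echo "changed"
    fi
}

word=$(printf 'Ar\304\261za')    # "Arıza" as UTF-8 bytes

# cat passes bytes through untouched.
printf '%s\n' "$word" | survives_stage "$word" cat
# Stand-in for a misbehaving stage: re-encodes as if the input were ISO-8859-1.
printf '%s\n' "$word" | survives_stage "$word" iconv -f ISO-8859-1 -t UTF-8
```

Against the real data the call would look like survives_stage "$word" pup 'strong text{}' < temp.html, with a word that is known to appear inside a strong element of the page.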
do you mean this? https://github.com/ericchiang/pup. It is abandoned.
I guess it is a problem with the locale, but you might need to replace this tool with something better.
Oh my god! Yes, that is the tool I used. Thanks. I used "sed", "cut" and "grep" to do the same, but that is not so easy for beginners. Do you know any alternative to it?
Code:
echo -e "I used kate to view result file and get some strange characters such as \
Sonuç dosyasını görüntülemek ve bazı garip karakterler elde \
etmek için kate kullandım" > myfile.txt
file myfile.txt
myfile.txt: Unicode text, UTF-8 text
cat myfile.txt | fold -sw 70
I used kate to view result file and get some strange characters such
as Sonuç dosyasını görüntülemek ve bazı garip karakterler elde
etmek için kate kullandım
Code:
curl -L https://www.hurriyet.com.tr/ -o myfile.html
file myfile.html
myfile.html: HTML document, Unicode text, UTF-8 text, with very long lines (65518), with no line terminators
cat myfile.html | grep -o '<p>.*</p>' | fold -sw 70
<p>Türkiye'den ve Dünya’dan son dakika haberleri, köşe
yazıları, magazinden siyasete, spordan seyahate bütün konuların
tek adresi Hurriyet.com.tr haber içerikleri izin alınmadan, kaynak
gösterilerek dahi iktibas edilemez. Kanuna aykırı ve izinsiz
olarak kopyalanamaz, başka yerde yayınlanamaz.</p>
What is the URL that you are trying to scrape? For scraping, Python is what first comes to mind.
tr: Türkiye'den ve Dünya’dan son dakika haberleri, köşe yazıları, magazinden siyasete, spordan seyahate bütün konuların tek adresi Hurriyet.com.tr haber içerikleri izin alınmadan, kaynak gösterilerek dahi iktibas edilemez. Kanuna aykırı ve izinsiz olarak kopyalanamaz, başka yerde yayınlanamaz.
en: Hurriyet.com.tr, the only address for all topics from Turkey and the World, breaking news, articles, magazines to politics, sports to travel, news content cannot be quoted without permission, even by citing the source.
Yes, nowadays Python is the recommended way; you just need to learn it [obviously].
It gave a blank line, although it shouldn't have!
Python is no problem to learn, but I think "learn a new language" is not an option for every situation. Every such suggestion reminds me of a joke I heard: "I need to subtract 3^47 from 2^48 in shell, how can I do it?" >> "You need to learn Python!" I am learning shell scripting now.
Last edited by tercel; 06-23-2023 at 07:13 AM.
Reason: grammer mistake?
Obviously you are right: there is no need to learn a new language if you can solve the problem without one. Unfortunately, parsing HTML is not easy, so it is better to use a professional parser (one that is already available, free, and works well). The shell tools (grep, awk, sed) do not give us a real parser, but Perl, Python and Java (for example) do.