[SOLVED] Shell script - Problem getting and output to file with correct encoding

shruggy · 06-23-2023, 07:35 AM

pup understands option --charset:

Code:

pup --charset UTF-8 div.box-content <temp.html

tercel · 06-23-2023, 07:42 AM

Quote:

Originally Posted by shruggy

pup understands option --charset:

Code:

pup --charset UTF-8 div.box-content <temp.html

Thanks shruggy, I tried it but again result text has strange chars, But it is good to learn there is an such option.

shruggy · 06-23-2023, 07:58 AM

Well, I tried it, too, and the result was the correct encoding.

Code:

$ pup --charset UTF-8 div.box-content <temp.html|head
<div class="box-content">
 <h4 class="heading-primary">
  <strong>
   GÖLBAŞI
  </strong>
 </h4>
 <p>
  <b>
   Arıza Tarihi:
  </b>

Compare it to

Code:

$ pup div.box-content <temp.html|head
<div class="box-content">
 <h4 class="heading-primary">
  <strong>
   GÃ–LBAÅžI
  </strong>
 </h4>
 <p>
  <b>
   ArÄ±za Tarihi:
  </b>

tercel · 06-23-2023, 08:11 AM

Quote:

Originally Posted by shruggy

Well, I tried it, too, and the result was the correct encoding.

Code:

$ pup --charset UTF-8 div.box-content <temp.html|head
<div class="box-content">
 <h4 class="heading-primary">
  <strong>
   GÖLBAŞI
  </strong>
 </h4>
 <p>
  <b>
   Arıza Tarihi:
  </b>

Compare it to

Code:

$ pup div.box-content <temp.html|head
<div class="box-content">
 <h4 class="heading-primary">
  <strong>
   GÃ–LBAÅžI
  </strong>
 </h4>
 <p>
  <b>
   ArÄ±za Tarihi:
  </b>

Yes, it is my fault, this time kate mislead me! Also I noticed that I used pup two times, so I need to change both of them ! Result is perfect ! Thanks shruggy!

teckk · 06-23-2023, 08:47 AM

I'll give you a shove.

Code:

curl -L https://www.aski.gov.tr/tr/Kesinti.aspx -o myfile.html

myfile.py

Code:

#!/usr/bin/python

from bs4 import BeautifulSoup
from urllib import request

#User agent
agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:109.0) '
        'Gecko/20100101 Firefox/114.0')
        
user_agent = {'User-Agent': agent,
            'Accept': 'text/html,application/xhtml+xml,'
            'application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
        
url = 'file:///path/to/myfile.html'
        
req = request.Request(url, data=None, headers={'User-Agent': user_agent})
get = request.urlopen(req)
page = get.read().decode('utf-8', 'ignore')

parse = BeautifulSoup(page, "lxml").find_all("div", attrs={"class": "box-content"})

with open('myfile.log', 'w') as f:
    for i in parse:
        print(i.get_text())
        f.write(i.get_text())

Code:

python ./myfile.py

KAZAN

Arıza Tarihi: 23.06.2023 10:30:00
Tamir Tarihi: 23.06.2023 22:00:00
Detay: Kahramankazan Saray Mahallesi, Bayraktar Caddesi içerisinde bulunan Ø280 mm şebeke hattında meydana gelen arıza sebebi ile kesinti yapılmaktadır.
Etkilenen Yerler: Saray Mahallesi
                                                    


ELMADAĞ

Arıza Tarihi: 23.06.2023 14:15:00
Tamir Tarihi: 23.06.2023 16:30:00
Detay: YENİPINAR MH HAYDAR MANAV CD İÇERİSİNDEKİ SU ARIZASINDAN DOLAYI 14:15 VE 16:30 SAATLERİ ARASI SU KESİNTİSİ YAPILACAKTİR.
Etkilenen Yerler: YENİPINAR VE YENİDOĞAN MH.

And the reason that I did it that way...that site is being a little obnoxious with ssl handshake. Let curl take care of it, until you learn more about python and ssl.

tercel · 06-23-2023, 09:02 AM

Thanks teckk for beautiful code and organized results ! That is the reason all masters advise me to use python!

I am now trying to get rid of the problem of "grep -o" code as you did in earlier post. I noticed that :

Code:

grep -o "<div class=\"box-content\">.*</div>" < temp.html

did not work since there is a CR or LF between them(if I am not mistaken).

So I trimmed them by using :

Code:

sed -i 's/\n//; s/\r//g' temp.html

and then try your code, but it again failed. What can be the problem with grep -o? Can you help about that too? Thanks.

shruggy · 06-23-2023, 09:04 AM

Quote:

Originally Posted by tercel

Also I noticed that I used pup two times, so I need to change both of them!

No, you don't have to use it twice. You can select the correct div.box-content at once:

Code:

pup --charset UTF-8 ':parent-of(:parent-of(:contains("GÖLBAŞI"))) text{}'

I specified "GÖLBAŞI" because that location is currently present there. Just replace "GÖLBAŞI" with "SİNCAN" (since you specified UTF-8, pup will understand accented characters).

teckk · 06-23-2023, 09:33 AM

I think that this is a problem I'm having with arch and a recent openssl update.

Can someone test this and see if you get a URLError or not. Then I will know that its that Microsoft-IIS/10.0 server and not me.

Code:

#!/usr/bin/python

from bs4 import BeautifulSoup
from urllib import request, error

#User agent
agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:109.0) '
        'Gecko/20100101 Firefox/114.0')
        
user_agent = {'User-Agent': agent,
            'Accept': 'text/html,application/xhtml+xml,'
            'application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
        
url = 'https://www.aski.gov.tr/tr/Kesinti.aspx'
        
try:
    req = request.Request(url, data=None, headers=user_agent)
    get = request.urlopen(req)

except error.HTTPError as e:
    print("Http error")
except error.URLError as e:
    print('Url Error')
except TypeError as e:
    print('Type Error')
except ValueError as e:
    print("Value error")    

try:
    page = get.read().decode('utf-8', 'ignore')
    
except NameError:
    print('Name error')

parse = BeautifulSoup(page, "lxml").find_all("div", attrs={"class": "box-content"})

with open('myfile.log', 'w') as f:
    for i in parse:
        print(i.get_text())
        f.write(i.get_text())

shruggy · 06-23-2023, 09:36 AM

@teckk. I've got no problems scraping that page with your code (Debian 12).

@OP. Ah I see, GÖLBAŞI is not there already. So, try the pup selector from my previous post with KAZAN or ELMADAĞ instead.

teckk · 06-23-2023, 09:37 AM

Ok thanks. It's me, or arch that is.

tercel · 06-23-2023, 09:43 AM

Thanks shruggy, as you said, no need to use pup twice. When I changed my code as

Code:

content=$(pup --charset UTF-8 ":parent-of(:parent-of(:contains(\"$ilce\")))" < temp.html)
echo $content

will gave the correct result ! ( I used https://github.com/ericchiang/pup/issues/79 to cope with Selector parsing error: Malformed 'contains("")' selector error of pup).

teckk · 06-23-2023, 10:01 AM

Code:

a=$(curl -L https://www.aski.gov.tr/tr/Kesinti.aspx | awk -F '[<|>]' '/b/{print $3 $5}')

echo "$a" | tr -d '\n\t\r'

teckk · 06-23-2023, 10:04 AM

And... I wanted to know what that said

Code:

tr: Ankara Su KesintiArıza Tarihi: 23.06.2023 10:30:00Tamir Tarihi: 23.06.2023 22:00:00Detay: Kahramankazan Saray Mahallesi, Bayraktar Caddesi içerisinde bulunan Ø280 mm şebeke hattında meydana gelen arıza sebebi ile kesinti yapılmaktadır.Etkilenen Yerler: Saray MahallesiArıza Tarihi: 23.06.2023 14:15:00Tamir Tarihi: 23.06.2023 16:30:00Detay: YENİPINAR MH HAYDAR MANAV CD İÇERİSİNDEKİ SU ARIZASINDAN DOLAYI 14:15 VE 16:30 SAATLERİ ARASI SU KESİNTİSİ YAPILACAKTİR.Etkilenen Yerler: YENİPINAR VE YENİDOĞAN MH.Genel Müdürlük Çalışma SaatleriSosyal MedyaUygulamalarımız

en: Ankara Water Outage Date of Failure: 23.06.2023 10:30:00 Date of Repair: 23.06.2023 22:00:00Detail: The 280 mm mains line in Kahramankazan Saray Mahallesi, Bayraktar Caddesi is being cut due to the fault that occurred. Affected Places: Saray Mahallesi Arza Date : 23.06.2023 14:15:00 Date of Repair: 23.06.2023 16:30:00Detail: YENPINAR MH HAYDAR MANAV CD WATER SHUT DOWN BETWEEN 14:15 AND 16:30 HOURS DUE TO ERSNDEK WATER FAILURE. Affected Locations: YENPINAR AND YENDOAN. General Office Working HoursOur Social Media Applications

tercel · 06-23-2023, 10:21 AM

I used the "tr" command in my code too. But as a result, it gives all the content starting from "<div class="box-content>" line to the end of the page.
So my "sed" trimming not worked but "tr" trimming worked.

However,

Code:

trimmedText=$(tr -d '\n\t\r' <temp.html)
echo $trimmedText > temp3.html
content=$(grep -o "<div class=\"box-content\">.*</div>" < temp3.html)

or

Code:

  
trimmedText=$(tr -d '\n\t\r' <temp.html)
echo $trimmedText > temp3.html
content=$(grep -o -P '(?<=<div class=\"box-content\">).*(?=</div>)' < temp3.html)

gave the same result, not the content between patterns, but the content upto the last closing div tag ("</div>") that is near the end of the file!

Can I say that "pup" solution is better than "grep"?

shruggy · 06-24-2023, 06:20 AM

Quote:

Originally Posted by tercel

Can I say that "pup" solution is better than "grep"?

Absolutely. pup is a specialized tool for parsing HTML.

BTW, add text{} at the end of your pup expression to get rid of HTML tags.

Quote:

So my "sed" trimming not worked but "tr" trimming worked.

sed is a line-oriented tool, but HTML is not a line-oriented format. In your code

Code:

sed -i 's/\n//; s/\r//g' temp.html

only the second substitution would work. Pattern space (which is usually a line of text in sed) doesn't include the closing newline character. You can make sed work across the lines (e.g. using the N command), but that's not so trivial.