Old 06-23-2023, 07:35 AM   #16
shruggy

pup understands option --charset:
Code:
pup --charset UTF-8 div.box-content <temp.html
 
Old 06-23-2023, 07:42 AM   #17
tercel
Original Poster
Quote:
Originally Posted by shruggy View Post
pup understands option --charset:
Code:
pup --charset UTF-8 div.box-content <temp.html
Thanks shruggy. I tried it, but the resulting text still has strange characters. It is good to learn that such an option exists, though.

 
Old 06-23-2023, 07:58 AM   #18
shruggy
Well, I tried it, too, and the result was the correct encoding.
Code:
$ pup --charset UTF-8 div.box-content <temp.html|head
<div class="box-content">
 <h4 class="heading-primary">
  <strong>
   GÖLBAŞI
  </strong>
 </h4>
 <p>
  <b>
   Arıza Tarihi:
  </b>
Compare it to
Code:
$ pup div.box-content <temp.html|head
<div class="box-content">
 <h4 class="heading-primary">
  <strong>
   GÖLBAŞI
  </strong>
 </h4>
 <p>
  <b>
   Arıza Tarihi:
  </b>

Old 06-23-2023, 08:11 AM   #19
tercel
Original Poster
Quote:
Originally Posted by shruggy View Post
Well, I tried it, too, and the result was the correct encoding.
Code:
$ pup --charset UTF-8 div.box-content <temp.html|head
<div class="box-content">
 <h4 class="heading-primary">
  <strong>
   GÖLBAŞI
  </strong>
 </h4>
 <p>
  <b>
   Arıza Tarihi:
  </b>
Compare it to
Code:
$ pup div.box-content <temp.html|head
<div class="box-content">
 <h4 class="heading-primary">
  <strong>
   GÖLBAŞI
  </strong>
 </h4>
 <p>
  <b>
   Arıza Tarihi:
  </b>
Yes, it is my fault; this time Kate misled me! I also noticed that I used pup twice, so I needed to change both occurrences. The result is perfect! Thanks shruggy!

 
Old 06-23-2023, 08:47 AM   #20
teckk
I'll give you a shove.

Code:
curl -L https://www.aski.gov.tr/tr/Kesinti.aspx -o myfile.html
myfile.py
Code:
#!/usr/bin/python

from bs4 import BeautifulSoup
from urllib import request

#User agent
agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:109.0) '
        'Gecko/20100101 Firefox/114.0')
        
user_agent = {'User-Agent': agent,
            'Accept': 'text/html,application/xhtml+xml,'
            'application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
        
url = 'file:///path/to/myfile.html'
        
req = request.Request(url, data=None, headers=user_agent)
get = request.urlopen(req)
page = get.read().decode('utf-8', 'ignore')

parse = BeautifulSoup(page, "lxml").find_all("div", attrs={"class": "box-content"})

with open('myfile.log', 'w') as f:
    for i in parse:
        print(i.get_text())
        f.write(i.get_text())
Code:
python ./myfile.py

KAZAN

Arıza Tarihi: 23.06.2023 10:30:00
Tamir Tarihi: 23.06.2023 22:00:00
Detay: Kahramankazan Saray Mahallesi, Bayraktar Caddesi içerisinde bulunan Ø280 mm şebeke hattında meydana gelen arıza sebebi ile kesinti yapılmaktadır.
Etkilenen Yerler: Saray Mahallesi
                                                    


ELMADAĞ

Arıza Tarihi: 23.06.2023 14:15:00
Tamir Tarihi: 23.06.2023 16:30:00
Detay: YENİPINAR MH HAYDAR MANAV CD İÇERİSİNDEKİ SU ARIZASINDAN DOLAYI 14:15 VE 16:30 SAATLERİ ARASI SU KESİNTİSİ YAPILACAKTİR.
Etkilenen Yerler: YENİPINAR VE YENİDOĞAN MH.
And the reason I did it that way: that site is being a little obnoxious with the SSL handshake. Let curl take care of it until you learn more about Python and SSL.

 
Old 06-23-2023, 09:02 AM   #21
tercel
Original Poster
Thanks teckk for the beautiful code and the organized results! That is the reason all the masters advise me to use Python!

I am now trying to get rid of the problem with the grep -o code you used in an earlier post. I noticed that:

Code:
grep -o "<div class=\"box-content\">.*</div>" < temp.html
did not work, since there is a CR or LF between them (if I am not mistaken).

So I trimmed them by using:
Code:
sed -i 's/\n//; s/\r//g' temp.html
and then tried your code, but it failed again. What could be the problem with grep -o? Can you help with that too? Thanks.

 
Old 06-23-2023, 09:04 AM   #22
shruggy
Quote:
Originally Posted by tercel View Post
I also noticed that I used pup twice, so I needed to change both occurrences.
No, you don't have to use it twice. You can select the correct div.box-content at once:
Code:
pup --charset UTF-8 ':parent-of(:parent-of(:contains("GÖLBAŞI"))) text{}' <temp.html
I specified "GÖLBAŞI" because that location is currently present there. Just replace "GÖLBAŞI" with "SİNCAN" (since you specified UTF-8, pup will understand accented characters).

Old 06-23-2023, 09:33 AM   #23
teckk
I think this is a problem I'm having with Arch and a recent OpenSSL update.

Can someone test this and see if you get a URLError or not? Then I will know whether it's that Microsoft-IIS/10.0 server and not me.
Code:
#!/usr/bin/python

from bs4 import BeautifulSoup
from urllib import request, error

#User agent
agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:109.0) '
        'Gecko/20100101 Firefox/114.0')
        
user_agent = {'User-Agent': agent,
            'Accept': 'text/html,application/xhtml+xml,'
            'application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
        
url = 'https://www.aski.gov.tr/tr/Kesinti.aspx'
        
try:
    req = request.Request(url, data=None, headers=user_agent)
    get = request.urlopen(req)

except error.HTTPError as e:
    print("Http error")
except error.URLError as e:
    print('Url Error')
except TypeError as e:
    print('Type Error')
except ValueError as e:
    print("Value error")    

try:
    page = get.read().decode('utf-8', 'ignore')
    
except NameError:
    print('Name error')

parse = BeautifulSoup(page, "lxml").find_all("div", attrs={"class": "box-content"})

with open('myfile.log', 'w') as f:
    for i in parse:
        print(i.get_text())
        f.write(i.get_text())
 
Old 06-23-2023, 09:36 AM   #24
shruggy
@teckk. I've got no problems scraping that page with your code (Debian 12).

@OP. Ah, I see, GÖLBAŞI is no longer there. So try the pup selector from my previous post with KAZAN or ELMADAĞ instead.

 
Old 06-23-2023, 09:37 AM   #25
teckk
OK, thanks. It's me then, or Arch, that is.
 
Old 06-23-2023, 09:43 AM   #26
tercel
Original Poster
Thanks shruggy. As you said, there is no need to use pup twice. When I changed my code to

Code:
content=$(pup --charset UTF-8 ":parent-of(:parent-of(:contains(\"$ilce\")))" < temp.html)
echo $content
it gave the correct result! (I used https://github.com/ericchiang/pup/issues/79 to cope with pup's "Selector parsing error: Malformed 'contains("")' selector" error.)
 
Old 06-23-2023, 10:01 AM   #27
teckk
Code:
# split fields on the characters <, | and >; for lines containing "b", print fields 3 and 5
a=$(curl -L https://www.aski.gov.tr/tr/Kesinti.aspx | awk -F '[<|>]' '/b/{print $3 $5}')

# delete newlines, tabs and carriage returns to join everything into one line
echo "$a" | tr -d '\n\t\r'
 
Old 06-23-2023, 10:04 AM   #28
teckk
And... I wanted to know what that said
Code:
tr: Ankara Su KesintiArıza Tarihi: 23.06.2023 10:30:00Tamir Tarihi: 23.06.2023 22:00:00Detay: Kahramankazan Saray Mahallesi, Bayraktar Caddesi içerisinde bulunan Ø280 mm şebeke hattında meydana gelen arıza sebebi ile kesinti yapılmaktadır.Etkilenen Yerler: Saray MahallesiArıza Tarihi: 23.06.2023 14:15:00Tamir Tarihi: 23.06.2023 16:30:00Detay: YENİPINAR MH HAYDAR MANAV CD İÇERİSİNDEKİ SU ARIZASINDAN DOLAYI 14:15 VE 16:30 SAATLERİ ARASI SU KESİNTİSİ YAPILACAKTİR.Etkilenen Yerler: YENİPINAR VE YENİDOĞAN MH.Genel Müdürlük Çalışma SaatleriSosyal MedyaUygulamalarımız

en: Ankara Water Outage Date of Failure: 23.06.2023 10:30:00 Date of Repair: 23.06.2023 22:00:00Detail: The 280 mm mains line in Kahramankazan Saray Mahallesi, Bayraktar Caddesi is being cut due to the fault that occurred. Affected Places: Saray Mahallesi Arza Date : 23.06.2023 14:15:00 Date of Repair: 23.06.2023 16:30:00Detail: YENPINAR MH HAYDAR MANAV CD WATER SHUT DOWN BETWEEN 14:15 AND 16:30 HOURS DUE TO ERSNDEK WATER FAILURE. Affected Locations: YENPINAR AND YENDOAN. General Office Working HoursOur Social Media Applications
 
Old 06-23-2023, 10:21 AM   #29
tercel
Original Poster
I used the "tr" command in my code too. But as a result, it gives all the content starting from "<div class="box-content>" line to the end of the page.
So my "sed" trimming not worked but "tr" trimming worked.

However,
Code:
trimmedText=$(tr -d '\n\t\r' <temp.html)
echo $trimmedText > temp3.html
content=$(grep -o "<div class=\"box-content\">.*</div>" < temp3.html)
or

Code:
  
trimmedText=$(tr -d '\n\t\r' <temp.html)
echo $trimmedText > temp3.html
content=$(grep -o -P '(?<=<div class=\"box-content\">).*(?=</div>)' < temp3.html)
gave the same result: not the content between the patterns, but everything up to the last closing div tag (</div>) near the end of the file!
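I suspect the problem is that .* is greedy, so the match runs on to the last </div> in the file. Maybe a non-greedy quantifier would stop at the first closing tag instead; something like this (just a guess on my side, not fully tested, and it would still stop too early if box-content contains nested divs):
Code:
trimmedText=$(tr -d '\n\t\r' <temp.html)
echo "$trimmedText" > temp3.html
# .*? is non-greedy, so each match should end at the first </div> after the opening tag
content=$(grep -o -P '<div class="box-content">.*?</div>' < temp3.html)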

Can I say that "pup" solution is better than "grep"?

 
Old 06-24-2023, 06:20 AM   #30
shruggy
Quote:
Originally Posted by tercel View Post
Can I say that "pup" solution is better than "grep"?
Absolutely. pup is a specialized tool for parsing HTML.

BTW, add text{} at the end of your pup expression to get rid of HTML tags.
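For example, taking the command from your script (assuming temp.html is still the downloaded page and $ilce holds the district name):
Code:
content=$(pup --charset UTF-8 ":parent-of(:parent-of(:contains(\"$ilce\"))) text{}" <temp.html)
echo "$content"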

Quote:
So my "sed" trimming not worked but "tr" trimming worked.
sed is a line-oriented tool, but HTML is not a line-oriented format. In your code
Code:
sed -i 's/\n//; s/\r//g' temp.html
only the second substitution will work. The pattern space (which is usually a single line of text in sed) doesn't include the trailing newline, so s/\n// has nothing to match. You can make sed work across lines (e.g. using the N command), but that's not trivial.
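If you really want to do it with sed, GNU sed has a -z option that reads the input as NUL-separated records, so for a normal HTML file the whole file ends up in one pattern space and the newlines can be deleted. A minimal sketch, assuming GNU sed (tr -d '\n\r', as teckk showed, remains the simpler tool here):
Code:
# GNU sed only: -z makes the whole file one pattern space, so \n becomes matchable
sed -z 's/\r//g; s/\n//g' temp.html > temp3.html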
 