[SOLVED] Shell script - Problem getting output to a file with the correct encoding
Yes, it is my fault; this time Kate misled me! I also noticed that I used pup twice, so I needed to change both occurrences. The result is perfect! Thanks shruggy!
Code:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib import request

# User agent and request headers
agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:109.0) '
         'Gecko/20100101 Firefox/114.0')
user_agent = {'User-Agent': agent,
              'Accept': 'text/html,application/xhtml+xml,'
                        'application/xml;q=0.9,*/*;q=0.8',
              'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
              'Accept-Encoding': 'none',
              'Accept-Language': 'en-US,en;q=0.8',
              'Connection': 'keep-alive'}
url = 'file:///path/to/myfile.html'
# Pass the whole dict as the headers argument (not nested under 'User-Agent')
req = request.Request(url, data=None, headers=user_agent)
page = request.urlopen(req).read().decode('utf-8', 'ignore')
parse = BeautifulSoup(page, "lxml").find_all("div", attrs={"class": "box-content"})
# Open the log file as UTF-8 so accented characters survive the round trip
with open('myfile.log', 'w', encoding='utf-8') as f:
    for i in parse:
        print(i.get_text())
        f.write(i.get_text())
Code:
python ./myfile.py
KAZAN
Arıza Tarihi: 23.06.2023 10:30:00
Tamir Tarihi: 23.06.2023 22:00:00
Detay: Kahramankazan Saray Mahallesi, Bayraktar Caddesi içerisinde bulunan Ø280 mm şebeke hattında meydana gelen arıza sebebi ile kesinti yapılmaktadır.
Etkilenen Yerler: Saray Mahallesi
ELMADAĞ
Arıza Tarihi: 23.06.2023 14:15:00
Tamir Tarihi: 23.06.2023 16:30:00
Detay: YENİPINAR MH HAYDAR MANAV CD İÇERİSİNDEKİ SU ARIZASINDAN DOLAYI 14:15 VE 16:30 SAATLERİ ARASI SU KESİNTİSİ YAPILACAKTİR.
Etkilenen Yerler: YENİPINAR VE YENİDOĞAN MH.
And the reason I did it that way: that site is being a little obnoxious with the SSL handshake. Let curl take care of it until you learn more about Python and SSL.
I specified "GÖLBAŞI" because that location is currently listed there. Just replace "GÖLBAŞI" with "SİNCAN" (since you specified UTF-8, pup will understand the accented characters).
tr: Ankara Su KesintiArıza Tarihi: 23.06.2023 10:30:00Tamir Tarihi: 23.06.2023 22:00:00Detay: Kahramankazan Saray Mahallesi, Bayraktar Caddesi içerisinde bulunan Ø280 mm şebeke hattında meydana gelen arıza sebebi ile kesinti yapılmaktadır.Etkilenen Yerler: Saray MahallesiArıza Tarihi: 23.06.2023 14:15:00Tamir Tarihi: 23.06.2023 16:30:00Detay: YENİPINAR MH HAYDAR MANAV CD İÇERİSİNDEKİ SU ARIZASINDAN DOLAYI 14:15 VE 16:30 SAATLERİ ARASI SU KESİNTİSİ YAPILACAKTİR.Etkilenen Yerler: YENİPINAR VE YENİDOĞAN MH.Genel Müdürlük Çalışma SaatleriSosyal MedyaUygulamalarımız
en: Ankara Water Outage Date of Failure: 23.06.2023 10:30:00 Date of Repair: 23.06.2023 22:00:00 Detail: Supply is being cut due to the fault that occurred on the Ø280 mm mains line in Kahramankazan Saray Mahallesi, Bayraktar Caddesi. Affected Places: Saray Mahallesi Date of Failure: 23.06.2023 14:15:00 Date of Repair: 23.06.2023 16:30:00 Detail: WATER WILL BE SHUT OFF BETWEEN 14:15 AND 16:30 DUE TO THE WATER FAILURE ON HAYDAR MANAV CD IN YENİPINAR MH. Affected Places: YENİPINAR AND YENİDOĞAN MH. Head Office Working Hours Social Media Our Applications
I used the "tr" command in my code too. But as a result, it gives all the content from the "<div class="box-content">" line to the end of the page.
So my "sed" trimming did not work, but my "tr" trimming did.
Can I say that the "pup" solution is better than "grep"?
Absolutely. pup is a specialized tool for parsing HTML.
BTW, add text{} at the end of your pup expression to get rid of the HTML tags.
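For illustration, a minimal sketch of that pup invocation (assumes pup is installed; the HTML snippet and selector target are made up, but the div class matches the one used in this thread):

Code:
```shell
# text{} strips the tags and prints only the text of the matched nodes
printf '<div class="box-content"><p>Saray Mahallesi</p></div>' |
  pup 'div.box-content text{}'
```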
Quote:
So my "sed" trimming did not work, but my "tr" trimming did.
sed is a line-oriented tool, but HTML is not a line-oriented format. In your code
Code:
sed -i 's/\n//; s/\r//g' temp.html
only the second substitution would work. The pattern space (usually a single line of text in sed) doesn't include the trailing newline character. You can make sed work across lines (e.g. using the N command), but that's not trivial.
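To show what that looks like in practice, here is the classic multi-line idiom (GNU sed assumed): the N command appends the next input line to the pattern space, so the embedded newlines finally become visible to s///. tr, being byte-oriented, never had this problem in the first place.

Code:
```shell
# Label 'a', append the next line (N), loop until the last line ($!ba),
# then delete the embedded newlines from the accumulated pattern space.
printf 'foo\nbar\nbaz\n' | sed ':a;N;$!ba;s/\n//g'
# → foobarbaz

# tr does the same with far less ceremony:
printf 'foo\nbar\nbaz\n' | tr -d '\n'
# → foobarbaz
```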