LinuxQuestions.org
Old 06-12-2021, 02:31 AM   #1
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Location: Apples
Distribution: Apple-selling shops, markets and direct marketing
Posts: 1,169
Blog Entries: 30

Rep: Reputation: 666
[Right and Rules] web scraping


Good morning

In a previous thread on a different topic, shruggy pointed out that one interpretation of your Digital Millennium Copyright Act (DMCA) (see there) could prevent us from using Web content in a way that we prefer over what the site owner had imagined – in the USA, that is.

Web-scraping, in contrast, is defined however it suits the person or organization publishing the definition – something between a way to do serious work and blatant net-abuse (my words).

Now it seems that ... I do something like that. What is your opinion?

It goes like this: I am quite worked up about the ill-conceived Web-sites of my favorite radio-stations – not the only topic I post on LQ, but I am approaching saturation.
As a means to avoid their “friends” at diverse profiling- and tracking-companies, I have automated the retrieval of their RSS data (XML). By doing so, I have the direct URLs to their recent and even quite old broadcasts and can download them any time, serenely, without consulting the infested Web-site.

When I *asked* the site-operators whether they could simply publish a flat URL for each RSS feed alongside their existing list of broadcasts, none of the subsidiaries of Radio France ever cared to respond.
So I wrote a Web-bot which assembles these data; it only had to run once to establish the complete list.
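For readers who want to picture the RSS step, here is a minimal sketch in Python, not the scripts described in this post. It assumes the feed is ordinary RSS 2.0 and that each broadcast's audio file is carried in an `<enclosure url="...">` element; the sample feed, its URLs and the function name `broadcast_urls` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be downloaded first.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example programme</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.org/audio/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 2 (no audio yet)</title>
    </item>
  </channel>
</rss>"""


def broadcast_urls(rss_text):
    """Return (title, url) pairs for every item that has an enclosure URL."""
    root = ET.fromstring(rss_text)
    result = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        # Items without an enclosure (or without a url attribute) are skipped.
        if enclosure is not None and enclosure.get("url"):
            result.append((item.findtext("title", default=""), enclosure.get("url")))
    return result


for title, url in broadcast_urls(SAMPLE_RSS):
    print(title, url)
```

Once the direct URLs are known, the Web-site itself never has to be opened again for a download.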

So far, I only use the data which is already present on the same page – even if it only appears after a click on a button which loads the URL for an RSS-stream dynamically... what rubbish! (But I can work with it).

But now I have noticed that there are many broadcasts that I did not know existed but which look interesting. The short description is missing from the list, so my bot has to open a second browser-tab for each single broadcast, get the description and close the tab before proceeding to the next item. <=== THIS IS IT.
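To make that two-page step concrete, here is a rough Python sketch under loud assumptions: it pretends the description sits in a `<meta name="description">` tag on each broadcast's own page (the real pages may well differ), and it takes the page-fetching function as a parameter so that the example runs without touching any site. All names and URLs are hypothetical.

```python
from html.parser import HTMLParser


class MetaDescription(HTMLParser):
    """Remember the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta" and attributes.get("name") == "description"
                and self.description is None):
            self.description = attributes.get("content")


def describe(html_text):
    parser = MetaDescription()
    parser.feed(html_text)
    return parser.description or ""


def build_list(broadcasts, fetch_page):
    """Combine (title, detail_url) pairs with the description from each detail page.

    fetch_page is any callable mapping a URL to HTML text.
    """
    return [
        {"title": title, "url": url, "description": describe(fetch_page(url))}
        for title, url in broadcasts
    ]


# Stand-in for the second page of each broadcast (hypothetical content).
FAKE_PAGES = {
    "https://example.org/show/1": '<html><head><meta name="description" content="Weekly jazz hour"/></head></html>',
    "https://example.org/show/2": "<html><head><title>No description</title></head></html>",
}

combined = build_list(
    [("Jazz", "https://example.org/show/1"), ("Mystery", "https://example.org/show/2")],
    FAKE_PAGES.__getitem__,
)
print(combined)
```

The point of the sketch is only that one request per broadcast is made and its result is merged into a single list.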

Would you say that this last step, where I combine information from two different pages to produce one list for private use, is Web-scraping, and if so, what kind of Web-scraping... some evil terrorism thing or just “mastering the tools of the Web”?

The legislation in Germany is similar in that it is not clear what is “allowed”, “forbidden” or just “impertinent”. I somehow do not feel inclined to ask about the French law. The authorities here have declared themselves not competent in the domains that had been assigned to them. It would be awfully difficult to avoid pitfalls and misunderstandings.

PS.: I forgot: here is the page where I describe the RSS munging; the scripts are commented in English. And here is one of those lists that my Web-bot created. There is no hyperlink in this file; each number identifies an RSS feed.

Last edited by Michael Uplawski; 06-12-2021 at 02:50 AM. Reason: words, punctuation and stuff.
 
Old 06-12-2021, 04:16 AM   #2
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 17,413
Blog Entries: 10

Rep: Reputation: 5225
Quote:
Originally Posted by Michael Uplawski View Post
one interpretation of your Digital Millennium Copyright Act (DMCA) (see there) could prevent us from using Web content in a way that we prefer over what the site owner had imagined – in the USA, that is.

Web-scraping, in contrast, is defined however it suits the person or organization publishing the definition – something between a way to do serious work and blatant net-abuse (my words).
I like how the youtube-dl-gate ended; essentially, in a legally binding agreement, they were allowed to continue because youtube-dl works like a different kind of web browser and doesn't abuse anything. It's not illegal to use youtube-dl to scrape youtube pages for links to watch videos.
Can't find the document/article right now, but that was it in layman's terms.

Quote:
Originally Posted by Michael Uplawski View Post
I think that page is broken.
Quote:
Originally Posted by Michael Uplawski View Post
And here is one of those lists that my Web-bot created. There is no hyperlink in this file, the numbers identify a RSS each.
Even so, I wouldn't publish it. It's one thing to use the software, and another to publish its results.
 
Old 06-12-2021, 05:56 AM   #3
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Location: Apples
Distribution: Apple-selling shops, markets and direct marketing
Posts: 1,169

Original Poster
Blog Entries: 30

Rep: Reputation: 666
Quote:
Originally Posted by ondoho View Post
I think that page is broken.
Thanks for your remarks. In which way is the page broken? I just checked and cannot see anything odd, apart from my own choices of styles and stuff.
 
Old 06-13-2021, 02:35 AM   #4
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 17,413
Blog Entries: 10

Rep: Reputation: 5225
On FF-ESR with all addons disabled.
It works fine on a fresh profile though; I'm guessing some of my mild about:config tweaks are responsible. That's an indicator that your site simply uses too much weird stuff. My tweaks aren't that intrusive, sites don't break like that usually.
HTH.

Last edited by ondoho; 06-13-2021 at 02:46 AM.
 
Old 06-13-2021, 06:26 AM   #5
Trihexagonal
Member
 
Registered: Jul 2017
Location: Parts Unknown
Distribution: FreeBSD, Kali
Posts: 237

Rep: Reputation: 283
Quote:
Originally Posted by Michael Uplawski View Post
Now it seems that ... I do something like that. What is your opinion?
Well, Michael, I wouldn't use my real name to publish it if I were you. I didn't bother to look because I have no compulsion to do so, but who needs to hunt you down, go through the process of contacting your webhost with boring complaints to the abuse office box, which hasn't been checked in so long there isn't room in the inbox for one more letter, or dox you when you've almost certainly doxed yourself?

This post saved me from having to make a thread, so I will be more than happy to voice my opinion on it here and explain why I gave you that friendly advice.

About a month ago while practicing my google-fu, I discovered that my Tutorial as posted in the FreeBSD forums, Beginners Guide - How To Set Up A FreeBSD Desktop From Scratch, was scraped by a bot and now appears on another person's site as their original material.

Now, it's not like the authorship of it is in dispute. It was featured by freebsdnews.com in one article posted under my bot's name Siseneg and that article was linked to by the English and Arabic Language Facebook pages of bsdmag.org. Then again in another article after I posted it in the FreeBSD forums under my name.

And I doubt anyone will ever forget the thrill of seeing my performance in the one year tour of shameless self-promotion I went on to promote it that had you all on the edge of your seats, and got it to #1 Google ranking.

So I wasn't all that concerned about it and left a message in the comment box in a nice way to get my original material off his server like the hounds of hell were snapping at his backside. I left my name, site URL and email address but the only response I got was an addition to the top of his page that read:

"similar. You are not right. assured.."

Let's just say, that was the wrong thing to say.

Posting my original material on his site was one thing. I'm not even interested in what the DMCA says about it. Adding "Trihexagonal said" to what was obviously bot scrapings of my post, or adding FreeBSD Desktop to my Title in the HTML, without any attempt to change anything that would point right to the FreeBSD forums as my post, and on top of that including a post I made in the comments section to answer a question after I had posted the tutorial, was sloppy work at best and shows a severe lack of good judgement on his part.

This is a post I made at our forums under my Tutorial post that he included in this bot booboo:

Quote:
You're asking me about new hardware when all of my machines are Win7 vintage or older. I've lived in a large apartment complex the last 10 years and have only used wi-fi to the extent of enabling my card so I could use kismet.
He did not define himself as not being the author in this part of his plagiarism. There's that word again... I don't know if Facebook is as touchy about plagiarism of their material as me, but I provided a screenshot of my material where he presents it as his own and of the page titled "Facebook logout".

In case you missed it, because I did not, in the lower left corner you'll see where he took Credit for the Design of the Wordpress thingy he uses. After having had no luck in getting anyone to respond to my abuse complaint or getting him to take my requests seriously, much less take me seriously, I was left with little option but the use of white belt level google-fu... And behold...

In five minutes I knew everything there was to know about him, and it's him alright. I emailed him from my trihexagonal.org box with complete summary including personal photographs and his github link where he tries to present himself as a trustworthy webdesigner.

Alas, to no avail... It has been two weeks now since I contacted him and I have received no response. But that's alright. He must not be one to research the people he steals from and he has yet to really get to know me. We're just starting out what will be as long a relationship as he would like. In fact, much longer than he will like.

I don't care if you run a bot, and the prospect of a bot army is something that has a certain appeal to me as a botmaster. But I think you can appreciate where I'm coming from. Maybe Facebook has the resources to put behind it to get his attention in taking their stolen material down.

I'll bring it to their attention as one of the nicer ways he will be getting to know me better.
Attached Thumbnails: Damned_if_you_do.jpg (242.2 KB), Damned_if_you_don't.jpg (147.1 KB)

Last edited by Trihexagonal; 06-13-2021 at 06:32 AM.
 
Old 06-13-2021, 06:51 AM   #6
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, RPi OS, Mint & Android
Posts: 12,999

Rep: Reputation: 1714
It seems that once you don't live in 'the land of the free (etc.)' you are free to do a fair bit, until America invades.

I'm glad that it's only you guys that have to worry about things. But an interesting defence might be to point out that Google and by extension youtube etc. etc. should be in fact considered Irish companies, because they are based here for tax purposes. It's a little rich for them to claim US residency when claiming copyright infringement but claim Irish residency when paying taxes.
 
Old 06-13-2021, 03:34 PM   #7
Trihexagonal
Member
 
Registered: Jul 2017
Location: Parts Unknown
Distribution: FreeBSD, Kali
Posts: 237

Rep: Reputation: 283
Quote:
Originally Posted by Michael Uplawski View Post
Thanks for your remarks. In which way the page is broken; I just checked and cannot see anything odd, apart from my own choices of styles and stuff.
I liked your use of XML and hadn't looked until just now. I didn't see anything on your RSS feeds that would make me worry even if you were to publish them.

I was, however, curious why you would add these comments to the XML Markup of your bot:

Code:
#!/bin/bash
# 2020-2020 Michael Uplawski <michael.uplawski@uplawski.eu>
# Use this script at your own risk, modify it as you please.
# But maybe leave the copyright-notice intact. Thank You.
I'm just kidding.


Quote:
Originally Posted by business_kid View Post
It seems that once you don't live in 'the land of the free (etc.)' you are free to do a fair bit, until America invades.
You lost me there.

When did we darken your doorstep and enslave your people? What day is it?

I live in the US. The FreeBSD forums are hosted in the US. My website is hosted in Sofia, Bulgaria. That just so happens to be where this guy lives and he's Russian.

My good friend in the Ukraine speaks fluent Russian but he is on a Spiritual Journey of Enlightenment and my Malevolence would bring him down. (That's a new attribute I was recently credited with, I'm just trying it out.)

The Shores of Tripoli are Over there, Over there, I'm past the Outer Limits, there's a signpost up ahead that says the Twilight Zone, I walked through the Halls of Karma, am on the Highway to Hell and I Can't Drive 55.


Quote:
Originally Posted by business_kid View Post
I'm glad that it's only you guys that have to worry about things.
What? Me worry?


Quote:
Originally Posted by business_kid View Post
But an interesting defence might be to point out that Google and by extension youtube etc. etc. should be in fact considered Irish companies, because they are based here for tax purposes. It's a little rich for them to claim US residency when claiming copyright infringement but claim Irish residency when paying taxes.
What? No tax shelters?
 
Old 06-13-2021, 04:27 PM   #8
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Location: Apples
Distribution: Apple-selling shops, markets and direct marketing
Posts: 1,169

Original Poster
Blog Entries: 30

Rep: Reputation: 666
Quote:
Originally Posted by ondoho View Post
On FF-ESR with all addons disabled.
It works fine on a fresh profile though; I'm guessing some of my mild about:config tweaks are responsible. That's an indicator that your site simply uses too much weird stuff. My tweaks aren't that intrusive, sites don't break like that usually.
HTH.
HTML + CSS. Tested with a current version of HTML(5)-Tidy. It validates as XHTML 1.0 Strict. This is a mystery that I do not have the tools to clear up.

Sorry.

Last edited by Michael Uplawski; 06-13-2021 at 04:48 PM. Reason: Strict – Not Transotional
 
Old 06-13-2021, 04:38 PM   #9
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Location: Apples
Distribution: Apple-selling shops, markets and direct marketing
Posts: 1,169

Original Poster
Blog Entries: 30

Rep: Reputation: 666
Quote:
Originally Posted by Trihexagonal View Post
I was, however, curious why you would add these comments to the XML Markup of your bot:
I have not yet published anything about the Web-Bot. The comment you quoted is from a shell-script that fetches individual RSS-files. The only advantage of the shell-script and XSLT is the automation of the process which produces a nice HTML-file from the RSS.
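As an illustration of the RSS-to-HTML step only, here is a small Python stand-in. It is not the shell-script or the XSLT mentioned above; it assumes a plain RSS 2.0 feed with `<title>` and `<link>` in each `<item>`, and the sample feed and the function name `rss_to_html` are invented for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical feed; note the escaped ampersand in the channel title.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>News &amp; Notes</title>
    <item>
      <title>First broadcast</title>
      <link>https://example.org/b/1</link>
    </item>
  </channel>
</rss>"""


def rss_to_html(rss_text):
    """Render the channel title and one list entry per item as an HTML fragment."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        raise ValueError("no <channel> element found")
    heading = html.escape(channel.findtext("title", default="Untitled"))
    rows = []
    for item in channel.findall("item"):
        name = html.escape(item.findtext("title", default=""))
        link = html.escape(item.findtext("link", default=""), quote=True)
        rows.append(f'<li><a href="{link}">{name}</a></li>')
    return f"<h1>{heading}</h1>\n<ul>\n" + "\n".join(rows) + "\n</ul>"


print(rss_to_html(SAMPLE_FEED))
```

Escaping every value before it goes into the markup keeps the generated page well-formed even when titles contain `&` or `<`.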

The Web-bot is there to get information *About* available RSS-files. As I cannot currently understand the consequences of a publication, I do not explain the bot-part of my procedure.
 
Old 06-14-2021, 04:51 AM   #10
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, RPi OS, Mint & Android
Posts: 12,999

Rep: Reputation: 1714
Quote:
Originally Posted by TriHexagonal
What? No tax shelters?
Serious points aside, I'm just wondering what geometric shape you had in mind when choosing your handle.

Anyhow, on topic, Ireland was a tax shelter until last week. Now it's a minimum of 15% (up 2.5%), and the tax is paid where the money is earned, not necessarily in the tax haven. So no tax shelters.

There's a lot that attracted big tech here. The climate meant they saved a lot on air conditioning. The presence of a large high-tech workforce meant available staff, and our EU membership was important also.

There's other things attracting the big tech companies here: when it comes to the internet, we have big pipes feeding to the Excited States, Europe, & everywhere. Some of this stuff was laid down, and the announcement was made that it future-proofed us for 10 years. At the end of 10 years, we could change the boxes on each end, and we'd be good for another while. The next Trans Atlantic pipes going down (being laid currently, or just finished) had several glass fibre cables in parallel. These were all bound into some massive specially made cable, sealed with a typical cover for heavy cables to allow it to pass over a rough underwater bottom surface instead of having to find a smooth passage like the first one did.

The first glass fibre laying job had an interesting side. A few techies made a fortune in the 1990s doing piece work on the glass fibre reel-to-reel junctions (which are anything but trivial). They had a little over 5 minutes to get two reels joined before the reel had to get rolled out. If they couldn't keep up, the ship would have run out of cable, because it couldn't stop. I presume they had a cushion of a few reels, but it gives anyone who has joined glass fibre end to end an idea of the difficulty.
 
Old 06-14-2021, 02:55 PM   #11
Trihexagonal
Member
 
Registered: Jul 2017
Location: Parts Unknown
Distribution: FreeBSD, Kali
Posts: 237

Rep: Reputation: 283
Quote:
Originally Posted by business_kid View Post
Quote:
Originally Posted by TriHexagonal
What? No tax shelters?
Serious points aside, I'm just wondering what geometric shape you had in mind when choosing your handle.

The same one that represents it in my Profile here.

Just because ondoho taught you what FTFY meant, business_kid, doesn't mean you should go off on a trihexagonal tangent fixing my name when you don't understand the Geometry of it:

Quote:
Trihexagonal tiling

In geometry, the trihexagonal tiling is one of 11 uniform tilings of the Euclidean plane by regular polygons. It consists of equilateral triangles and regular hexagons, arranged so that each hexagon is surrounded by triangles and vice versa. The name derives from the fact that it combines a regular hexagonal tiling and a regular triangular tiling. Two hexagons and two triangles alternate around each vertex, and its edges form an infinite arrangement of lines. Its dual is the rhombille tiling.

This pattern, and its place in the classification of uniform tilings, was already known to Johannes Kepler in his 1619 book Harmonices Mundi. The pattern has long been used in Japanese basketry, where it is called kagome. The Japanese term for this pattern has been taken up in physics, where it is called a Kagome lattice. It occurs also in the crystal structures of certain minerals. Conway calls it a hexadeltille, combining alternate elements from a hexagonal tiling (hextille) and triangular tiling (deltille).

https://en.wikipedia.org/wiki/Trihexagonal_tiling

This one:
Attached Thumbnails: trihex.png (27.5 KB)

Last edited by Trihexagonal; 06-14-2021 at 02:56 PM.
 
Old 06-15-2021, 05:50 AM   #12
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, RPi OS, Mint & Android
Posts: 12,999

Rep: Reputation: 1714
To drag this interesting and free ranging thread back on topic for one brief period, I have some servers here within my geo-ip compass, and residents of the UK have even more within theirs. When it comes to scraping, I don't even know if geo-ip matters. For Slackware, Alien Bob builds BeautifulSoup, which I have installed.

Given that much, is there a good doc on scraping such sites to harvest interesting program urls? Here, I don't have to worry about the NSA, the FBI, or other forms of Big Brother caring if I scrape a few urls. And 99.9999% of computers use M$ windows anyhow. The powers that be have other priorities beside monitoring downloads or torrents, for example:
  • International Police Cooperation (You scratch my back, etc.)
  • Ireland's International Criminal Fraternity is actively tracked worldwide. One of the top guys in Ireland's most profitable and biggest gang, his wife & daughter pleaded guilty yesterday to money laundering massive amounts, properties were seized at home and abroad as the 'proceeds of crime.' That was mainly online sleuthing.
  • Certain communities are continuously watched to prevent terrorist acts. There is success here. Many of these guys can just be deported if they start making waves.
  • Others (e.g. militant anti-lockdown protesters) have their plans thwarted through IT surveillance. Ireland has a small percentage of bullies who seem to join any protest and revel in violence. They don't get much mercy from police, who at this stage know them on sight. But the police have lurkers or informants in all the right places.
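On the question above about harvesting programme URLs, a minimal sketch may serve as a starting point. It uses only Python's standard library rather than BeautifulSoup, so it runs without extra packages; BeautifulSoup's `find_all("a", href=True)` would collect the same anchors. The page content, the base URL and the `/rss/` filter are assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def harvest(html_text, base_url, keyword):
    """Return absolute link URLs from html_text that contain keyword."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return [url for url in collector.links if keyword in url]


# Hypothetical programme page.
PAGE = (
    '<a href="/rss/prog1.xml">P1</a>'
    '<a href="about.html">About</a>'
    '<a href="https://example.org/rss/prog2.xml">P2</a>'
)

print(harvest(PAGE, "https://example.org/shows/", "/rss/"))
```

Fetching the page itself is left out on purpose; how often and how politely to do that is the part this thread is actually about.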
 
Old 06-15-2021, 06:04 AM   #13
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Location: Apples
Distribution: Apple-selling shops, markets and direct marketing
Posts: 1,169

Original Poster
Blog Entries: 30

Rep: Reputation: 666
I may have expressed myself poorly.
 
Old 06-15-2021, 06:08 AM   #14
Trihexagonal
Member
 
Registered: Jul 2017
Location: Parts Unknown
Distribution: FreeBSD, Kali
Posts: 237

Rep: Reputation: 283
Quote:
Originally Posted by Michael Uplawski View Post
HTML + CSS. Tested with a current version of HTML(5)-Tidy. It validates as XHTML 1.0 Strict. This is a mystery that I do not have the tools to clear up.
I always use the W3C validator to check my pages before I upload them.

Your page renders fine for me in Firefox 78.9.0esr. It shows 5 errors on yours when I run it through the Validator.

https://validator.w3.org/check?uri=h...ne=1&verbose=1


I don't see it right now, but it can be hard to track down even if you have it show the source. Earlier today, when I updated Demonica's site, I had to track down one that was not apparent at all in the errors shown.


Your bot's just a spider that goes out and fetches you some data. I've had them before and recently commented on gallery-dl that's available in the FreeBSD ports tree:

Quote:
gallery-dl is a command-line program to download image-galleries and
-collections from several image hosting sites. It is a cross-platform
tool with many configuration options and powerful filenaming
capabilities.

https://github.com/mikf/gallery-dl
I would say that's more invasive. I wouldn't worry too much about it myself. But that's me.


I was thinking more along the lines of Twitter or Facebook bots that make posts I could program and send out as my Surrogates. I can run 7 laptops of my own online at once. I have the Keywords part down and would have to come up with Categories to suit the purpose, but 7 of me would be an army of bots.
 
Old 06-15-2021, 06:23 AM   #15
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Location: Apples
Distribution: Apple-selling shops, markets and direct marketing
Posts: 1,169

Original Poster
Blog Entries: 30

Rep: Reputation: 666
Quote:
Originally Posted by Trihexagonal View Post
I always use the W3C validator to check my pages before I upload them.

Your page renders fine for me in Firefox 78.9.0esr. It shows 5 errors on yours when I run it through the Validator.
Thank you. There were, indeed, 2 errors; the others were inherited.
But the validator reports rubbish, anyway. What was missing is not an end-tag but quite simply /*]]>*/ *before* the </style> end-tag. This cleaned away 4 so-called errors at once.

The other was the fact that I placed an <img/> directly in the body-section without an enclosing block-element. 1 real error and as easy to commit as to detect... beats me.
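The effect of the missing /*]]>*/ can be reproduced with any XML parser. The following Python sketch is an illustration, not the page in question: the XHTML snippet is invented, and the check is for well-formedness only, which is weaker than validation against the XHTML 1.0 Strict DTD (so the <img/> placement error would not be caught this way).

```python
import xml.etree.ElementTree as ET

# Invented XHTML snippet with a correctly closed CDATA section inside <style>.
GOOD = """<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>
<style type="text/css">/*<![CDATA[*/
p:before { content: "R&D <3"; }
/*]]>*/</style></head><body><p>x</p></body></html>"""

# Same document with the closing marker removed: the CDATA section never ends,
# so the parser swallows </style> and everything after it.
BAD = GOOD.replace("/*]]>*/", "")


def is_well_formed(document):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(GOOD), is_well_formed(BAD))
```

Because the unclosed section hides the following end-tags from the parser, a single missing marker can plausibly show up as several separate errors.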

Cheerio.
 
  



Tags
civility, custom, html, legality, webdriver

