Looking for command or code to check if a website has updated content

ackerman57 · 04-03-2017, 10:18 PM

Greetings all,

I don't know where to start or which command to use, if any. Most search results suggest a browser extension for this task, but I'd rather not use one. If this can't be done in Linux, then I'll use those extensions. Thanks

mrmazda · 04-03-2017, 11:46 PM

Web pages are typically generated dynamically these days. For such pages, any check for new content that could conceivably work would always report a change.

ackerman57 · 04-04-2017, 12:32 AM

Another problem is ads. Ads change on a website and would trigger a false positive even when the actual content hasn't changed. It was worth a shot. Thanks

bathory · 04-04-2017, 12:46 AM

Quote:

Originally Posted by ackerman57

Greetings all,

I don't know where to start or which command to use, if any. Most search results suggest a browser extension for this task, but I'd rather not use one. If this can't be done in Linux, then I'll use those extensions. Thanks

You should take a look at the HEAD request

Regards

ackerman57 · 04-04-2017, 01:08 AM

Quote:

Originally Posted by bathory

You should take a look at the HEAD request

Regards

I don't know how to do that.

BTW folks, this is not a major thing I need. Don't fret too much over this :|

Jjanel · 04-04-2017, 01:15 AM

Maybe wget, together with the ETag header: http://wikipedia.org/wiki/HTTP_ETag
http://thp.io/2008/urlwatch (found via a web search for: linux check if a website has updated content)
http://stackoverflow.com/questions/2...s-last-updated
http://bhfsteve.blogspot.com/2013/03...ges-using.html
A simple `curl` bash script (yes, it won't work on 'dynamic' pages, which are probably the norm nowadays); replace prowl with any command of your choice: http://www.makingyouthink.com/2015/1...s-bash-script/

ackerman57 · 04-04-2017, 01:45 AM

Hi Jjanel

I will look at those links and see what happens. Thanks

Turbocapitalist · 04-04-2017, 01:58 AM

wget will do that. So would curl. The HTTP Request Header you want to use when forming your GET or HEAD request is the If-Modified-Since header. The HTTP Date has to conform to a specific format. (In my opinion they should have gone with a subset of ISO 8601.)

Code:

wget --header="If-Modified-Since: Tue, 04 Apr 2017 05:57:29 GMT" http://www.example.com/
wget --header="If-Modified-Since: $(date -u -d 'last week' +'%a, %d %b %Y %T GMT')" http://www.example.com/

The HEAD request, already mentioned, helps if you only need the metadata and not the object itself.

About ads: most ads these days are not embedded in the web page itself, but pulled in from a set of external, unvetted servers via JavaScript. In all likelihood, the JavaScript pulling in the ads, clean or tainted, will not change. However, dynamically generated pages might not have an accurate time stamp and might show only the current time and date even if the content hasn't changed for a long time. That is common with PHP sites as well as others.
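
For what it's worth, the same conditional request can be sketched with curl. This is only a rough sketch, assuming GNU date and curl are installed; the function names http_date, interpret_status and check_url are made up for illustration. A 304 reply means the server claims nothing changed since the given date. A 200 reply means either the page changed or the server ignores the header, which is likely on dynamically generated pages.

```shell
#!/usr/bin/env bash

# Format a Unix timestamp as an HTTP date.
# Assumes GNU date (for -d @N); LC_ALL=C keeps day/month names in English.
http_date() {
    LC_ALL=C date -u -d "@$1" +'%a, %d %b %Y %H:%M:%S GMT'
}

# Map the HTTP status of a conditional request to a verdict.
interpret_status() {
    case "$1" in
        304) echo "unchanged" ;;
        200) echo "changed-or-unknown" ;;
        *)   echo "error" ;;
    esac
}

# check_url URL EPOCH (hypothetical helper):
# send a conditional HEAD request and print the verdict.
check_url() {
    local url=$1 since status
    since=$(http_date "$2")
    status=$(curl -s -o /dev/null -I -w '%{http_code}' \
                  -H "If-Modified-Since: $since" "$url")
    interpret_status "$status"
}

# Example (needs network access):
#   check_url http://www.example.com/ "$(date -d 'last week' +%s)"
```

The verdict for 200 is deliberately vague, since a server that never sends 304 cannot be told apart from a page that really changed.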

bathory · 04-04-2017, 02:11 AM

Quote:

Originally Posted by ackerman57

I don't know how to do that.

BTW folks, this is not a major thing I need. Don't fret too much over this :|

Just FYI: https://2buntu.com/articles/1493/mon...-etag-headers/

ackerman57 · 04-04-2017, 03:15 AM

Quote:

Originally Posted by Turbocapitalist

wget will do that. So would curl. The HTTP Request Header you want to use when forming your GET or HEAD request is the If-Modified-Since header. The HTTP Date has to conform to a specific format. (In my opinion they should have gone with a subset of ISO 8601.)

Code:

wget --header="If-Modified-Since: Tue, 04 Apr 2017 05:57:29 GMT" http://www.example.com/
wget --header="If-Modified-Since: $(date -u -d 'last week' +'%a, %d %b %Y %T GMT')" http://www.example.com/

The HEAD request, already mentioned, helps if you only need the metadata and not the object itself.

About ads: most ads these days are not embedded in the web page itself, but pulled in from a set of external, unvetted servers via JavaScript. In all likelihood, the JavaScript pulling in the ads, clean or tainted, will not change. However, dynamically generated pages might not have an accurate time stamp and might show only the current time and date even if the content hasn't changed for a long time. That is common with PHP sites as well as others.

Quote:

Originally Posted by bathory

Just FYI: https://2buntu.com/articles/1493/mon...-etag-headers/

Code:

curl -I "http://magazine.odroid.com/" 

HTTP/1.1 200 OK
Date: Tue, 04 Apr 2017 08:11:21 GMT
Server: Apache/2.4.7 (Ubuntu) SVN/1.8.8 PHP/5.5.9-1ubuntu4.21
X-Powered-By: PHP/5.5.9-1ubuntu4.21
Link: <http://magazine.odroid.com/wp-json/>; rel="https://api.w.org/"
Content-Type: text/html; charset=UTF-8

There are no Last-Modified or ETag headers here.
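
A quick way to test any site for those two headers is to filter the `curl -I` output. The following is a minimal sketch, not a definitive check; list_validators is a made-up name, and it only looks at whatever headers are fed to it on standard input.

```shell
#!/usr/bin/env bash

# Read raw response headers on stdin and print any ETag or
# Last-Modified lines. Header names are case-insensitive, and
# curl's header output ends each line with a carriage return,
# so strip those first. Exits non-zero when neither is present.
list_validators() {
    tr -d '\r' | grep -iE '^(etag|last-modified):'
}

# Example (needs network access):
#   curl -sI "http://magazine.odroid.com/" | list_validators \
#       || echo "no validators, compare the page body instead"
```

If the function prints nothing, header-based checks like If-Modified-Since are unlikely to help for that site.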

Jjanel · 04-04-2017, 04:53 AM

FWIW, playing with wget http://magazine.odroid.com, I noticed that filtering out a single line was enough to make repeated downloads compare equal:
grep -v userSettings
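
Building on that filter, here is a rough sketch of a body-comparison check: hash the page with the noisy lines removed and compare against the hash saved on the previous run. The names page_fingerprint and page_changed and the state-file argument are made up for illustration, and it assumes curl and sha256sum (GNU coreutils) are available. The userSettings pattern only fits that one example site; other sites will need their own pattern.

```shell
#!/usr/bin/env bash

# Hash a page read from stdin, ignoring lines that match a
# noise pattern (default: the userSettings line noted above).
page_fingerprint() {
    local noise=${1:-userSettings}
    grep -v -- "$noise" | sha256sum | cut -d' ' -f1
}

# page_changed URL STATEFILE [NOISE] (hypothetical helper):
# returns 0 if the fingerprint differs from the saved one,
# 1 if it is the same, 2 if the download failed.
# The very first run always reports a change, since nothing is saved yet.
page_changed() {
    local url=$1 state=$2 body new old=
    body=$(curl -fsS "$url") || return 2
    new=$(printf '%s\n' "$body" | page_fingerprint "${3:-userSettings}")
    [ -f "$state" ] && old=$(<"$state")
    printf '%s\n' "$new" > "$state"
    [ "$new" != "$old" ]
}

# Example (needs network access):
#   page_changed http://magazine.odroid.com ~/.odroid.sha256 && echo "page changed"
```

Run from cron, this only reports a change when something outside the filtered lines differs, though rotating ads or timestamps embedded elsewhere in the page would still cause false positives.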

ackerman57 · 04-04-2017, 06:28 PM

Quote:

Originally Posted by Jjanel

FWIW, playing with wget http://magazine.odroid.com, I noticed that filtering out a single line was enough to make repeated downloads compare equal:
grep -v userSettings

http://magazine.odroid.com was only used as an example. The real site is something else, but it doesn't have Last-Modified or ETag headers either. I am going to try one of the links you gave me, using bash and diff.

If that doesn't do it, then I'll just keep checking the site every few days as usual.

I want to thank everyone here who replied with suggestions.