11-08-2014, 01:56 PM   #1
alabit (LQ Newbie; Registered: Nov 2014; Location: Hoffman Estates, IL; Distribution: Debian; Posts: 27)
Proxy to archive single static HTML pages


Is there a way to cache and permanently store single URL pages instead of entire web sites?

For example, I frequently reference, via my bookmarks, a lot of mostly static, old and unmaintained HTML pages, and I wonder how much longer some of them will be available online. The volume is far too large to archive locally by hand, so if there is an app for this, please let me know!

I do not care about forum pages or other dynamic content; I recognize the difference. I care about pages that people posted and no longer maintain, which may disappear at any time, as many in fact have over the years.

I have heard of Squid, but that is about all, and I suspect Squid is not the best tool for this job, since it pulls in everything from a given URL, and I only need one static HTML page out of that entire web.

Alternatively, feel free to share how you deal with this issue, even if you use some Windows software. (I do not use or have access to a Mac or any 'smart' device or app, just plain old Linux and Windows desktops.) I have Linux servers and storage to spare, but I need a solution.
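
Just to illustrate what I mean, something along these lines is the sort of thing I am imagining; this is only a rough sketch, and bookmarks.txt and the archive/ directory are made-up names:

Code:
#!/usr/bin/env python3
"""Rough sketch: save one static copy of each bookmarked URL.

Assumes a plain text file 'bookmarks.txt' with one URL per line
(both the file name and the 'archive' directory are made up here).
"""
import os
import re
import urllib.request

BOOKMARKS = "bookmarks.txt"   # one URL per line (hypothetical file)
ARCHIVE_DIR = "archive"       # where the saved pages go

os.makedirs(ARCHIVE_DIR, exist_ok=True)

with open(BOOKMARKS) as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # Turn the URL into a safe, flat filename, e.g.
    # http://example.com/page.html -> example.com_page.html
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", url.split("://", 1)[-1])
    target = os.path.join(ARCHIVE_DIR, name or "index.html")
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = resp.read()
        with open(target, "wb") as out:
            out.write(data)
        print("saved", url, "->", target)
    except Exception as exc:  # keep going if one page is already gone
        print("failed", url, ":", exc)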

Thanks for reading!
 
11-10-2014, 01:14 AM   #2
astrogeek (Moderator; Registered: Oct 2008; Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}; Posts: 6,186; Blog Entries: 24)
Quote:
Originally Posted by alabit
Is there a way to cache and permanently store single URL pages instead of entire web sites?

For example, I frequently reference, via my bookmarks, a lot of mostly static, old and unmaintained HTML pages, and I wonder how much longer some of them will be available online. The volume is far too large to archive locally by hand, so if there is an app for this, please let me know!

...

Alternatively, feel free to share how you deal with this issue, even if you use some Windows software. (I do not use or have access to a Mac or any 'smart' device or app, just plain old Linux and Windows desktops.) I have Linux servers and storage to spare, but I need a solution.
While I do not have a "solution" to offer, I can tell you how I have dealt with this very successfully for a number of years. I share your concerns; in fact, compared to 10+ years ago, the internet is becoming a desert wasteland, with much useful information disappearing or becoming diluted daily. (We have Google to thank for that in large part... IMO.)

I always make it a point to grab a local copy of information I consider useful, whether whole web pages, groups of pages, or copy/paste selections. Originally I kept my electronic information organized by files and directory structures, but this had become a real problem for me by the late 1990s.

At that time I dedicated a machine to be my "online library" and set up a LAMP stack (originally on Mandrake 7) for the purpose. It started very simply, with mostly manual entry and organization, has been refined as needed since then, and has been in continuous use ever since.

The MOST important aspect of this is not "what program" to use, but simply to give some good thought to how you can use the tools you already have: your GNU/Linux machine, filesystems, editors, web servers, various scripting languages, databases... and your particular knowledge and experience. You can fairly easily come up with some mix of those resources and your own organizing principles that will bring it under initial control.

I doubt that you will find an effective method that does not include a fair amount of human processing, however. I pre-process anything that comes from the web before adding it to my own library: remove the ever-present Google Analytics scripts, ad scripts, and irrelevant content. I often generate my own index pages as well. But knowledge is more valuable than gold, so be selective but effective; it is all to the good!
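
As an illustration of that kind of pre-processing (a rough sketch of the idea only, not the script I actually use; a naive regex pass like this is crude, but it is usually good enough for old static pages):

Code:
#!/usr/bin/env python3
"""Sketch only: strip <script> blocks and similar clutter from a saved
page before filing it in the library. A regex pass is crude, but it
usually suffices for plain static HTML."""
import re
import sys

def clean_html(html):
    # Drop <script>...</script> blocks (analytics, ads, etc.).
    html = re.sub(r"(?is)<script\b.*?</script\s*>", "", html)
    # Drop <noscript> wrappers and embedded iframes as well.
    html = re.sub(r"(?is)<noscript\b.*?</noscript\s*>", "", html)
    html = re.sub(r"(?is)<iframe\b.*?</iframe\s*>", "", html)
    return html

if __name__ == "__main__":
    # Usage: clean_page.py saved_page.html > cleaned_page.html
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        sys.stdout.write(clean_html(f.read()))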

Don't try to grab everything, but make an effort to differentiate, and grab everything that is of value to you and would be considered a loss if you could not find it again. And organize as simply as possible, with attention to accessibility and durability.

For example, I originally organized mine under a top-level directory into numbered archive directories, with subdirectories for each major item or grouping. I kept each numbered archive below what would fit on a CD and wrote a script to generate a flat HTML index at the top of each archive. This made backups easy, and made them easily accessible and self-contained. I now grow the archives to DVD size and also rsync them to live backup machines, but the same basic organization has been in place for many years.
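
The index script itself need not be anything fancy; a sketch along these lines would do (the directory layout here is invented for illustration, not my actual layout):

Code:
#!/usr/bin/env python3
"""Sketch of a flat index generator: walk one numbered archive
directory and write an index.html at its top level linking to every
saved page. The layout is invented for illustration."""
import os
import sys

def build_index(archive_root):
    rows = []
    for dirpath, _dirnames, filenames in os.walk(archive_root):
        for name in sorted(filenames):
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.relpath(os.path.join(dirpath, name), archive_root)
            if path == "index.html":
                continue  # do not index the index itself
            rows.append('<li><a href="%s">%s</a></li>' % (path, path))
    page = ("<html><head><title>Archive index</title></head><body>\n"
            "<h1>%s</h1>\n<ul>\n%s\n</ul>\n</body></html>\n"
            % (os.path.basename(os.path.abspath(archive_root)), "\n".join(rows)))
    with open(os.path.join(archive_root, "index.html"), "w") as f:
        f.write(page)

if __name__ == "__main__":
    # Usage: make_index.py /path/to/archive01
    build_index(sys.argv[1] if len(sys.argv) > 1 else ".")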

So think about what you want to save, how you want to access it, and how valuable it is to you, and plan accordingly! Your GNU/Linux system provides all the tools you need. Good luck!

Last edited by astrogeek; 11-10-2014 at 01:17 AM.
 
11-11-2014, 08:26 AM   #3
ondoho (LQ Addict; Registered: Dec 2013; Posts: 19,872; Blog Entries: 12)
Obviously, you can just store the page locally; many browsers have an option to save the "complete webpage", with images and everything.

Also have a look at https://archive.org/web/

The problem ultimately becomes unsolvable when stored pages link to other pages that are themselves no longer available.
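
As a small sketch (using archive.org's Wayback Machine availability API at https://archive.org/wayback/available; bookmarks.txt is a made-up file name), you can at least check whether a bookmarked page already has a snapshot there:

Code:
#!/usr/bin/env python3
"""Sketch: check bookmarked URLs against the Wayback Machine
availability API to see whether a snapshot already exists.
'bookmarks.txt' is a hypothetical file with one URL per line."""
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available?url="

with open("bookmarks.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    with urllib.request.urlopen(API + urllib.parse.quote(url, safe="")) as resp:
        info = json.load(resp)
    snap = info.get("archived_snapshots", {}).get("closest")
    if snap:
        print(url, "->", snap.get("url"))
    else:
        print(url, "-> no snapshot found")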
 
  

