Quote:
Originally Posted by alabit
Is there a way to cache and permanently store single URL pages instead of entire web sites?
As for example, I frequently reference via my bookmarks a lot of mostly static and old/unmaintained HTML pages about stuff, and I wonder just how much longer some of that is going to be available online. Obviously the volume is so huge I cannot possibly archive locally by hand, so if there is an app for this please let me know!
...
Or, if you want to share how you deal with this issue - even if you use some Windows software. (I do not use or have access to a Mac or any 'smart' thingy or app, just to plain old Linux and Windows desktops). I have Linux servers and storage to spare, but I need a solution.
While I do not have a "solution" to offer, I can tell you how I have dealt with this very successfully for a number of years. I share your concerns; in fact, compared to 10+ years ago, the internet is becoming a desert wasteland, with much useful information disappearing or becoming diluted daily. (We have Google to thank for that in large part... IMO).
I always make it a point to grab a local copy of information I consider useful, whether whole web pages, groups of pages, or copy/paste selections. Originally I kept my electronic information organized by files and directory structures, but this had become a real problem for me by the late 1990s.
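For a single, mostly static page, even a short standard-library script will do the grabbing. Here is a minimal sketch of that kind of grab (Python 3, standard library only; the ~/library path and the date-in-the-filename naming are just assumptions for illustration, not my actual setup):

Code:
#!/usr/bin/env python3
"""Minimal sketch: save a single URL as a local, dated HTML file.
Standard library only; the ~/library layout is an assumption, not a fixed rule."""

import sys
import urllib.request
from datetime import date
from pathlib import Path
from urllib.parse import urlparse

ARCHIVE_ROOT = Path.home() / "library"   # hypothetical top-level library directory

def save_page(url: str) -> Path:
    """Fetch one page and file it under ARCHIVE_ROOT/<host>/<date>-<name>."""
    parsed = urlparse(url)
    stem = parsed.path.strip("/").replace("/", "_") or "index"
    if not stem.endswith((".html", ".htm")):
        stem += ".html"
    dest_dir = ARCHIVE_ROOT / parsed.netloc
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{date.today().isoformat()}-{stem}"

    # Some sites refuse the default Python user agent, so send a browser-ish one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        dest.write_bytes(resp.read())
    return dest

if __name__ == "__main__":
    for page_url in sys.argv[1:]:
        print("saved", save_page(page_url))

For whole page trees rather than single pages, wget's --page-requisites and --convert-links options will also pull in images and fix up the links for offline reading.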
At that time I dedicated a machine to be my "online library" and set up a LAMP stack (originally on Mandrake 7) for that purpose. It started very simply, with mostly manual entry and organization, has been refined as needed since then, and has been in continuous use ever since.
The MOST important aspect of this is not "what program" to use, but simply to give some good thought to how you can use the tools you already have: your GNU/Linux machine, filesystems, editors, web servers, various scripting languages, databases... and your particular knowledge and experience. You can fairly easily come up with some mix of those resources and your own organizing principles that will get it under initial control.
I doubt that you will find an effective method that does not include a fair amount of human processing, however. I pre-process anything that comes from the web before adding it to my own library: remove the ever-present Google Analytics scripts, ad scripts, and irrelevant content. I often generate my own index pages as well. But knowledge is more valuable than gold, so be selective but effective; it is all for the good!
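To give an idea of what that pre-processing looks like, here is a minimal sketch (a crude regex pass, not a real HTML sanitizer; the ".orig" backup convention is just an illustrative assumption). It strips <script> and <iframe> blocks from a saved page, which takes care of most analytics and ad loaders; the rest I still trim by hand in an editor:

Code:
#!/usr/bin/env python3
"""Minimal sketch: strip <script> and <iframe> blocks from a saved HTML page.
A crude regex pass, not a general HTML sanitizer -- adequate for cleaning pages
going into a personal library, where the result gets eyeballed anyway."""

import re
import sys
from pathlib import Path

# Whole <script>...</script> elements: analytics, ad loaders, social widgets.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# <iframe> ad/tracker frames; adjust or extend to taste.
IFRAME_RE = re.compile(r"<iframe\b.*?</iframe\s*>", re.IGNORECASE | re.DOTALL)

def clean(html: str) -> str:
    """Return the page with script and iframe elements removed."""
    return IFRAME_RE.sub("", SCRIPT_RE.sub("", html))

if __name__ == "__main__":
    for filename in sys.argv[1:]:
        path = Path(filename)
        original = path.read_text(encoding="utf-8", errors="replace")
        # Keep the untouched download alongside the cleaned copy
        # (".orig" is just an illustrative convention).
        path.with_name(path.name + ".orig").write_text(original, encoding="utf-8")
        path.write_text(clean(original), encoding="utf-8")
        print("cleaned", path)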
Don't try to grab everything, but make the effort to differentiate, and grab whatever is of value to you and would be a loss if you could not find it again. And organize as simply as possible, with attention to accessibility and durability.
For example, I originally organized mine under a top-level directory into numbered archive directories, with subdirectories for each major item or grouping. I kept each numbered archive below what would fit on a CD and wrote a script to generate a flat HTML index at the top of each archive. This made it easy to back up, and made the backups easily accessible and self-contained. I now grow them to DVD size and also rsync to live backup machines, but the same basic organization has been in place for many years.
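As a rough illustration of that index script (only a sketch of the approach, not the script I actually use; the archive layout and the DVD size constant are assumptions), this walks one numbered archive directory, writes a flat index.html at its top linking each item, and reports how full the archive is:

Code:
#!/usr/bin/env python3
"""Minimal sketch: build a flat index.html for one numbered archive directory and
report its total size against DVD capacity. The layout (archive-NNN/<item>/...)
is just an assumption, not a fixed rule."""

import sys
from html import escape
from pathlib import Path

DVD_BYTES = 4_700_000_000  # nominal single-layer DVD capacity

def build_index(archive: Path) -> None:
    total = 0
    rows = []
    for item in sorted(p for p in archive.iterdir() if p.is_dir()):
        size = sum(f.stat().st_size for f in item.rglob("*") if f.is_file())
        total += size
        rows.append(f'<li><a href="{escape(item.name)}/">{escape(item.name)}</a>'
                    f" ({size // 1_000_000} MB)</li>")

    index = archive / "index.html"
    index.write_text(
        "<html><head><title>{0}</title></head><body>\n"
        "<h1>{0}</h1>\n<ul>\n{1}\n</ul>\n</body></html>\n".format(
            escape(archive.name), "\n".join(rows)),
        encoding="utf-8")

    pct = 100 * total / DVD_BYTES
    print(f"{archive.name}: {total // 1_000_000} MB ({pct:.0f}% of a DVD)")

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        build_index(Path(arg))

Because each numbered archive is self-contained, a plain rsync -a of that directory to the backup machines is all it takes to keep the live copies in step.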
So think about what you want to save, how you want to access it, and how valuable it is to you, and plan accordingly! Your GNU/Linux system provides all the tools you need! Good luck!