03-18-2013, 11:57 AM   #1
Registered: Jan 2006
Location: USA
Posts: 543

Searching Cache

So, this isn't exactly a Linux-specific question, but I am looking for some info.

Does anyone know if it's possible to search the cache of the bigger engines like Gool, Bingg, Yahooo? I can see how access to the cache could be sold as a service to, say, the Feds, but can the public get access?

I am working on an issue where some cached pages might contain data that would be a security issue for my customer.
03-19-2013, 08:15 AM   #2
Senior Member
Registered: Jul 2007
Location: Directly above centre of the earth, UK
Distribution: SuSE, plus some hopping
Posts: 4,061

I'm not sure what you think is in, e.g., Google's cache, but it may not work in the way that you think it does.

In any case, to the extent that Google caches things that you are interested in, Google has access to that information. Now, if the question is 'Can some outsider break into Google and get access to stuff that Google didn't intend them to?', then Google would tell you about all of their measures to make this impossible, but if you consider this a very serious outcome, you'd have to say that there can be no guarantee that it can never happen.

For most people there are bigger risks than this, but if you are very sensitive about this particular issue, then you have a problem.

In the one case that I can think of offhand where this kind of thing happened, it wasn't a search engine.
03-19-2013, 11:38 AM   #3
Registered: Jan 2006
Location: USA
Posts: 543

Original Poster
OK, I know how gool cache works. I can query the cache for a specific page to see what that cached page looks like, and this is open to the public. I want to search (query) the public cache itself. That is easier than building a list of URLs, pulling those in via PHP, and then searching using regex.

The customer may have leaked some data. They have since changed their HTML, but the engines' caches may still hold copies of pages that contain this data.

Does this clarify things?
03-19-2013, 12:06 PM   #4
Senior Member
Registered: Nov 2005
Distribution: Arch
Posts: 3,091

Doesn't a normal search query the cache? I mean the whole point of caching is to make searches go faster, right?
03-19-2013, 06:07 PM   #5
Registered: May 2001
Posts: 29,338
Blog Entries: 55

As others said, a regular search does "query the cache", and AFAIK Google doesn't provide an API for bulk cache querying.
03-19-2013, 10:52 PM   #6
Registered: Jan 2006
Location: USA
Posts: 543

Original Poster
Nope. I didn't think I would have to explain this. Google cache is a second, older copy of web pages.

As far as I can tell, Google cache is not something you can query using Google search operators, hence my original question.

Let me explain how the cache works.
  1. Go to Google in FF
  2. In the search field type and click Search
  3. Put the mouse over the result (not web tools)
  4. Now you see a double arrow icon to the right; put your mouse over that
  5. Now look to the right; you see a "cached" link, click that
  6. Or you can simply use in direct Google search

So now you know what the cache is. I am looking for a way to use the engine operators to find specific data in cached pages. We, the public, have access to the latest cached page; can you imagine how many copies gool has? Do you see how this may be useful to, say, the feds or local law enforcement? You change your public Facebook stuff thinking it's gone, yet gool has every change you made. I just need to query for a specific data pattern in the cache that is available to the public. I am thinking I need to build a URI list, use PHP to pull those from gool cache, and then grep the page content for my pattern.
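
A rough sketch of that plan (URI list, pull each cached copy, grep the content), written in Python here rather than PHP. Everything named in it is an assumption for illustration: the `CACHE_PREFIX` URL shape, the helper names, and the idea that the engine will answer scripted requests at all. Automated fetching may be throttled or disallowed by the engine's terms of service.

```python
import re
import urllib.parse
import urllib.request

# Assumption: cached copies are reachable at a URL of this shape.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"


def cache_url(page_url):
    """Build the (assumed) cache URL for one page."""
    return CACHE_PREFIX + urllib.parse.quote(page_url, safe="")


def fetch(url, timeout=10):
    """Download a URL and return its body as text (needs network access)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def grep_cached(page_urls, pattern, fetcher=fetch):
    """Return {page_url: [matches]} for cached pages that contain the pattern."""
    regex = re.compile(pattern)
    hits = {}
    for page_url in page_urls:
        try:
            html = fetcher(cache_url(page_url))
        except OSError:
            continue  # page not cached, or the request was refused
        found = regex.findall(html)
        if found:
            hits[page_url] = found
    return hits
```

Calling `grep_cached(list_of_urls, r"some-pattern")` returns only the pages whose cached copy still contains the pattern. The `fetcher` argument exists so the matching logic can be tried against saved HTML without hitting the network.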

03-19-2013, 11:36 PM   #7
LQ Guru
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,326

If I was going to try that, I'd use Perl with WWW::Mechanize and friends to do it.
Basically you'd be looking for the code that activates that '>>' button.
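
For comparison, a minimal sketch of the "find the link behind that button" step using only Python's standard library instead of Perl. The markup in `SAMPLE` is made up for illustration; the real results page is not shown in this thread and may be structured differently, so matching on the link text "Cached" is an assumption.

```python
from html.parser import HTMLParser


class CachedLinkFinder(HTMLParser):
    """Collect the href of every <a> whose visible text is 'Cached'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip().lower() == "cached":
                self.links.append(self._href)
            self._href = None


def find_cached_links(html):
    """Return the hrefs of all 'Cached' links found in an HTML string."""
    finder = CachedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links


# Hypothetical markup, for illustration only.
SAMPLE = (
    '<div><a href="http://example.com/">Example</a> '
    '<a href="http://cache.example/copy?u=example.com">Cached</a></div>'
)
print(find_cached_links(SAMPLE))
```

If the link is generated by JavaScript rather than present in the HTML, a parser like this will not see it, and a browser-driving tool would be needed instead.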


