Finding hash collisions to recreate a file.
If algorithms can find collisions for broken hash functions, can one recreate a familiar file just from its hash value?
If there isn't a way, why not? Perhaps one could use a filter that relies on language recognition or known quotations, plus an upper estimate of the file size. You could restore the file if it's accidentally altered or deleted, and you could transmit the file with 100% safety!!! No tag suggestions, so I'll let the mods tag it. |
You have only as much information as there are bits in the hash. For a 256-bit hash of a file even as small as 1 kilobyte (8192 bits), there are potentially 2^(8192-256), or about 10^2389, different files that would yield that same hash. Good luck generating and filtering those before the sun burns out.
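As a rough sanity check of that arithmetic, here is a quick back-of-the-envelope sketch using bc (it just counts the decimal digits of 2^(8192-256)):
Code:
# Count the decimal digits of 2^(8192-256); expect about 2389 of them.
# (bc wraps long numbers with backslash-newlines, so strip those first.)
echo "2^(8192-256)" | bc | tr -d '\\\n' | wc -c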
|
Quote:
I recommend not calling it a "hash algorithm", though. Hash algorithms are expected to be, oh, how does Wikipedia put it, non-invertible. BTW, the typical way to restore a document from its hash value is to store both the document and its hash value somewhere, then look up the document by its hash value. That's how rainbow tables (where the "documents" are passwords) work. I suspect that by the time you've fed your algorithm enough information to work reliably, you'll essentially have done that. |
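To illustrate that lookup-table idea, here is a minimal sketch; the paths and the ~/hashstore directory name are just placeholders:
Code:
# Store a copy of each document under its own SHA-256 hash.
mkdir -p ~/hashstore
h=$(sha256sum /path/to/document | cut -d' ' -f1)
cp /path/to/document ~/hashstore/"$h"
# Later, "inverting" the hash is just looking up the stored copy.
cp ~/hashstore/"$h" /tmp/recovered-document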
Moderator response
Moved: This thread is more suitable in <Programming> and has been moved accordingly to help your thread/question get the exposure it deserves.
|
ntubski, thank you. I didn't know about data compression before, but Ubuntu comes with bzip2, and p7zip provides even more formats. Some of the compression formats greatly reduce size, but so far none produce output in alphanumeric characters for easy manual transcription.
dugan, unfortunately I only know some Bash and a tiny bit of C++, but I'm ready to learn. |
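In case it helps with those experiments, here is one simple way to compare a few compressors on the same file (the path is a placeholder; gzip, bzip2 and xz are all in the Ubuntu repositories):
Code:
# Print the compressed size, in bytes, for each compressor at maximum level.
for cmd in "gzip -9c" "bzip2 -9c" "xz -9c"; do
  printf '%-10s %s bytes\n' "$cmd" "$($cmd < /path/to/file | wc -c)"
done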
Quote:
Having that in mind, that's not how hash functions work. No matter how long the input is, the output is always the same size. What you're describing would yield a hash of ⌈length(message)/32⌉ bytes. |
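That fixed-size property is easy to see directly, e.g. with sha256sum: the output is always 64 hex characters (256 bits), no matter how big the input is:
Code:
# A 5-byte input and a 5-megabyte input both produce 64 hex digits.
printf 'hello' | sha256sum
head -c 5M /dev/urandom | sha256sum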
As above, you're not talking about a hash algorithm; you're talking about a compression algorithm designed specifically for plain English ASCII text files.
|
Thank you, ntubski! If preliminary results are accurate (edit: they weren't accurate), using gzip with base64 still reduces the transcription content to less than one-thirtieth of the complete file.
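For reference, the pipeline being described is roughly the following (the paths are placeholders); the final diff prints nothing if the round trip was lossless:
Code:
# Compress, then encode as printable base64 text for manual transcription.
gzip -9c /path/to/file | base64 > /tmp/file.gz.b64
# Reverse: decode the text, decompress, and compare against the original.
base64 -d /tmp/file.gz.b64 | gunzip -c > /tmp/file.restored
diff /path/to/file /tmp/file.restored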
The diff utility gives no output, meaning they're identical. If you don't give the file a gzip suffix like .gz, then gunzip gives an error message like
Code:
my@user:~$ gunzip /path/to/gzip-file
gzip: /path/to/gzip-file: unknown suffix -- ignored |
Quote:
Code:
# With -c, gunzip writes to stdout and doesn't insist on a .gz suffix.
gunzip -c /path/to/gzip-file > /path/to/gunzipped-file |
I compared base64 with ruby-ascii85. ascii85 output is about the same size, but it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.
|
Right, base64 encodes 6 bits per character (log2(64) = 6), while base85 is about 6.4 bits per character (log2(85) ~ 6.40939093614): not a huge difference.
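One quick way to measure that overhead on an actual file (the path is a placeholder; the ratio should come out around 1.35: 8/6 for the encoding itself, plus a little for base64's default line wrapping):
Code:
# Ratio of base64-encoded size to raw size.
orig=$(wc -c < /path/to/gzip-file)
enc=$(base64 < /path/to/gzip-file | wc -c)
echo "scale=3; $enc / $orig" | bc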
|
Preliminary results underestimated the output size, probably because I used repetitious gibberish as the test input. The real ratio is more like 2:5, which is too poor. LZMA2 (.xz) achieves higher compression ratios according to Wikipedia, but those are all 1-to-1 compression algorithms. If it were more like 10,000-to-1, i.e. 10,000 collisions per compressed file, the creator could still filter by familiar characteristics, narrowing it down to a few candidates, which he could then open or save.
The computer will take some time generating and checking all those collisions, and you'll be using brain storage, but those are the only trade-offs. |