LinuxQuestions.org


catatom 06-28-2015 02:25 PM

Finding hash collisions to recreate a file.
 
If algorithms can find collisions for broken hash functions, then can one recreate a familiar file just from its hash value?
If there isn't a way, why not? Perhaps use a filter that relies on language recognition or quotes and an upper estimate of file size. You could restore the file if it's accidentally altered or deleted, and you could transmit the file with 100% safety!!!

No tag suggestions, so I'll let the mods tag it.

rknichols 06-28-2015 07:42 PM

You have only as much information as there are bits in the hash. For a 256 bit hash of a file even as small as 1 Kilobyte (8 Kilobits), there are potentially 2^(8192-256) or about 10^2389 different files that would yield that same hash. Good luck generating and filtering those before the sun burns out.
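
Just to sanity-check that figure (assuming GNU bc is installed; the tr strips bc's line-wrapping so wc -c counts only the digits):
Code:

echo '2^(8192-256)' | bc | tr -d '\\\n' | wc -c
That should print 2389, i.e. a number on the order of 10^2389.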

catatom 06-29-2015 07:28 AM

Quote:

Originally Posted by rknichols (Post 5384223)
You have only as much information as there are bits in the hash. For a 256 bit hash of a file even as small as 1 Kilobyte (8 Kilobits), there are potentially 2^(8192-256) or about 10^2389 different files that would yield that same hash. Good luck generating and filtering those before the sun burns out.

Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text. It's in English, so you apply an English filter. First to the first byte, then to the second byte, then the third. With an algorithm that considers an ubiquitous and memorable feature common to many files, the language, you filter out a multitude of possibilities by generating and evaluating only a small portion common to them. Win-win!

ntubski 06-29-2015 07:51 AM

Quote:

Originally Posted by catatom (Post 5384422)
Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text.

Quote:

Originally Posted by catatom (Post 5384131)
you could transmit the file with 100% safety!!!

If you succeed in doing that, you have designed a compression format. But maybe check first if existing compression algorithms are good enough for your use case.
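
A quick way to see what an off-the-shelf compressor buys you (document.txt is just a placeholder name; gzip ships with virtually every distro):
Code:

wc -c document.txt              # size of the original, in bytes
gzip -c document.txt | wc -c    # size after gzip compression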

dugan 06-29-2015 09:22 AM

Quote:

Originally Posted by catatom (Post 5384422)
Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text. It's in English, so you apply an English filter. First to the first byte, then to the second byte, then the third. With an algorithm that considers an ubiquitous and memorable feature common to many files, the language, you filter out a multitude of possibilities by generating and evaluating only a small portion common to them. Win-win!

Interesting. I await your demo. (Yes, really).

I recommend not calling it a "hash algorithm", though. Hash algorithms are expected to be, oh, how does Wikipedia put it, non-invertible.

BTW, the typical way to restore a document from its hash value is to store both the document and its hash value somewhere, then look up the document by its hash value. That's how rainbow tables (where the "documents" are passwords) work. I suspect that by the time you've fed your algorithm enough information to work reliably, you'll essentially have done that.
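
A minimal sketch of that store-and-look-up idea in shell (file and directory names are made up; sha256sum comes with coreutils):
Code:

# store: keep a copy named after its own hash
h=$(sha256sum document.txt | cut -d' ' -f1)
mkdir -p store && cp document.txt "store/$h"

# later: "recreate" the file from its hash -- by lookup, not by inverting the hash
cp "store/$h" recovered.txt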

onebuck 06-29-2015 09:32 AM

Moderator response
 
Moved: This thread is more suitable in <Programming> and has been moved accordingly to help your thread/question get the exposure it deserves.

catatom 06-29-2015 10:04 AM

ntubski, thank you. I didn't know what data compression was, but Ubuntu comes with bzip2, and p7zip provides even more formats. Some of the compression formats greatly reduce size, but so far none of them produce output in alphanumeric characters for easy manual transcription.

dugan, unfortunately I only know some Bash and a tiny bit of C++, but I'm ready to learn.

mina86 06-29-2015 10:16 AM

Quote:

Originally Posted by catatom (Post 5384422)
Suppose each successive byte n of the hash value represents another 32^n bytes of file text.

First off, I’m assuming you meant 32*n, because I can’t see how 32^n would make any sense.

With that in mind, that’s not how hash functions work. No matter how long the input is, the output is always the same size. What you’re describing would yield a hash of ⌈length(message)/32⌉ bytes.
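
You can see the fixed output size for yourself (sha256sum is from coreutils):
Code:

echo hi | sha256sum                       # 64 hex digits
head -c 1000000 /dev/urandom | sha256sum  # still 64 hex digits for a ~1 MB input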

suicidaleggroll 06-29-2015 10:51 AM

As above, you're not talking about a hash algorithm, you're talking about a compression algorithm that's designed specifically for plain English ASCII text files.

ntubski 06-29-2015 01:54 PM

Quote:

Originally Posted by catatom (Post 5384493)
Some of the compression formats greatly reduce size, but so far nothing in alphanumeric characters for easy manual transcription.

That's because being restricted to alphanumeric output would increase the size. You can pass the binary output through a program like base64 or ascii85 to get an alphanumeric representation.
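
For example, a gzip-plus-base64 round trip might look like this (hypothetical file names):
Code:

gzip -c document.txt | base64 > document.b64    # compress, then encode as text
base64 -d document.b64 | gunzip > restored.txt  # decode, then decompress
diff document.txt restored.txt                  # no output means the files match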

catatom 06-29-2015 04:19 PM

Thank you, ntubski! If my preliminary results are accurate (edit: they weren't), using gzip followed by base64 still reduces the text to be transcribed to less than one-thirtieth of the size of the complete file.
Running diff on the original and the restored file gives no output, meaning they're identical.

If you don't give it a gzip suffix like .gz, then gunzip will give an error message like

my@user:~$ gunzip /path/to/gzip-file
gzip: /path/to/gzip-file: unknown suffix -- ignored

ntubski 06-30-2015 02:34 AM

Quote:

Originally Posted by catatom (Post 5384689)
If you don't give it a gzip suffix like .gz, then gunzip will give an error message like

my@user:~$ gunzip /path/to/gzip-file
gzip: /path/to/gzip-file: unknown suffix -- ignored

You can use
Code:

gunzip -c /path/to/gzip-file > /path/to/gunzipped-file
instead.

catatom 06-30-2015 08:38 AM

I compared base64 with ruby-ascii85. ascii85 output is about the same size, but it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.

ntubski 06-30-2015 08:59 AM

Right, base64 encodes 6 bits per character (log2(64) = 6), while base85 is about 6.4 bits per character (log2(85) ~ 6.40939093614): not a huge difference.

Quote:

it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.
Oh yeah, I was a bit sloppy when I said above that they give alphanumeric output, rather they give text output; base64 is mostly alphanumeric (though obviously there are only so many letters and numbers to go around).
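
As a concrete check of the overhead, base64 turns every 3 input bytes into 4 output characters, so for example (assuming GNU coreutils base64):
Code:

head -c 3000 /dev/urandom | base64 -w0 | tr -d '\n' | wc -c    # prints 4000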

catatom 06-30-2015 12:10 PM

My preliminary results were too optimistic, probably because I used repetitious gibberish as test data. The real ratio is more like 2:5, which is too poor. LZMA2 (.xz) achieves higher compression ratios according to Wikipedia, but those are still 1-to-1 compression algorithms. If it were more like 10,000-to-1, i.e. 10,000 collisions per compressed file, the creator could still filter by familiar characteristics, narrowing it down to a few, which he could then open or save.
The computer will take some time generating and checking all those collisions, and you'll be using brain storage, but those are the only trade-offs.
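
For what it's worth, comparing gzip and xz on the same file is a one-liner each (document.txt is a placeholder; the xz command comes from xz-utils):
Code:

gzip -9 -c document.txt | wc -c    # best gzip compression
xz -9 -c document.txt | wc -c      # best xz (LZMA2) compression, usually smaller on text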

