LinuxQuestions.org


catatom 06-28-2015 02:25 PM

Finding hash collisions to recreate a file.
 
If algorithms can find collisions for broken hash functions, then can one recreate a familiar file just from its hash value?
If there isn't a way, why not? Perhaps use a filter that relies on language recognition or quotes and an upper estimate of file size. You could restore the file if it's accidentally altered or deleted, and you could transmit the file with 100% safety!!!

No tag suggestions, so I'll let the mods tag it.

rknichols 06-28-2015 07:42 PM

You have only as much information as there are bits in the hash. For a 256 bit hash of a file even as small as 1 Kilobyte (8 Kilobits), there are potentially 2^(8192-256) or about 10^2389 different files that would yield that same hash. Good luck generating and filtering those before the sun burns out.
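
Just to sanity-check that figure (assuming GNU bc is installed; the tr strips bc's line-wrapping so wc -c counts only the digits):
Code:

echo '2^(8192-256)' | bc | tr -d '\\\n' | wc -c
That should print 2389, i.e. a number on the order of 10^2389.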

catatom 06-29-2015 07:28 AM

Quote:

Originally Posted by rknichols (Post 5384223)
You have only as much information as there are bits in the hash. For a 256 bit hash of a file even as small as 1 Kilobyte (8 Kilobits), there are potentially 2^(8192-256) or about 10^2389 different files that would yield that same hash. Good luck generating and filtering those before the sun burns out.

Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text. It's in English, so you apply an English filter. First to the first byte, then to the second byte, then the third. With an algorithm that considers an ubiquitous and memorable feature common to many files, the language, you filter out a multitude of possibilities by generating and evaluating only a small portion common to them. Win-win!

ntubski 06-29-2015 07:51 AM

Quote:

Originally Posted by catatom (Post 5384422)
Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text.

Quote:

Originally Posted by catatom (Post 5384131)
you could transmit the file with 100% safety!!!

If you succeed in doing that, you have designed a compression format. But maybe check first if existing compression algorithms are good enough for your use case.
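
A quick way to see what an off-the-shelf compressor buys you (document.txt is just a placeholder name; gzip ships with virtually every distro):
Code:

wc -c document.txt              # size of the original, in bytes
gzip -c document.txt | wc -c    # size after gzip compression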

dugan 06-29-2015 09:22 AM

Quote:

Originally Posted by catatom (Post 5384422)
Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text. It's in English, so you apply an English filter. First to the first byte, then to the second byte, then the third. With an algorithm that considers an ubiquitous and memorable feature common to many files, the language, you filter out a multitude of possibilities by generating and evaluating only a small portion common to them. Win-win!

Interesting. I await your demo. (Yes, really).

I recommend not calling it a "hash algorithm", though. Hash algorithms are expected to be, oh, how does Wikipedia put it, non-invertible.

BTW, the typical way to restore a document from its hash value is to store both the document and its hash value somewhere, then look up the document by its hash value. That's how rainbow tables (where the "documents" are passwords) work. I suspect that by the time you've fed your algorithm enough information to work reliably, you'll essentially have done that.
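
A minimal sketch of that store-and-look-up idea in shell (file and directory names are made up; sha256sum comes with coreutils):
Code:

# store: keep a copy named after its own hash
h=$(sha256sum document.txt | cut -d' ' -f1)
mkdir -p store && cp document.txt "store/$h"

# later: "recreate" the file from its hash -- by lookup, not by inverting the hash
cp "store/$h" recovered.txt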

onebuck 06-29-2015 09:32 AM

Moderator response
 
Moved: This thread is more suitable in <Programming> and has been moved accordingly to help your thread/question get the exposure it deserves.

catatom 06-29-2015 10:04 AM

ntubski, thank you. I didn't know what data compression was, but Ubuntu comes with bzip2, and p7zip provides even more formats. Some of the compression formats greatly reduce size, but so far none of them produce output in alphanumeric characters for easy manual transcription.

dugan, unfortunately I only know some Bash and a tiny bit of C++, but I'm ready to learn.

mina86 06-29-2015 10:16 AM

Quote:

Originally Posted by catatom (Post 5384422)
Suppose each successive byte n of the hash value represents another 32^n bytes of file text.

First off, I’m assuming you meant 32*n, because I can’t see how 32^n would make any sense.

With that in mind, that’s not how hash functions work. No matter how long the input is, the output is always the same size. What you’re describing would yield a hash of ⌈length(message)/32⌉ bytes.
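
You can see the fixed output size for yourself (sha256sum is from coreutils):
Code:

echo hi | sha256sum                       # 64 hex digits
head -c 1000000 /dev/urandom | sha256sum  # still 64 hex digits for a ~1 MB input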

suicidaleggroll 06-29-2015 10:51 AM

As above, you're not talking about a hash algorithm, you're talking about a compression algorithm that's designed specifically for plain English ASCII text files.

ntubski 06-29-2015 01:54 PM

Quote:

Originally Posted by catatom (Post 5384493)
Some of the compression formats greatly reduce size, but so far nothing in alphanumeric characters for easy manual transcription.

That's because being restricted to alphanumeric output would increase the size. You can pass the binary output through a program like base64 or ascii85 to get an alphanumeric representation.
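
For example, a gzip-plus-base64 round trip might look like this (hypothetical file names):
Code:

gzip -c document.txt | base64 > document.b64    # compress, then encode as text
base64 -d document.b64 | gunzip > restored.txt  # decode, then decompress
diff document.txt restored.txt                  # no output means the files match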

catatom 06-29-2015 04:19 PM

Thank you, ntubski! If my preliminary results are accurate (edit: they weren't), using gzip followed by base64 still reduces the text to be transcribed to less than one-thirtieth of the size of the complete file.
Running diff on the original and the restored file gives no output, meaning they're identical.

If you don't give it a gzip suffix like .gz, then gunzip will give an error message like

my@user:~$ gunzip /path/to/gzip-file
gzip: /path/to/gzip-file: unknown suffix -- ignored

ntubski 06-30-2015 02:34 AM

Quote:

Originally Posted by catatom (Post 5384689)
If you don't give it a gzip suffix like .gz, then gunzip will give an error message like

my@user:~$ gunzip /path/to/gzip-file
gzip: /path/to/gzip-file: unknown suffix -- ignored

You can use
Code:

gunzip -c /path/to/gzip-file > /path/to/gunzipped-file
instead.

catatom 06-30-2015 08:38 AM

I compared base64 with ruby-ascii85. ascii85 output is about the same size, but it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.

ntubski 06-30-2015 08:59 AM

Right, base64 encodes 6 bits per character (log2(64) = 6), while base85 is about 6.4 bits per character (log2(85) ~ 6.40939093614): not a huge difference.

Quote:

it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.
Oh yeah, I was a bit sloppy when I said above that they give alphanumeric output, rather they give text output; base64 is mostly alphanumeric (though obviously there are only so many letters and numbers to go around).
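
As a concrete check of the overhead, base64 turns every 3 input bytes into 4 output characters, so for example (assuming GNU coreutils base64):
Code:

head -c 3000 /dev/urandom | base64 -w0 | tr -d '\n' | wc -c    # prints 4000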

catatom 06-30-2015 12:10 PM

My preliminary results were too optimistic, probably because I used repetitious gibberish as test data. The real ratio is more like 2:5, which is too poor. LZMA2 (.xz) achieves higher compression ratios according to Wikipedia, but those are still 1-to-1 compression algorithms. If it were more like 10,000-to-1, i.e. 10,000 collisions per compressed file, the creator could still filter by familiar characteristics, narrowing it down to a few, which he could then open or save.
The computer will take some time generating and checking all those collisions, and you'll be using brain storage, but those are the only trade-offs.
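
For what it's worth, comparing gzip and xz on the same file is a one-liner each (document.txt is a placeholder; the xz command comes from xz-utils):
Code:

gzip -9 -c document.txt | wc -c    # best gzip compression
xz -9 -c document.txt | wc -c      # best xz (LZMA2) compression, usually smaller on text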

