Old 06-28-2015, 02:25 PM   #1
catatom
LQ Newbie
 
Registered: Oct 2014
Posts: 23

Rep: Reputation: Disabled
Finding hash collisions to recreate a file.


If algorithms can find collisions for broken hash functions, can one recreate a familiar file just from its hash value?
If there isn't a way, why not? Perhaps use a filter that relies on language recognition or quotes and an upper estimate of file size. You could restore the file if it's accidentally altered or deleted, and you could transmit the file with 100% safety!!!

No tag suggestions, so I'll let the mods tag it.
 
Old 06-28-2015, 07:42 PM   #2
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,770

Rep: Reputation: 2210
You have only as much information as there are bits in the hash. For a 256-bit hash of a file even as small as 1 kilobyte (8192 bits), there are potentially 2^(8192-256), or about 10^2389, different files that would yield that same hash. Good luck generating and filtering those before the sun burns out.
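A quick sanity check of that exponent, if you happen to have GNU bc installed:
Code:
# log10 of the number of 1 KiB (8192-bit) files sharing one 256-bit hash
echo "(8192-256)*l(2)/l(10)" | bc -l
That prints roughly 2388.97, i.e. about 10^2389 candidate files.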
 
Old 06-29-2015, 07:28 AM   #3
catatom
LQ Newbie
 
Registered: Oct 2014
Posts: 23

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by rknichols
You have only as much information as there are bits in the hash. For a 256-bit hash of a file even as small as 1 kilobyte (8192 bits), there are potentially 2^(8192-256), or about 10^2389, different files that would yield that same hash. Good luck generating and filtering those before the sun burns out.
Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text. It's in English, so you apply an English filter. First to the first byte, then to the second byte, then the third. With an algorithm that considers a ubiquitous and memorable feature common to many files, the language, you filter out a multitude of possibilities by generating and evaluating only a small portion common to them. Win-win!

Last edited by catatom; 06-29-2015 at 07:54 AM.
 
Old 06-29-2015, 07:51 AM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,774

Rep: Reputation: 2081
Quote:
Originally Posted by catatom
Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text.
Quote:
Originally Posted by catatom
you could transmit the file with 100% safety!!!
If you succeed in doing that, you have designed a compression format. But maybe check first if existing compression algorithms are good enough for your use case.
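To get a feel for what ordinary compression already buys you, something like this gives a quick size comparison (the filename is just a placeholder):
Code:
wc -c < yourfile.txt              # original size in bytes
gzip -9 -c yourfile.txt | wc -c   # compressed size in bytes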
 
Old 06-29-2015, 09:22 AM   #5
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,200

Rep: Reputation: 5307
Quote:
Originally Posted by catatom
Maybe it's still doable with a hash designed for low level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text. It's in English, so you apply an English filter. First to the first byte, then to the second byte, then the third. With an algorithm that considers a ubiquitous and memorable feature common to many files, the language, you filter out a multitude of possibilities by generating and evaluating only a small portion common to them. Win-win!
Interesting. I await your demo. (Yes, really).

I recommend not calling it a "hash algorithm", though. Hash algorithms are expected to be, oh, how does Wikipedia put it, non-invertible.

BTW, the typical way to restore a document from its hash value is to store both the document and its hash value somewhere, then look up the document by its hash value. That's how rainbow tables (where the "documents" are passwords) work. I suspect that by the time you've fed your algorithm enough information to work reliably, you'll essentially have done that.
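A minimal sketch of that store-and-look-up idea in shell, with hypothetical filenames:
Code:
mkdir -p store
h=$(sha256sum document.txt | cut -d' ' -f1)   # hash the document
cp document.txt "store/$h"                    # file it away under its own hash
# later, given only the hash value in $h, recover the document by lookup:
cp "store/$h" recovered.txt
No inversion of the hash is happening there; the information comes from the stored copy, not from the hash itself.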

Last edited by dugan; 06-29-2015 at 09:36 AM.
 
Old 06-29-2015, 09:32 AM   #6
onebuck
Moderator
 
Registered: Jan 2005
Location: Central Florida 20 minutes from Disney World
Distribution: Slackware®
Posts: 13,922
Blog Entries: 44

Rep: Reputation: 3158
Moderator response

Moved: This thread is more suitable in <Programming> and has been moved accordingly to help your thread/question get the exposure it deserves.
 
Old 06-29-2015, 10:04 AM   #7
catatom
LQ Newbie
 
Registered: Oct 2014
Posts: 23

Original Poster
Rep: Reputation: Disabled
ntubski, thank you. I didn't know what data compression was, but Ubuntu comes with bzip2, and p7zip provides even more. Some of the compression formats greatly reduce size, but so far none of them output alphanumeric characters for easy manual transcription.

dugan, unfortunately I only know some Bash and a tiny bit of C++, but I'm ready to learn.

Last edited by catatom; 06-29-2015 at 01:01 PM.
 
Old 06-29-2015, 10:16 AM   #8
mina86
Member
 
Registered: Aug 2008
Distribution: Debian
Posts: 517

Rep: Reputation: 229
Quote:
Originally Posted by catatom
Suppose each successive byte n of the hash value represents another 32^n bytes of file text.
First off, I’m assuming you meant 32*n because I can’t see how 32^n would make any sense.

With that in mind, that’s not how hash functions work. No matter how long the input is, the output is always the same size. What you’re describing would yield a hash of ⌈length(message)/32⌉ bytes.
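You can see the fixed output size directly with any of the standard tools, e.g. sha256sum from coreutils:
Code:
printf 'a' | sha256sum                   # 64 hex digits = 256 bits
head -c 1000000 /dev/zero | sha256sum    # 1 MB of input: still 64 hex digits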
 
Old 06-29-2015, 10:51 AM   #9
suicidaleggroll
LQ Guru
 
Registered: Nov 2010
Location: Colorado
Distribution: OpenSUSE, CentOS
Posts: 5,573

Rep: Reputation: 2142
As above, you're not talking about a hash algorithm, you're talking about a compression algorithm that's designed specifically for plain English ASCII text files.
 
Old 06-29-2015, 01:54 PM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,774

Rep: Reputation: 2081
Quote:
Originally Posted by catatom
Some of the compression formats greatly reduce size, but so far none of them output alphanumeric characters for easy manual transcription.
That's because being restricted to alphanumeric outputs would increase size. You can pass the binary output through a program like base64 or ascii85 to get an alphanumeric representation.
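For example (the filename is just a placeholder):
Code:
gzip -c file.txt | base64 > file.b64   # gzip's binary output as plain text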
 
Old 06-29-2015, 04:19 PM   #11
catatom
LQ Newbie
 
Registered: Oct 2014
Posts: 23

Original Poster
Rep: Reputation: Disabled
Thank you, ntubski! If my preliminary results are accurate (edit: they weren't), using gzip with base64 still reduces the transcription content to less than one-thirtieth of the complete file.
Running diff on the original and the decompressed copy gives no output, meaning they're identical.
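The round trip I checked looked roughly like this (filenames are placeholders):
Code:
gzip -9 -c original.txt | base64 > transcribed.b64
base64 -d transcribed.b64 > copy.txt.gz
gunzip copy.txt.gz                 # produces copy.txt
diff original.txt copy.txt         # no output means identical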

If you don't give it a gzip suffix like .gz, then gunzip will give an error message like

my@user:~$ gunzip /path/to/gzip-file
gzip: /path/to/gzip-file: unknown suffix -- ignored

Last edited by catatom; 06-30-2015 at 12:11 PM.
 
Old 06-30-2015, 02:34 AM   #12
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,774

Rep: Reputation: 2081
Quote:
Originally Posted by catatom
If you don't give it a gzip suffix like .gz, then gunzip will give an error message like

my@user:~$ gunzip /path/to/gzip-file
gzip: /path/to/gzip-file: unknown suffix -- ignored
You can use
Code:
gunzip -c /path/to/gzip-file > /path/to/gunzipped-file
instead.
 
Old 06-30-2015, 08:38 AM   #13
catatom
LQ Newbie
 
Registered: Oct 2014
Posts: 23

Original Poster
Rep: Reputation: Disabled
I compared base64 with ruby-ascii85. The ascii85 output is about the same size, but it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.
 
Old 06-30-2015, 08:59 AM   #14
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,774

Rep: Reputation: 2081
Right, base64 encodes 6 bits per character (log2(64) = 6), while base85 is about 6.4 bits per character (log2(85) ~ 6.40939093614): not a huge difference.
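To put that in concrete terms, for a 1000-byte input (ignoring padding and line breaks):
Code:
echo "1000*8/6" | bc       # base64: about 1333 characters
echo "1000*8/6.4" | bc -l  # ascii85: about 1250 characters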

Quote:
it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.
Oh yeah, I was a bit sloppy when I said above that they give alphanumeric output; rather, they give text output. base64 is mostly alphanumeric (though obviously there are only so many letters and numbers to go around).
 
Old 06-30-2015, 12:10 PM   #15
catatom
LQ Newbie
 
Registered: Oct 2014
Posts: 23

Original Poster
Rep: Reputation: Disabled
My preliminary results were an underestimate, probably because I used repetitious gibberish as test data. The real ratio is more like 2:5, which is too poor. LZMA2 (.xz) achieves higher compression ratios according to Wikipedia, but those are still 1-to-1 compression algorithms. If it were more like 10,000-to-1, i.e. 10,000 collisions per compressed file, the creator could still filter by familiar characteristics, narrowing it down to a few, which he could then open or save.
The computer will take some time generating and checking all those collisions, and you'll be using brain storage, but those are the only trade-offs.
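Comparing the compressors on the same file is easy enough (the filename is a placeholder):
Code:
gzip -9 -c file.txt | wc -c    # sizes in bytes
bzip2 -9 -c file.txt | wc -c
xz -9 -c file.txt | wc -c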

Last edited by catatom; 06-30-2015 at 05:26 PM.
 
  

