Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
If algorithms find collisions for broken hash functions, then can one recreate a familiar file just from its hash value?
If there isn't a way, why not? Perhaps one could use a filter that relies on language recognition or known quotes, plus an upper estimate of the file size. You could restore the file if it's accidentally altered or deleted, and you could transmit the file with 100% safety!!!
You have only as much information as there are bits in the hash. For a 256-bit hash of a file even as small as 1 kilobyte (8192 bits), there are potentially 2^(8192-256) = 2^7936, or about 10^2389, different files that would yield that same hash. Good luck generating and filtering those before the sun burns out.
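A rough back-of-the-envelope check of that figure, assuming GNU bc and its math library are available:
Code:
# log10 of the number of distinct 1-kilobyte files per 256-bit hash value
echo "l(2)*(8192-256)/l(10)" | bc -l
# prints roughly 2389, i.e. about 10^2389 candidate files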
Maybe it's still doable with a hash designed for low-level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text. The file is in English, so you apply an English filter: first to the first byte, then to the second, then the third. With an algorithm that exploits a ubiquitous and memorable feature common to many files, the language, you filter out a multitude of possibilities by generating and evaluating only the small portion common to them. Win-win!
Quote:
Originally Posted by catatom
you could transmit the file with 100% safety!!!
If you succeed in doing that, you have designed a compression format. But maybe check first if existing compression algorithms are good enough for your use case.
Quote:
Originally Posted by catatom
Maybe it's still doable with a hash designed for low-level filtering. Suppose each successive byte n of the hash value represents another 32^n bytes of file text. The file is in English, so you apply an English filter: first to the first byte, then to the second, then the third. With an algorithm that exploits a ubiquitous and memorable feature common to many files, the language, you filter out a multitude of possibilities by generating and evaluating only the small portion common to them. Win-win!
Interesting. I await your demo. (Yes, really).
I recommend not calling it a "hash algorithm", though. Hash algorithms are expected to be, oh, how does Wikipedia put it, non-invertible.
BTW, the typical way to restore a document from its hash value is to store both the document and its hash value somewhere, then look up the document by its hash value. That's how rainbow tables (where the "documents" are passwords) work. I suspect that by the time you've fed your algorithm enough information to work reliably, you'll essentially have done that.
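For illustration, a minimal shell sketch of that store-and-look-up idea; the docs/ directory and the hash shown are hypothetical placeholders, not anything from this thread:
Code:
#!/bin/bash
# Index every document by its SHA-256 hash, then "recover" one by looking it up.
declare -A table                          # hash -> file path
for f in docs/*; do                       # hypothetical directory of documents
    h=$(sha256sum "$f" | cut -d' ' -f1)
    table[$h]=$f
done
wanted=0123abc...                         # the hash you were handed (placeholder)
echo "${table[$wanted]:-no stored file has that hash}"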
ntubski, thank you. I didn't know what data compression was, but Ubuntu comes with bzip2, and p7zip provides even more formats. Some of the compression formats greatly reduce the size, but so far nothing outputs alphanumeric characters for easy manual transcription.
dugan, unfortunately I only know some Bash and a tiny bit of C++, but I'm ready to learn.
Quote:
Originally Posted by catatom
Suppose each successive byte n of the hash value represents another 32^n bytes of file text.
First off, I'm assuming you meant 32*n, because I can't see how 32^n would make any sense.
With that in mind, that's not how hash functions work. No matter how long the input is, the output is always the same size. What you're describing would yield a "hash" of ⌈length(message)/32⌉ bytes, which keeps growing with the input: a 1 KiB file would already need ⌈1024/32⌉ = 32 bytes, and a 1 MiB file would need 32768 bytes.
As above, you're not talking about a hash algorithm; you're talking about a compression algorithm designed specifically for plain English ASCII text files.
Quote:
Originally Posted by catatom
Some of the compression formats greatly reduce the size, but so far nothing outputs alphanumeric characters for easy manual transcription.
That's because being restricted to alphanumeric output would increase the size. You can pass the binary output through a program like base64 or ascii85 to get an alphanumeric representation.
Thank you, ntubski! If preliminary results are accurate (edit: they weren't accurate), using gzip with base64 still reduces the transcription content to less than one-thirtieth of the complete file.
The diff utility gives no output, meaning they're identical.
If you don't give it a gzip suffix like .gz, then gunzip gives an "unknown suffix" error and skips the file.
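For reference, here's a sketch of that round trip done entirely through pipes, which also sidesteps the suffix issue because gunzip reads from stdin (the file name is just an example):
Code:
# compress, then turn the binary output into transcribable text
gzip -c notes.txt | base64 > notes.b64
# reverse the pipeline and verify the restore
base64 -d notes.b64 | gunzip -c > restored.txt
diff notes.txt restored.txt && echo "identical"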
I compared base64 with ruby-ascii85. ascii85 output is about the same size, but it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.
Right, base64 encodes 6 bits per character (log2(64) = 6), while base85 is about 6.4 bits per character (log2(85) ~ 6.40939093614): not a huge difference.
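Those figures are easy to reproduce with bc's math library, if you want to check them:
Code:
echo "l(64)/l(2); l(85)/l(2)" | bc -l    # about 6 and about 6.41 bits per character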
Quote:
it uses lots of non-alphanumeric keyboard characters that will make manual transcription difficult.
Oh yeah, I was a bit sloppy when I said above that they give alphanumeric output; rather, they give text output. base64 is mostly alphanumeric (though obviously there are only so many letters and numbers to go around).
My preliminary results were too optimistic, probably because I used repetitious gibberish as test data. The real ratio is more like 2:5, which is too poor. LZMA2 (.xz) achieves higher compression ratios according to Wikipedia, but those are still 1-to-1 compression algorithms. If it were more like 10,000-to-1, i.e. 10,000 collisions per compressed file, the creator could still filter by familiar characteristics, narrowing it down to a few candidates, which he could then open or save.
The computer will take some time generating and checking all those collisions, and you'll be using brain storage, but those are the only trade-offs.
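As a side note on the xz comparison, here's a quick sketch for seeing how many characters each format would actually leave you to transcribe (the file name is hypothetical):
Code:
# rough character counts after compressing and base64-encoding the same file
for tool in "gzip -9" "xz -9"; do
    echo "$tool + base64: $($tool -c notes.txt | base64 | wc -c) characters"
done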