Ways to compare values in large file/db
in the world of risc or intel, given any OS, is there a method(s) to compare two values of large byte length for equality in an effort to get convergence in least amount of time?
as example, say i create a large flat file database that has two fields per row; string|md5-hash this file contains the md5 hashes for all unique strings ranging in size 1-64 bytes long, and, each byte comes from pool of 256 ascii char. now, i create a search agent, takes a md5 input and then searches the database for match(es) and returns the associated string. so, would it be faster to search for subset indicators vs doing a grep (or similar) for the whole md5 input? as example: my very large database has this hash pair Code:
n|7B8B965AD4BCA0E41AB51DE7B31363A1 is there a better way to build a "converging search" that would converge faster than just searching for the whole input string? what about adding fields to the database while hash is being created, like this: Code:
4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|string|md5-hash so, break up 7B8B965AD4BCA0E41AB51DE7B31363A1 to Code:
7B8B 965A D4BC A0E4 1AB5 1DE7 B313 63A1 Code:
7B8B = 7+11+8+11 = 37 Code:
37|30|40|28|27|35|18|20|n|7B8B965AD4BCA0E41AB51DE7B31363A1 as you can see, the byte sizes in question have been reduced in size (matching eight values of 2 bytes each vs 32 bytes of md5 hash), but i am not sure if such a method would speed things up or slow them down. i am only interested in speeding up the search, creating the db is not in question. noted: i have increased the collision pool with false-negatives using this method. it might not make sense in this method to compare the input hash against db hash after all segment matches were true, etc, if i did then using this method is of no value, etc. i guess if this method was way faster then simple grep and the match was way way down in the file then possibly its faster ?? eg; if this method can reach the bottom 1/3 of the large db file 10x faster than grep alone, then i have time to compare input hash to actual db hash and still being faster than grep alone, etc. your thoughts? |
I'm guessing this is rainbow tables? .. you don't really care about searching by the string so it would better to have the flat file sorted by the hash then do a binary search, no?
|
Quote:
how would i sort a bunch of 32byte strings that are in hex? standard sort is good enough? to sort means i need to 1st generate the complete flat file, its gonna be big 256^64 lines in the db file. what system can handle sorting a file that has 1.3407 * 10^154 lines ?? the 1st set of strings is 1byte with 256 variations, + 32bytes hash/string Code:
file size(bytes) = summation (N*256^N + (256^N * 32) + 256^N) [N=1 to 64] 1. disk size is a problem. does google even have this amount of space? 2. at 10k hashes/sec it would take 1.3407 * 10^150 sec to generate the file! thats 4.251 x 10^142 years !!!! would need to bump that to billions of trillions of hashes/sec to make it feasible. so not only do i need the disk space, would also need mack-daddy processing that not even google has. the task at hand is not feasible by me. where does NSA keep this table?? |
There are several apps designed to use gpu's for heavy math grinding, from memory rainbowcrack allowed generating multiple files of fixed sizes so you could go down a similar path.
The new push for passwords to be made longer rather than increasing complexity is one of the reasons for the decline of rt, if my password is 'I love a good bacon and egg breakfast on Sundays!!' having a precomputed hash is out of reach for most people due to the massive storage requirements of the db. |
all 256 ascii chars 64bytes wide will capture all phrases (in that char set) that are 64bytes or less. i do not believe its a manageable solution using phrases as it will burden systems with the password-reset process, which equates to loss of productivity.
i could reduce the char set to only those chars used most often from the keyboard, namely [A-Za-z0-9`~-_=+[{]};:'",<.>/?!@#$%^&*()], but i have used ALT# before to get chars outside standard keyboard (most dont even know what ALT# produces, etc). reducing the char set from 256 to 94 changes db size significantly. Code:
still a big file |
I'm not sure, I think the bottleneck would be the actual storage speed.. maybe:
* change to a filesystem that gives best performance for large files (xfs? .. need to research) * increase number of spindles backing the volume * multithreading: serveral search threads - give each thread a range to search like 512MB blocks Rethinking: I can't see that your method would be any faster, sorry |
i am thinking i write some hashing code and test it out on amazon ec2 high capacity gpu cluster instance. spend a few $$ and see how far it gets.
|
Cool, let us know
|
Quote:
ok, ec2 is the easy part. anyone have some code they would like to share (C source or shell script) which is ideal to generate md5 of all strings 1-16 bytes wide using char set Code:
[A-Za-z0-9`~-_=+[{]};:'",<.>/?!@#$%^&*()] file should be written "string,hash" >> file we know total size of file(s) will be Code:
file size(bytes) = sigma[1,16,(N*94^N + (94^N * 32) + 94^N)] guessing we also need a timer in the code so we know how long it takes to gen the file(s). perhaps last entry made into the file is total time. |
All times are GMT -5. The time now is 09:43 AM. |