LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   Ways to compare values in large file/db (http://www.linuxquestions.org/questions/programming-9/ways-to-compare-values-in-large-file-db-924277/)

Linux_Kidd 01-17-2012 02:06 PM

Ways to compare values in large file/db
 
in the world of risc or intel, given any OS, is there a method(s) to compare two values of large byte length for equality in an effort to get convergence in least amount of time?

as example, say i create a large flat file database that has two fields per row; string|md5-hash

this file contains the md5 hashes for all unique strings ranging in size 1-64 bytes long, and, each byte comes from pool of 256 ascii char.

now, i create a search agent, takes a md5 input and then searches the database for match(es) and returns the associated string.

so, would it be faster to search for subset indicators vs doing a grep (or similar) for the whole md5 input?

as example:
my very large database has this hash pair
Code:

n|7B8B965AD4BCA0E41AB51DE7B31363A1

is there a better way to build a "converging search" that would converge faster than just searching for the whole input string?

what about adding fields to the database while hash is being created, like this:
Code:

4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|string|md5-hash
where 4byte-value is a weird "hex-to-decimal" conversion of each sequential 4byte hex chars

so, break up 7B8B965AD4BCA0E41AB51DE7B31363A1 to
Code:

7B8B 965A D4BC A0E4 1AB5 1DE7 B313 63A1
then add each char in each group by converting hex chars to decimal and adding

Code:

7B8B = 7+11+8+11 = 37
965A = 9+6+5+10 = 30
D4BC = 13+4+11+12 = 40
A0E4 = 10+0+14+4 = 28
1AB5 = 1+10+11+5 = 27
1DE7 = 1+13+14+7 = 35
B313 = 11+3+1+3 = 18
63A1 = 6+3+10+1 = 20

so now each line in my db looks something like this
Code:

37|30|40|28|27|35|18|20|n|7B8B965AD4BCA0E41AB51DE7B31363A1
so now when i start my search i break down the input into these eight segments, calc the segment values, then try to 1st match 1st input segment value to the 1st field of a line in db, and if a match exists continue to try and match 2nd segment, etc etc. if a field doesnt match then move on to the next line starting over from segment one, and if all match then i have a hit that can be added to the "hit pool".

as you can see, the byte sizes in question have been reduced in size (matching eight values of 2 bytes each vs 32 bytes of md5 hash), but i am not sure if such a method would speed things up or slow them down. i am only interested in speeding up the search, creating the db is not in question.

noted: i have increased the collision pool with false-negatives using this method. it might not make sense in this method to compare the input hash against db hash after all segment matches were true, etc, if i did then using this method is of no value, etc. i guess if this method was way faster then simple grep and the match was way way down in the file then possibly its faster ?? eg; if this method can reach the bottom 1/3 of the large db file 10x faster than grep alone, then i have time to compare input hash to actual db hash and still being faster than grep alone, etc.

your thoughts?

kbp 01-17-2012 04:59 PM

I'm guessing this is rainbow tables? .. you don't really care about searching by the string so it would better to have the flat file sorted by the hash then do a binary search, no?

Linux_Kidd 01-17-2012 09:33 PM

Quote:

Originally Posted by kbp (Post 4577275)
I'm guessing this is rainbow tables? .. you don't really care about searching by the string so it would better to have the flat file sorted by the hash then do a binary search, no?

ah, the log2N+1 algorithm. that seems to be a better way than my linear approach.

how would i sort a bunch of 32byte strings that are in hex? standard sort is good enough?

to sort means i need to 1st generate the complete flat file, its gonna be big 256^64 lines in the db file. what system can handle sorting a file that has 1.3407 * 10^154 lines ??

the 1st set of strings is 1byte with 256 variations, + 32bytes hash/string
Code:

file size(bytes) = summation (N*256^N + (256^N * 32) + 256^N) [N=1 to 64]
the last term is for field separator

1. disk size is a problem. does google even have this amount of space?
2. at 10k hashes/sec it would take 1.3407 * 10^150 sec to generate the file! thats 4.251 x 10^142 years !!!! would need to bump that to billions of trillions of hashes/sec to make it feasible. so not only do i need the disk space, would also need mack-daddy processing that not even google has.

the task at hand is not feasible by me. where does NSA keep this table??

kbp 01-17-2012 10:56 PM

There are several apps designed to use gpu's for heavy math grinding, from memory rainbowcrack allowed generating multiple files of fixed sizes so you could go down a similar path.

The new push for passwords to be made longer rather than increasing complexity is one of the reasons for the decline of rt, if my password is 'I love a good bacon and egg breakfast on Sundays!!' having a precomputed hash is out of reach for most people due to the massive storage requirements of the db.

Linux_Kidd 01-18-2012 08:52 AM

all 256 ascii chars 64bytes wide will capture all phrases (in that char set) that are 64bytes or less. i do not believe its a manageable solution using phrases as it will burden systems with the password-reset process, which equates to loss of productivity.

i could reduce the char set to only those chars used most often from the keyboard, namely [A-Za-z0-9`~-_=+[{]};:'",<.>/?!@#$%^&*()], but i have used ALT# before to get chars outside standard keyboard (most dont even know what ALT# produces, etc). reducing the char set from 256 to 94 changes db size significantly.

Code:

still a big file
file size(bytes) = summation (N*94^N + (94^N * 32) + 94^N) [N=1 to 64]

if i can sort the file then log2N+1 method makes the most sense. if it has to be a linear search then would my method be something better/faster than grep?

kbp 01-18-2012 04:41 PM

I'm not sure, I think the bottleneck would be the actual storage speed.. maybe:

* change to a filesystem that gives best performance for large files (xfs? .. need to research)
* increase number of spindles backing the volume
* multithreading: serveral search threads - give each thread a range to search like 512MB blocks

Rethinking: I can't see that your method would be any faster, sorry

Linux_Kidd 01-22-2012 11:37 PM

i am thinking i write some hashing code and test it out on amazon ec2 high capacity gpu cluster instance. spend a few $$ and see how far it gets.

kbp 01-23-2012 07:04 AM

Cool, let us know

Linux_Kidd 01-23-2012 12:40 PM

Quote:

Originally Posted by kbp (Post 4582021)
Cool, let us know

sounds like a worthwhile experiment.

ok, ec2 is the easy part. anyone have some code they would like to share (C source or shell script) which is ideal to generate md5 of all strings 1-16 bytes wide using char set
Code:

[A-Za-z0-9`~-_=+[{]};:'",<.>/?!@#$%^&*()]
i dont currently know if ec2 has options for filesystem type and block size, but we can assume that the output file should not reach file size max. not sure if i can get ResierFS 3.6 (1EB max) or maybe ext3/8k block (64TB max).

file should be written "string,hash" >> file
we know total size of file(s) will be
Code:

file size(bytes) = sigma[1,16,(N*94^N + (94^N * 32) + 94^N)]
= big file

youch, ec2 only offers 1.69TB of local instance storage.

guessing we also need a timer in the code so we know how long it takes to gen the file(s). perhaps last entry made into the file is total time.


All times are GMT -5. The time now is 02:41 PM.