LinuxQuestions.org
Old 01-17-2012, 01:06 PM   #1
Linux_Kidd (Member)
Ways to compare values in large file/db


In the world of RISC or x86, on any OS, is there a method for comparing two values of large byte length for equality that converges in the least amount of time?

As an example, say I create a large flat-file database that has two fields per row: string|md5-hash

This file contains the MD5 hashes for all unique strings from 1 to 64 bytes long, where each byte is drawn from a pool of 256 possible byte values.

Now I create a search agent that takes an MD5 hash as input, searches the database for a match (or matches), and returns the associated string.

So, would it be faster to search for subset indicators than to grep (or similar) for the whole MD5 input?

As an example, my very large database contains this hash pair:
Code:
n|7B8B965AD4BCA0E41AB51DE7B31363A1

Is there a better way to build a "converging search" that converges faster than simply searching for the whole input string?

What about adding fields to the database while the hash is being created, like this:
Code:
4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|4byte-value|string|md5-hash
where each 4byte-value is a "hex-to-decimal" conversion of one sequential group of four hex characters.

So, break 7B8B965AD4BCA0E41AB51DE7B31363A1 into
Code:
7B8B 965A D4BC A0E4 1AB5 1DE7 B313 63A1
then sum the digits in each group, converting each hex character to its decimal value:

Code:
7B8B = 7+11+8+11 = 37
965A = 9+6+5+10 = 30
D4BC = 13+4+11+12 = 40
A0E4 = 10+0+14+4 = 28
1AB5 = 1+10+11+5 = 27
1DE7 = 1+13+14+7 = 35
B313 = 11+3+1+3 = 18
63A1 = 6+3+10+1 = 20
So now each line in my db looks something like this:
Code:
37|30|40|28|27|35|18|20|n|7B8B965AD4BCA0E41AB51DE7B31363A1
So now when I start my search, I break the input into these eight segments and calculate the segment values. I first try to match the first segment value against the first field of a line in the db; if it matches, I continue to the second segment, and so on. If a field doesn't match, I move on to the next line and start over from segment one; if all eight match, I have a hit that can be added to the "hit pool".

As you can see, the quantities being compared are smaller (eight 2-byte values instead of a 32-byte MD5 hash), but I am not sure whether such a method would speed things up or slow them down. I am only interested in speeding up the search; creating the db is not in question.

Noted: this method enlarges the collision pool with false positives, since different hashes can produce the same eight segment sums. That means I would still have to compare the input hash against the db hash after all eight segments match; if that verification costs more than the segment matching saves, the method is of no value. But if this method is much faster than a simple grep and the match is way down in the file, it could still win. E.g., if it can reach the bottom third of the large db file 10x faster than grep alone, then I have time to verify the input hash against the actual db hash and still come out ahead.
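To make the scheme concrete, here is a rough Python sketch of the segment-sum idea described above (function and field names are mine, purely illustrative):

```python
import hashlib

# Hex-digit value table for the "hex-to-decimal" digit sums described above.
HEXVAL = {c: int(c, 16) for c in "0123456789abcdef"}

def segment_sums(md5_hex):
    """Split a 32-char MD5 hex digest into eight 4-char groups and
    sum the hex-digit values of each group."""
    h = md5_hex.lower()
    return [sum(HEXVAL[c] for c in h[i:i + 4]) for i in range(0, 32, 4)]

def make_row(s):
    """Build one db row: eight segment sums, then the string, then the hash."""
    digest = hashlib.md5(s.encode()).hexdigest().upper()
    return segment_sums(digest) + [s, digest]

def segments_match(input_hash, row):
    """Compare segment by segment, bailing out at the first mismatch.
    A True result is only a candidate: different hashes can share all
    eight sums, so the full hash still needs a final comparison."""
    sums = segment_sums(input_hash)
    return all(a == b for a, b in zip(sums, row[:8]))

row = make_row("n")
print(row)  # [37, 30, 40, 28, 27, 35, 18, 20, 'n', '7B8B...63A1']
print(segments_match("7B8B965AD4BCA0E41AB51DE7B31363A1", row))  # True
```

The early bail-out is where any speedup would have to come from; whether eight small comparisons beat one 32-byte memcmp in practice is exactly the open question.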

Your thoughts?

Last edited by Linux_Kidd; 01-17-2012 at 02:11 PM.
 
Old 01-17-2012, 03:59 PM   #2
kbp (Senior Member)
I'm guessing this is rainbow tables? You don't really care about searching by the string, so it would be better to have the flat file sorted by the hash and then do a binary search, no?
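A minimal sketch of the sorted-file plus binary-search idea (a toy in-memory Python version; a real rainbow-table file would be sorted on disk and probed with seeks, but the probe count is the same):

```python
import bisect
import hashlib

# Toy table of (hash, string) pairs, sorted by hash as suggested above.
strings = ["a", "b", "n", "abc", "test"]
table = sorted((hashlib.md5(s.encode()).hexdigest(), s) for s in strings)
hashes = [h for h, _ in table]

def lookup(md5_hex):
    """Binary-search the sorted hash column: about log2(N)+1 probes
    instead of a linear scan; return the matching string or None."""
    i = bisect.bisect_left(hashes, md5_hex.lower())
    if i < len(hashes) and hashes[i] == md5_hex.lower():
        return table[i][1]
    return None

print(lookup("7B8B965AD4BCA0E41AB51DE7B31363A1"))  # n
```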
 
Old 01-17-2012, 08:33 PM   #3
Linux_Kidd (Original Poster)
Quote:
Originally Posted by kbp View Post
I'm guessing this is rainbow tables? You don't really care about searching by the string, so it would be better to have the flat file sorted by the hash and then do a binary search, no?
Ah, the log2(N)+1 algorithm. That seems to be a better way than my linear approach.

How would I sort a bunch of 32-byte hex strings? Is the standard sort good enough?

To sort, I first need to generate the complete flat file, and it's going to be big: on the order of 256^64 lines. What system can handle sorting a file that has roughly 1.3407 * 10^154 lines??

The first set of strings is 1 byte with 256 variations, plus the 32-byte hash per string:
Code:
file size (bytes) = Σ[N=1 to 64] (N*256^N + 32*256^N + 256^N)
where the last term is for the field separator.

1. Disk size is a problem. Does Google even have this much space?
2. At 10k hashes/sec it would take about 1.34 * 10^150 seconds to generate the file. That's about 4.25 * 10^142 years!!!! I would need to bump that to billions of trillions of hashes/sec to make it feasible. So not only do I need the disk space, I would also need mack-daddy processing power that not even Google has.
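Those scale estimates are easy to sanity-check with Python's arbitrary-precision integers (a quick back-of-envelope script, not part of any pipeline):

```python
# Back-of-envelope check of the row count and generation time above.
rows = sum(256 ** n for n in range(1, 65))  # all strings of length 1..64
size = sum(n * 256**n + 32 * 256**n + 256**n for n in range(1, 65))
print(f"rows  ~ {float(rows):.4e}")         # ~1.35e154
print(f"bytes ~ {float(size):.4e}")
seconds = rows // 10_000                    # at 10k hashes/sec
years = seconds // (365 * 24 * 3600)
print(f"years ~ {float(years):.3e}")        # ~4.27e142
```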

The task at hand is not feasible for me. Where does the NSA keep this table??

Last edited by Linux_Kidd; 01-17-2012 at 08:56 PM.
 
Old 01-17-2012, 09:56 PM   #4
kbp (Senior Member)
There are several apps designed to use GPUs for heavy math grinding; from memory, RainbowCrack allowed generating multiple files of fixed sizes, so you could go down a similar path.

The new push for making passwords longer rather than more complex is one of the reasons for the decline of rainbow tables. If my password is 'I love a good bacon and egg breakfast on Sundays!!', having a precomputed hash is out of reach for most people due to the massive storage requirements of the db.
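The length-versus-complexity point can be made concrete with a quick keyspace comparison (illustrative numbers only: 95 printable ASCII characters including space for the "complex" set, 27 characters for lowercase plus space):

```python
# Keyspace of a short complex password vs a long simple passphrase.
short_complex = 95 ** 8   # 8 chars drawn from all printable ASCII
long_simple = 27 ** 50    # 50 chars drawn from lowercase + space
print(f"short complex: {short_complex:.3e}")  # ~6.6e15
print(f"long simple:   {long_simple:.3e}")    # ~3.7e71
```

Even with a far smaller alphabet, the long passphrase's keyspace dwarfs the short password's, which is why precomputation stops being practical.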
 
Old 01-18-2012, 07:52 AM   #5
Linux_Kidd (Original Poster)
All 256 byte values at up to 64 bytes wide will capture every phrase (in that character set) of 64 bytes or less. I don't believe passphrases are a manageable solution, though, as they will burden systems with the password-reset process, which equates to lost productivity.

I could reduce the character set to only the characters used most often from the keyboard, namely [A-Za-z0-9`~-_=+[{]};:'",<.>/?!@#$%^&*()], but I have used ALT codes before to get characters outside the standard keyboard (most people don't even know what ALT codes produce). Reducing the character set from 256 to 94 changes the db size significantly.

Code:
still a big file
file size (bytes) = Σ[N=1 to 64] (N*94^N + 32*94^N + 94^N)
If I can sort the file, then the log2(N)+1 method makes the most sense. If it has to be a linear search, would my method be better or faster than grep?
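For a sense of how much the 94-character restriction helps (and how much it doesn't), the row counts for the two character sets can be compared directly:

```python
# Row counts for charset sizes 256 vs 94, string lengths 1..64.
rows_256 = sum(256 ** n for n in range(1, 65))
rows_94 = sum(94 ** n for n in range(1, 65))
print(f"256-charset rows ~ {float(rows_256):.3e}")  # ~1.35e154
print(f"94-charset rows  ~ {float(rows_94):.3e}")   # ~1.9e126
print(f"reduction factor ~ {float(rows_256) / float(rows_94):.1e}")
```

So the smaller charset cuts roughly 28 orders of magnitude off the table, but ~10^126 rows is still hopeless.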
 
Old 01-18-2012, 03:41 PM   #6
kbp (Senior Member)
I'm not sure; I think the bottleneck would be the actual storage speed. Maybe:

* change to a filesystem that gives the best performance for large files (XFS? needs research)
* increase the number of spindles backing the volume
* multithreading: several search threads, giving each thread a range to search, like 512MB blocks

Rethinking: I can't see that your method would be any faster, sorry.
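The multithreaded-ranges idea could be sketched like this (illustrative Python using threads and a tiny temp file; function names are mine, and the block size is scaled way down from the suggested 512MB):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def scan_range(path, start, end, needle):
    """Scan lines whose first byte lies in [start, end) for needle."""
    hits = []
    with open(path, "rb") as f:
        if start:
            f.seek(start - 1)
            if f.read(1) != b"\n":  # landed mid-line: skip the partial line
                f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if needle in line:
                hits.append(line.rstrip(b"\n"))
    return hits

def parallel_search(path, needle, workers=4):
    """Split the file into byte ranges and scan them concurrently."""
    size = os.path.getsize(path)
    step = max(1, size // workers)
    ranges = [(s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(workers) as pool:
        futures = [pool.submit(scan_range, path, s, e, needle)
                   for s, e in ranges]
    return [hit for f in futures for hit in f.result()]

# tiny demo db in string|hash format
with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".db") as tmp:
    tmp.write(b"a|0CC175B9C0F1B6A831C399E269772661\n")
    tmp.write(b"n|7B8B965AD4BCA0E41AB51DE7B31363A1\n")
print(parallel_search(tmp.name, b"7B8B965AD4BCA0E41AB51DE7B31363A1"))
```

With one spinning disk this mostly just reorders seeks, which is kbp's point: more spindles (or SSDs) are what make the parallelism pay.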

Last edited by kbp; 01-18-2012 at 03:42 PM.
 
Old 01-22-2012, 10:37 PM   #7
Linux_Kidd (Original Poster)
I am thinking I'll write some hashing code and test it out on an Amazon EC2 GPU cluster instance. Spend a few $$ and see how far it gets.
 
Old 01-23-2012, 06:04 AM   #8
kbp (Senior Member)
Cool, let us know
 
Old 01-23-2012, 11:40 AM   #9
Linux_Kidd (Original Poster)
Quote:
Originally Posted by kbp View Post
Cool, let us know
Sounds like a worthwhile experiment.

OK, EC2 is the easy part. Does anyone have some code they would like to share (C source or shell script) to generate the MD5 of all strings 1-16 bytes wide using the character set
Code:
[A-Za-z0-9`~-_=+[{]};:'",<.>/?!@#$%^&*()]
I don't currently know whether EC2 offers options for filesystem type and block size, but we can assume the output file should not hit the filesystem's maximum file size. Not sure if I can get ReiserFS 3.6 (1EB max) or maybe ext3 with 8k blocks (64TB max).

The file should be written as "string,hash" >> file. We know the total size of the file(s) will be:
Code:
file size (bytes) = Σ[N=1 to 16] (N*94^N + 32*94^N + 94^N)
= a big file
Youch: EC2 only offers 1.69TB of local instance storage.

Guessing we also need a timer in the code so we know how long it takes to generate the file(s); perhaps the last entry made into the file is the total time.
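Since no one has posted code yet, here is a hypothetical Python sketch of the generator, including the timer-as-last-entry idea (demonstrated only up to length 2; the full 1-16 byte run is exactly the infeasible part discussed above):

```python
import hashlib
import io
import time
from itertools import product

# The 94 printable ASCII characters (space excluded), per the charset above.
CHARSET = [chr(c) for c in range(33, 127)]

def gen_hashes(max_len, out):
    """Write one 'string,md5hex' line for every string of length 1..max_len.
    Row count is sum(94**n), so this is only feasible for tiny max_len."""
    for n in range(1, max_len + 1):
        for combo in product(CHARSET, repeat=n):
            s = "".join(combo)
            out.write(f"{s},{hashlib.md5(s.encode()).hexdigest()}\n")

buf = io.StringIO()  # stand-in for the output file
t0 = time.time()
gen_hashes(2, buf)                                 # 94 + 94**2 = 8930 rows
buf.write(f"total_seconds,{time.time() - t0}\n")   # timer as the last entry
print(buf.getvalue().count("\n"))                  # 8931 lines incl. timer
```

A real run would write to disk (or pipe straight into sort) rather than a StringIO, and a GPU version would need something like CUDA rather than Python, but the enumeration structure is the same.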

Last edited by Linux_Kidd; 01-23-2012 at 01:20 PM.
 
  

