Old 09-03-2008, 07:39 PM   #1
true_atlantis
Member
 
Registered: Oct 2003
Distribution: fedora core 5 x86_64
Posts: 639

Rep: Reputation: 30
multimap with > 1 million entries = problems?


I am planning on writing an application that will have to update 'records' hundreds of times a second; the records need to be accessed by a key. Any sort of DB will be too slow, so I was thinking that an in-memory std::multimap would work (written out to a file every X time period in case of crashes).

The key will be a set of hex numbers (similar to a MAC address), and the value will be an object that contains 2 timestamps (int), a char*, and a state (one of 3 values).
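Roughly what I have in mind, as a minimal sketch (the field names, types, and the example key are just illustrative):

Code:
#include <ctime>
#include <map>
#include <string>

// One of 3 possible states (names are made up).
enum State { ACTIVE, PENDING, DENIED };

// Record: 2 timestamps, a payload pointer, and a state.
struct Record {
    time_t created;
    time_t updated;
    char  *data;    // payload owned elsewhere
    State  state;
};

// Key is the MAC-like hex identifier, stored here as a string.
typedef std::multimap<std::string, Record> RecordMap;

int main() {
    RecordMap records;

    Record r = { time(0), time(0), 0, ACTIVE };
    records.insert(std::make_pair(std::string("001a2b3c4d5e"), r));

    // Look up and update one record stored under a key.
    RecordMap::iterator it = records.find("001a2b3c4d5e");
    if (it != records.end()) {
        it->second.updated = time(0);
        it->second.state   = DENIED;
    }
    return 0;
}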

-Are there any known issues when you get a std::multimap up into the millions of items?
-Does anyone else have any other suggestion on how to manage this amount of data?

Thanks.
 
Old 09-03-2008, 10:29 PM   #2
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
IIRC, in the FLOSS (#35) interview with Brian Aker of Drizzle, he mentioned working on a memory caching server as well as Drizzle. It is used in a server cluster environment.
 
Old 09-03-2008, 10:40 PM   #3
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141
Quote:
Originally Posted by true_atlantis View Post
I am planning on writing an application that will have to update 'records' hundreds of times a second; the records need to be accessed by a key. Any sort of DB will be too slow, so I was thinking that an in-memory std::multimap would work (written out to a file every X time period in case of crashes).

The key will be a set of hex numbers (similar to a MAC address), and the value will be an object that contains 2 timestamps (int), a char*, and a state (one of 3 values).

-Are there any known issues when you get a std::multimap up into the millions of items?
-Does anyone else have any other suggestion on how to manage this amount of data?

Thanks.
It doesn't seem to be an inordinate amount of data for modern computers. Here are some thoughts:

The key issue is your access key. Is the key space fully populated? If so, consider an index instead of a key. If not, consider hashing on the first x number of digits in the key or some repeating characteristic of your key. Are you sure that a regular DBMS would be too slow? Are your updates really going to be randomly distributed? Would it be better to just use a file and let Linux's file and memory caching systems do all the dirty work while you just access on record numbers?
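For example, a rough sketch of bucketing on the first few hex digits of the key (bucket count and value type are purely illustrative; keys here have the ':' separators stripped):

Code:
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Bucket records by the first 4 hex digits of the key (65536 buckets),
// then search only within the much smaller bucket.
static const size_t PREFIX_DIGITS = 4;

size_t bucket_of(const std::string &hexKey) {
    // The leading hex digits give a bucket index in [0, 65535].
    return strtoul(hexKey.substr(0, PREFIX_DIGITS).c_str(), 0, 16);
}

int main() {
    std::vector< std::multimap<std::string, int> > buckets(1 << (4 * PREFIX_DIGITS));

    std::string key = "1a2b3c4d5e6f";
    buckets[bucket_of(key)].insert(std::make_pair(key, 42));  // int stands in for the record

    // Lookups only search the one small bucket for the key.
    std::multimap<std::string, int> &b = buckets[bucket_of(key)];
    std::multimap<std::string, int>::iterator it = b.find(key);
    if (it != b.end()) it->second = 43;
    return 0;
}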
 
Old 09-03-2008, 11:02 PM   #4
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
It is memcached that was mentioned in the FLOSS episode.
 
Old 09-04-2008, 09:56 AM   #5
true_atlantis
Member
 
Registered: Oct 2003
Distribution: fedora core 5 x86_64
Posts: 639

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by Quakeboy02 View Post
It doesn't seem to be an inordinate amount of data for modern computers. Here are some thoughts:

The key issue is your access key. Is the key space fully populated? If so, consider an index instead of a key. If not, consider hashing on the first x number of digits in the key or some repeating characteristic of your key. Are you sure that a regular DBMS would be too slow? Are your updates really going to be randomly distributed? Would it be better to just use a file and let Linux's file and memory caching systems do all the dirty work while you just access on record numbers?
The key space is not fully populated, and the keys will be distributed randomly.

The other thought was using the filesystem; the only downfall would be read/write time. Do you have any experience with, or knowledge of, how fast that can be done?
 
Old 09-04-2008, 08:48 PM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,360

Rep: Reputation: 2751
I'd say use a hash in C (or C++, Java, Perl etc).
Possibly SQLite.
 
Old 09-04-2008, 08:59 PM   #7
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141
Quote:
Originally Posted by true_atlantis View Post
The key space is not fully populated, and the keys will be distributed randomly.

The other thought was using the filesystem; the only downfall would be read/write time. Do you have any experience with, or knowledge of, how fast that can be done?
I can't give you a direct answer. But, given 1 million records and say 64 bytes per record, that's only 64 megabytes. You won't need anything special for the whole thing to be cached in memory. It wouldn't matter how fast your filesystem is if the whole thing is in memory, but of course the OS will work away at updating the file on disk. Sixty-four megabytes isn't really that big a file. When you close the file, the OS will write everything back to disk. It would be very easy for you to write a test case program for this, which would update the file using records selected by a random number generator.
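A minimal sketch of that sort of test case (record size, file name, and counts are made up):

Code:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <sys/time.h>

// Overwrite randomly chosen fixed-size records in place and time the run.
static const long   NUM_RECORDS = 1000000;   // ~64 MB file
static const size_t RECSIZE     = 64;
static const int    UPDATES     = 100000;

int main() {
    FILE *f = fopen("records.dat", "r+b");
    if (!f) {
        // First run: create the file and pre-size it to NUM_RECORDS records.
        f = fopen("records.dat", "w+b");
        if (!f) { perror("fopen"); return 1; }
        char zero[RECSIZE] = {0};
        for (long i = 0; i < NUM_RECORDS; ++i)
            fwrite(zero, RECSIZE, 1, f);
    }

    srand(time(0));
    char rec[RECSIZE];

    timeval t0, t1;
    gettimeofday(&t0, 0);
    for (int i = 0; i < UPDATES; ++i) {
        long n = rand() % NUM_RECORDS;          // random record number
        fseek(f, n * (long)RECSIZE, SEEK_SET);
        fread(rec, RECSIZE, 1, f);              // read...
        rec[0] ^= 1;                            // ...modify...
        fseek(f, n * (long)RECSIZE, SEEK_SET);
        fwrite(rec, RECSIZE, 1, f);             // ...write back
    }
    gettimeofday(&t1, 0);
    fclose(f);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d read/modify/write cycles in %.3f s\n", UPDATES, secs);
    return 0;
}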

All this begs the question of what happens in a power outage or some other catastrophe, of course. I still think you should use a DBMS, if nothing else than for the ability to do fallbacks and restarts. A test case using a DBMS wouldn't be all that difficult to write, either.
 
Old 09-04-2008, 09:25 PM   #8
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by true_atlantis View Post
The other thought was using the filesystem; the only downfall would be read/write time. Do you have any experience with, or knowledge of, how fast that can be done?
I think std::multimap uses a binary search to find elements and you might lose that ability with a file, though you could mmap a file (you'd still need an index.)

What type of key will you use? If you have a continuous interval of keys or one with few holes you could just create an array of the range of indexes (recommend mmap for this) and use the index to dereference with []. For example, if you have integer keys ranging from 0 to 1,000,000, you can create an array of 1,000,000 elements and access e.g. the element for key 10 with [10]. A presized std::vector would work, too.
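A minimal sketch of the mmap variant, assuming a dense integer key range (the file name, record layout, and sizes are illustrative):

Code:
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// One fixed-size slot per possible key; dereference by key with [].
struct Record {
    int  t_created;
    int  t_updated;
    char state;          // one of 3 values
    char payload[55];    // pad the record to 64 bytes
};

static const size_t NUM_KEYS = 1000000;

int main() {
    int fd = open("records.mmap", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = NUM_KEYS * sizeof(Record);
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }  // pre-size the backing file

    Record *recs = (Record *)mmap(0, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (recs == MAP_FAILED) { perror("mmap"); return 1; }

    // Update the record for key 10 directly in the mapped array.
    recs[10].t_updated = 12345;
    recs[10].state     = 2;

    msync(recs, len, MS_ASYNC);                 // schedule a flush to disk
    munmap(recs, len);
    close(fd);
    return 0;
}

Holes in the key space just become never-touched pages of a sparse file, so they cost almost nothing.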

Short of the above, you'll still have to have some sort of search involved in every access operation.
ta0kira

Last edited by ta0kira; 09-04-2008 at 09:30 PM.
 
Old 09-04-2008, 10:34 PM   #9
true_atlantis
Member
 
Registered: Oct 2003
Distribution: fedora core 5 x86_64
Posts: 639

Original Poster
Rep: Reputation: 30
I was thinking of using directories and files, i.e. using the filesystem itself:

root/<key1>/<obj1>
root/<key1>/<obj2>
root/<key2>/<obj1>
etc.

I did a test case, and 1000 reads followed by 1000 write/mkdir operations took 0.35 sec.
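Something along these lines, as a rough sketch (paths and counts are illustrative, not the exact test):

Code:
#include <cstdio>
#include <fstream>
#include <sys/stat.h>
#include <sys/time.h>

// Time 1000 mkdir+write operations under a root directory.
int main() {
    mkdir("root", 0755);

    timeval t0, t1;
    gettimeofday(&t0, 0);
    for (int i = 0; i < 1000; ++i) {
        char dir[64], path[96];
        snprintf(dir,  sizeof dir,  "root/key%04d", i);
        snprintf(path, sizeof path, "%s/obj1", dir);
        mkdir(dir, 0755);                  // one directory per key
        std::ofstream out(path);           // one small file per object
        out << "t1 t2 state\n";
    }
    gettimeofday(&t1, 0);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("1000 mkdir+write in %.3f s\n", secs);
    return 0;
}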

This will be fast enough for my purposes, and it will be indexable by taking advantage of the directory structure. I will also have the ability to get the current state at any time t.

A database is the ideal solution, but read/write will take too long. My experience has been with MySQL/Postgres, and I have access to an Oracle license. How is SQLite in terms of read/write speed?

Essentially it comes down to whether I can do ~1000 reads/writes per second consecutively... I don't think MySQL will perform, but I have not done any testing on a database.
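If I do end up benchmarking SQLite, a minimal sketch would be something like this (schema and counts are made up; it needs -lsqlite3, and run it under time(1)):

Code:
#include <cstdio>
#include <sqlite3.h>

// Perform 10,000 keyed upserts inside a single transaction.
int main() {
    sqlite3 *db;
    if (sqlite3_open("bench.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS rec("
        "  key TEXT PRIMARY KEY, t1 INTEGER, t2 INTEGER, state INTEGER);",
        0, 0, 0);

    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO rec(key, t1, t2, state) VALUES(?, ?, ?, ?);",
        -1, &stmt, 0);

    sqlite3_exec(db, "BEGIN;", 0, 0, 0);
    for (int i = 0; i < 10000; ++i) {
        char key[32];
        snprintf(key, sizeof key, "%012x", i);
        sqlite3_bind_text(stmt, 1, key, -1, SQLITE_TRANSIENT);
        sqlite3_bind_int(stmt, 2, i);
        sqlite3_bind_int(stmt, 3, i);
        sqlite3_bind_int(stmt, 4, i % 3);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_exec(db, "COMMIT;", 0, 0, 0);

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}

Batching everything in one transaction (versus one transaction per statement) makes an enormous difference to the numbers.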
 
Old 09-05-2008, 12:10 AM   #10
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141
Quote:
Originally Posted by true_atlantis View Post
I was thinking of using directories and files, i.e. using the filesystem itself:

root/<key1>/<obj1>
root/<key1>/<obj2>
root/<key2>/<obj1>
etc.

I did a test case, and 1000 reads followed by 1000 write/mkdir operations took 0.35 sec.

This will be fast enough for my purposes, and it will be indexable by taking advantage of the directory structure. I will also have the ability to get the current state at any time t.
Looks like a mess to me, TBH. You're talking about opening a million files to do it this way. The system overhead for this won't be pretty, and you'll lose much of what you hope to gain from it. To be blunt, I'd hate to be on a team constrained to this design. You're only talking about a few tens or maybe a hundred megabytes on the disk. The OS will cache that without even breaking a sweat.

Quote:
A database is the ideal solution, but read/write will take too long. My experience has been with MySQL/Postgres, and I have access to an Oracle license. How is SQLite in terms of read/write speed?

Essentially it comes down to whether I can do ~1000 reads/writes per second consecutively... I don't think MySQL will perform, but I have not done any testing on a database.
You should do so. At this point you seem to be prejudiced against the proper solution. Have you considered the possibility of using parallel processing (threading) in both the application and in the database engine? Multiple processors and lots of memory would make this application loaf along. Or are you limited to hardware on hand?

If you have to write your own rollback processor, you won't be happy, or is this a case where you're collecting sensor data and you don't mind having to restart a run? If it is, then why not just record the sensor run and process it at your leisure? When I worked for the airlines we collected reservation data for one of our customers throughout the day, and did a file transfer in the wee hours of the night. I wrote a processing application capable of restarts that had the data ready for them in Oracle when they got to work in the morning. Rollback/restart capability was a godsend.

Thus far, you haven't even considered the idea of using a tree structure within a filesystem, or at least not here. There are numerous possibilities to consider for logical data storage, even after you make the decision as to how to physically store it on disk.

Last edited by Quakeboy02; 09-05-2008 at 12:17 AM.
 
Old 09-05-2008, 12:11 AM   #11
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,360

Rep: Reputation: 2751
You may also run out of inodes that way...
Benchmark a DB before you say it can't be done.
 
Old 09-05-2008, 06:54 AM   #12
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by true_atlantis View Post
I was thinking of using directories and files, i.e. using the filesystem itself:

root/<key1>/<obj1>
root/<key1>/<obj2>
root/<key2>/<obj1>
etc.

I did a test case, and 1000 reads followed by 1000 write/mkdir operations took 0.35 sec.

This will be fast enough for my purposes, and it will be indexable by taking advantage of the directory structure. I will also have the ability to get the current state at any time t.

A database is the ideal solution, but read/write will take too long. My experience has been with MySQL/Postgres, and I have access to an Oracle license. How is SQLite in terms of read/write speed?

Essentially it comes down to whether I can do ~1000 reads/writes per second consecutively... I don't think MySQL will perform, but I have not done any testing on a database.
Try mounting with "sync" and see if it still works; the writes have to catch up eventually. That sort of seems like a DB structure, anyway.
ta0kira
 
Old 09-11-2008, 10:11 AM   #13
true_atlantis
Member
 
Registered: Oct 2003
Distribution: fedora core 5 x86_64
Posts: 639

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by Quakeboy02 View Post
Looks like a mess to me, TBH. You're talking about opening a million files to do it this way. The system overhead for this won't be pretty, and you'll lose much of what you hope to gain from it. To be blunt, I'd hate to be on a team constrained to this design. You're only talking about a few tens or maybe a hundred megabytes on the disk. The OS will cache that without even breaking a sweat.



You should do so. At this point you seem to be prejudiced against the proper solution. Have you considered the possibility of using parallel processing (threading) in both the application and in the database engine? Multiple processors and lots of memory would make this application loaf along. Or are you limited to hardware on hand?

If you have to write your own rollback processor, you won't be happy, or is this a case where you're collecting sensor data and you don't mind having to restart a run? If it is, then why not just record the sensor run and process it at your leisure? When I worked for the airlines we collected reservation data for one of our customers throughout the day, and did a file transfer in the wee hours of the night. I wrote a processing application capable of restarts that had the data ready for them in Oracle when they got to work in the morning. Rollback/restart capability was a godsend.

Thus far, you haven't even considered the idea of using a tree structure within a filesystem, or at least not here. There are numerous possibilities to consider for logical data storage, even after you make the decision as to how to physically store it on disk.

I am definitely going to be threading this application. One of the ideas was to keep the whole data structure in memory, and have one thread periodically write it out to a DB/File/etc.
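A minimal sketch of that pattern (C++11 threads here as an illustration, so it needs -std=c++11 -pthread; the value type and the flush interval are placeholders):

Code:
#include <atomic>
#include <chrono>
#include <fstream>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Shared in-memory store; a background thread snapshots it to disk periodically.
std::mutex                 mtx;
std::map<std::string, int> store;            // int stands in for the real record
std::atomic<bool>          running(true);

void flusher() {
    while (running) {
        std::this_thread::sleep_for(std::chrono::seconds(2));

        std::map<std::string, int> snapshot;
        {
            std::lock_guard<std::mutex> lock(mtx);   // copy under the lock...
            snapshot = store;
        }                                            // ...then write without holding it
        std::ofstream out("store.dump", std::ios::trunc);
        for (std::map<std::string, int>::const_iterator it = snapshot.begin();
             it != snapshot.end(); ++it)
            out << it->first << ' ' << it->second << '\n';
    }
}

int main() {
    std::thread writer(flusher);

    {   // worker threads update the map under the same lock
        std::lock_guard<std::mutex> lock(mtx);
        store["001a2b3c4d5e"] = 42;
    }

    std::this_thread::sleep_for(std::chrono::seconds(5));
    running = false;
    writer.join();
    return 0;
}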

Seeing that this may be more like 10k updates/sec and more like 9 million records of data, I may be SOL no matter what I do.

I cannot process the data after the fact, because the whole process is detecting whether or not a subscriber should have service and, if they shouldn't, denying them...

Ugh - this is starting to become a difficult problem. Thanks.
 
Old 09-11-2008, 10:57 AM   #14
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141
Quote:
Originally Posted by true_atlantis View Post
I am definitely going to be threading this application. One of the ideas was to keep the whole data structure in memory, and have one thread periodically write it out to a DB/File/etc.

Seeing that this may be more like 10k updates/sec and more like 9 million records of data, I may be SOL no matter what I do.

I cannot process the data after the fact, because the whole process is detecting whether or not a subscriber should have service and, if they shouldn't, denying them...

Ugh - this is starting to become a difficult problem. Thanks.
Have you tried a test case of just opening a file and reading/writing to it?
 
Old 09-11-2008, 11:23 PM   #15
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,665
Blog Entries: 4

Rep: Reputation: 3945
You definitely cannot use the filesystem for this... not "millions of files."

Now... is there room for a "trick?" If you've got hundreds of updates coming in every second, how-often and how-rapidly do you actually need to use the updated information? And, how often is the same key updated more-than-once in a short time? Is the next query of a key's value likely to refer to a key that had recently been updated?

This is exactly the sort of "trickery" that makes mechanisms like (say...) "virtual memory" work so very well: the seemingly-impossible requirement "to allow many programs to each believe that they possess a private-memory space far in excess of that of the real hardware" can be met, thanks to a few simple "tricks."

Speaking of memory... the recognition that "it is 'only' 64 megabytes" might be both prescient and perfectly-valid if we can (in essence...) lock that memory in such a way that a page-fault is "very unlikely." If we can arrange things so that any reference to this area is unlikely to result in an expensive page-fault, then, yes, "memory might be the perfect tool."
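A minimal sketch of pinning such a region with mlock() (the 64 MB figure comes from the discussion above; locking that much memory typically needs a raised "ulimit -l" or root):

Code:
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>

// Allocate the record area and lock it into RAM so access never page-faults to disk.
int main() {
    const size_t len = 64UL * 1024 * 1024;     // ~64 MB of records
    void *area = malloc(len);
    if (!area) { perror("malloc"); return 1; }
    memset(area, 0, len);                      // touch the pages

    if (mlock(area, len) != 0) {               // may require CAP_IPC_LOCK / ulimit -l
        perror("mlock");
        return 1;
    }

    // ... build and update the in-memory structure inside 'area' ...

    munlock(area, len);
    free(area);
    return 0;
}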
 
  

