Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
I am planning on writing an application that will have to update 'records' hundreds of times a second, accessed by a key. Any sort of DB will be too slow; I was thinking that an in-memory std::multimap would work (written to a file every X time period in case of crashes).
The key to this will be a set of hex numbers (similar to a MAC address); the value will be an object that contains two timestamps (int), a char*, and a state (one of three values).
-Are there any known issues when you get a std::multimap up into the millions of items?
-Does anyone else have any other suggestion on how to manage this amount of data?
IIRC, in the FLOSS (#35) interview with Brian Aker of Drizzle, he mentioned working on a memory caching server as well as drizzle. It is used in a server cluster environment.
It doesn't seem to be an inordinate amount of data for modern computers. Here are some thoughts:
The key issue is your access key. Is the key space fully populated? If so, consider an index instead of a key. If not, consider hashing on the first x number of digits in the key or some repeating characteristic of your key. Are you sure that a regular DBMS would be too slow? Are your updates really going to be randomly distributed? Would it be better to just use a file and let Linux's file and memory caching systems do all the dirty work while you just access on record numbers?
The key space is not fully populated, and the keys will be distributed randomly.
The other thought was using a filesystem; the only downfall would be read/write time. Do you have any experience with, or knowledge of, how fast that can be done?
I can't give you a direct answer. But, given 1 million records and say 64 bytes per record, that's only 64 megabytes. You won't need anything special for the whole thing to be cached in memory. It wouldn't matter how fast your filesystem is if the whole thing is in memory, but of course the OS will work away at updating the file on disk. Sixty-four megabytes isn't really that big a file. When you close the file, the OS will write everything back to disk. It would be very easy for you to write a test case program for this, which would update the file using records selected by a random number generator.
All this begs the question of what happens in a power outage or some other catastrophe, of course. I still think you should use a DBMS, if nothing else than for the ability to do fallbacks and restarts. A test case using a DBMS wouldn't be all that difficult to write, either.
I think std::multimap uses a binary search to find elements and you might lose that ability with a file, though you could mmap a file (you'd still need an index.)
What type of key will you use? If you have a continuous interval of keys or one with few holes you could just create an array of the range of indexes (recommend mmap for this) and use the index to dereference with []. For example, if you have integer keys ranging from 0 to 1,000,000, you can create an array of 1,000,000 elements and access e.g. the element for key 10 with [10]. A presized std::vector would work, too.
Short of the above, you'll still have to have some sort of search involved in every access operation.
ta0kira
I was thinking of using directories and files, taking advantage of the filesystem:
root/<key1>/<obj1>
root/<key1>/<obj2>
root/<key2>/<obj1>
etc.
I did a test case and 1000 reads, followed by 1000 write/mkdir took 0.35sec.
This will be fast enough for my purposes, and the data will be indexable by taking advantage of the directory structure. I will also have the current state at any time t.
A database would be the ideal solution, but read/write will take too long. My experience has been with MySQL/Postgres, and I have access to an Oracle license. How is SQLite in terms of read/write speed?
Essentially it comes down to whether I can do ~1000 reads/writes per second consecutively... I don't think MySQL will perform, but I have not done any testing on a database.
Looks like a mess to me, TBH. You're talking about opening a million files to do it this way. The system overhead for this won't be pretty, and you'll lose much of what you hope to gain from it. To be blunt, I'd hate to be on a team constrained to this design. You're only talking about a few tens or maybe a hundred megabytes on the disk. The OS will cache that without even breaking a sweat.
You should do that testing. At this point you seem to be prejudiced against the proper solution. Have you considered the possibility of using parallel processing (threading) in both the application and in the database engine? Multiple processors and lots of memory would make this application loaf along. Or are you limited to hardware on hand?
If you have to write your own rollback processor, you won't be happy, or is this a case where you're collecting sensor data and you don't mind having to restart a run? If it is, then why not just record the sensor run and process it at your leisure? When I worked for the airlines we collected reservation data for one of our customers throughout the day, and did a file transfer in the wee hours of the night. I wrote a processing application capable of restarts that had the data ready for them in Oracle when they got to work in the morning. Rollback/restart capability was a godsend.
Thus far, you haven't even considered the idea of using a tree structure within a filesystem, or at least not here. There are numerous possibilities to consider for logical data storage, even after you make the decision as to how to physically store it on disk.
Last edited by Quakeboy02; 09-05-2008 at 12:17 AM.
Try mounting with "sync" and see if it is still fast enough; the writes have to catch up eventually. That sort of seems like a DB structure, anyway.
ta0kira
I am definitely going to be threading this application. One of the ideas was to keep the whole data structure in memory, and have one thread periodically write it out to a DB/File/etc.
Seeing that this may be more like 10k updates/sec against more like a 9-million-record data set, I may be SOL no matter what I do.
I cannot process the data after the fact, because the whole process is detecting whether or not a subscriber should have service and, if they shouldn't, denying them...
Ugh - this is starting to become a difficult problem. Thanks.
Have you tried a test case of just opening a file and read/writing to it?
You definitely cannot use the filesystem for this... not "millions of files."
Now... is there room for a "trick?" If you've got hundreds of updates coming in every second, how-often and how-rapidly do you actually need to use the updated information? And, how often is the same key updated more-than-once in a short time? Is the next query of a key's value likely to refer to a key that had recently been updated?
This is exactly the sort of "trickery" that makes mechanisms like (say...) "virtual memory" work so very well: the seemingly-impossible requirement "to allow many programs to each believe that they possess a private-memory space far in excess of that of the real hardware" can be met, thanks to a few simple "tricks."
Speaking of memory... the recognition that "it is 'only' 64 megabytes" might be both prescient and perfectly-valid if we can (in essence...) lock that memory in such a way that a page-fault is "very unlikely." If we can arrange things so that any reference to this area is unlikely to result in an expensive page-fault, then, yes, "memory might be the perfect tool."