Old 09-03-2008, 07:39 PM   #1
true_atlantis
Member
 
Registered: Oct 2003
Distribution: fedora core 5 x86_64
Posts: 639

Rep: Reputation: 30
multimap with > 1 million entries = problems?


I am planning on writing an application that will have to update 'records' hundreds of times a second; the records need to be accessed by a key. Any sort of DB will be too slow, so I was thinking that an in-memory std::multimap would work (written out to a file every X time period in case of crashes).

The key will be a set of hex numbers (similar to a MAC address), and the value will be an object that contains 2 timestamps (int), a char*, and a state (one of 3 values).
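Roughly what I have in mind, as a minimal sketch (the field names, types, and the example key are just illustrative):

Code:
#include <ctime>
#include <map>
#include <string>

// One of 3 possible states (names are made up).
enum State { ACTIVE, PENDING, DENIED };

// Record: 2 timestamps, a payload pointer, and a state.
struct Record {
    time_t created;
    time_t updated;
    char  *data;    // payload owned elsewhere
    State  state;
};

// Key is the MAC-like hex identifier, stored here as a string.
typedef std::multimap<std::string, Record> RecordMap;

int main() {
    RecordMap records;

    Record r = { time(0), time(0), 0, ACTIVE };
    records.insert(std::make_pair(std::string("001a2b3c4d5e"), r));

    // Look up and update one record stored under a key.
    RecordMap::iterator it = records.find("001a2b3c4d5e");
    if (it != records.end()) {
        it->second.updated = time(0);
        it->second.state   = DENIED;
    }
    return 0;
}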

-Are there any known issues when you get a std::multimap up into the millions of items?
-Does anyone else have any other suggestion on how to manage this amount of data?

Thanks.
 
Old 09-03-2008, 10:29 PM   #2
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
IIRC, in the FLOSS (#35) interview with Brian Aker of Drizzle, he mentioned working on a memory caching server as well as Drizzle. It is used in a server cluster environment.
 
Old 09-03-2008, 10:40 PM   #3
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141
Quote:
Originally Posted by true_atlantis View Post
I am planning on writing an application that will have to update 'records' hundreds of times a second; the records need to be accessed by a key. Any sort of DB will be too slow, so I was thinking that an in-memory std::multimap would work (written out to a file every X time period in case of crashes).

The key will be a set of hex numbers (similar to a MAC address), and the value will be an object that contains 2 timestamps (int), a char*, and a state (one of 3 values).

-Are there any known issues when you get a std::multimap up into the millions of items?
-Does anyone else have any other suggestion on how to manage this amount of data?

Thanks.
It doesn't seem to be an inordinate amount of data for modern computers. Here are some thoughts:

The key issue is your access key. Is the key space fully populated? If so, consider an index instead of a key. If not, consider hashing on the first x number of digits in the key or some repeating characteristic of your key. Are you sure that a regular DBMS would be too slow? Are your updates really going to be randomly distributed? Would it be better to just use a file and let Linux's file and memory caching systems do all the dirty work while you just access on record numbers?
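For example, a rough sketch of bucketing on the first few hex digits of the key (bucket count and value type are purely illustrative; keys here have the ':' separators stripped):

Code:
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Bucket records by the first 4 hex digits of the key (65536 buckets),
// then search only within the much smaller bucket.
static const size_t PREFIX_DIGITS = 4;

size_t bucket_of(const std::string &hexKey) {
    // The leading hex digits give a bucket index in [0, 65535].
    return strtoul(hexKey.substr(0, PREFIX_DIGITS).c_str(), 0, 16);
}

int main() {
    std::vector< std::multimap<std::string, int> > buckets(1 << (4 * PREFIX_DIGITS));

    std::string key = "1a2b3c4d5e6f";
    buckets[bucket_of(key)].insert(std::make_pair(key, 42));  // int stands in for the record

    // Lookups only search the one small bucket for the key.
    std::multimap<std::string, int> &b = buckets[bucket_of(key)];
    std::multimap<std::string, int>::iterator it = b.find(key);
    if (it != b.end()) it->second = 43;
    return 0;
}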
 
Old 09-03-2008, 11:02 PM   #4
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
It is memcached that was mentioned in the FLOSS episode.
 
Old 09-04-2008, 09:56 AM   #5
true_atlantis
Member
 
Registered: Oct 2003
Distribution: fedora core 5 x86_64
Posts: 639

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by Quakeboy02 View Post
It doesn't seem to be an inordinate amount of data for modern computers. Here are some thoughts:

The key issue is your access key. Is the key space fully populated? If so, consider an index instead of a key. If not, consider hashing on the first x number of digits in the key or some repeating characteristic of your key. Are you sure that a regular DBMS would be too slow? Are your updates really going to be randomly distributed? Would it be better to just use a file and let Linux's file and memory caching systems do all the dirty work while you just access on record numbers?
The key space is not fully populated, and the keys will be distributed randomly.

The other thought was using the filesystem; the only downfall would be read/write time. Do you have any experience with, or knowledge of, how fast that can be done?
 
Old 09-04-2008, 08:48 PM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,360

Rep: Reputation: 2751
I'd say use a hash in C (or C++, Java, Perl etc).
Possibly SQLite.
 
Old 09-04-2008, 08:59 PM   #7
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141
Quote:
Originally Posted by true_atlantis View Post
The key space is not fully populated, and the keys will be distributed randomly.

The other thought was using the filesystem; the only downfall would be read/write time. Do you have any experience with, or knowledge of, how fast that can be done?
I can't give you a direct answer. But, given 1 million records and say 64 bytes per record, that's only 64 megabytes. You won't need anything special for the whole thing to be cached in memory. It wouldn't matter how fast your filesystem is if the whole thing is in memory, but of course the OS will work away at updating the file on disk. Sixty-four megabytes isn't really that big a file. When you close the file, the OS will write everything back to disk. It would be very easy for you to write a test case program for this, which would update the file using records selected by a random number generator.
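A minimal sketch of that sort of test case (record size, file name, and counts are made up):

Code:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <sys/time.h>

// Overwrite randomly chosen fixed-size records in place and time the run.
static const long   NUM_RECORDS = 1000000;   // ~64 MB file
static const size_t RECSIZE     = 64;
static const int    UPDATES     = 100000;

int main() {
    FILE *f = fopen("records.dat", "r+b");
    if (!f) {
        // First run: create the file and pre-size it to NUM_RECORDS records.
        f = fopen("records.dat", "w+b");
        if (!f) { perror("fopen"); return 1; }
        char zero[RECSIZE] = {0};
        for (long i = 0; i < NUM_RECORDS; ++i)
            fwrite(zero, RECSIZE, 1, f);
    }

    srand(time(0));
    char rec[RECSIZE];

    timeval t0, t1;
    gettimeofday(&t0, 0);
    for (int i = 0; i < UPDATES; ++i) {
        long n = rand() % NUM_RECORDS;          // random record number
        fseek(f, n * (long)RECSIZE, SEEK_SET);
        fread(rec, RECSIZE, 1, f);              // read...
        rec[0] ^= 1;                            // ...modify...
        fseek(f, n * (long)RECSIZE, SEEK_SET);
        fwrite(rec, RECSIZE, 1, f);             // ...write back
    }
    gettimeofday(&t1, 0);
    fclose(f);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d read/modify/write cycles in %.3f s\n", UPDATES, secs);
    return 0;
}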

All this begs the question of what happens in a power outage or some other catastrophe, of course. I still think you should use a DBMS, if nothing else than for the ability to do fallbacks and restarts. A test case using a DBMS wouldn't be all that difficult to write, either.
 
Old 09-04-2008, 09:25 PM   #8
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by true_atlantis View Post
The other thought was using the filesystem; the only downfall would be read/write time. Do you have any experience with, or knowledge of, how fast that can be done?
I think std::multimap uses a binary search to find elements and you might lose that ability with a file, though you could mmap a file (you'd still need an index.)

What type of key will you use? If you have a continuous interval of keys or one with few holes you could just create an array of the range of indexes (recommend mmap for this) and use the index to dereference with []. For example, if you have integer keys ranging from 0 to 1,000,000, you can create an array of 1,000,000 elements and access e.g. the element for key 10 with [10]. A presized std::vector would work, too.
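A minimal sketch of the mmap variant, assuming a dense integer key range (the file name, record layout, and sizes are illustrative):

Code:
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// One fixed-size slot per possible key; dereference by key with [].
struct Record {
    int  t_created;
    int  t_updated;
    char state;          // one of 3 values
    char payload[55];    // pad the record to 64 bytes
};

static const size_t NUM_KEYS = 1000000;

int main() {
    int fd = open("records.mmap", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = NUM_KEYS * sizeof(Record);
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }  // pre-size the backing file

    Record *recs = (Record *)mmap(0, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (recs == MAP_FAILED) { perror("mmap"); return 1; }

    // Update the record for key 10 directly in the mapped array.
    recs[10].t_updated = 12345;
    recs[10].state     = 2;

    msync(recs, len, MS_ASYNC);                 // schedule a flush to disk
    munmap(recs, len);
    close(fd);
    return 0;
}

Holes in the key space just become never-touched pages of a sparse file, so they cost almost nothing.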

Short of the above, you'll still have to have some sort of search involved in every access operation.
ta0kira

Last edited by ta0kira; 09-04-2008 at 09:30 PM.
 
Old 09-04-2008, 10:34 PM   #9
true_atlantis
Member
 
Registered: Oct 2003
Distribution: fedora core 5 x86_64
Posts: 639

Original Poster
Rep: Reputation: 30
I was thinking of using directories and files, i.e. using the filesystem itself:

root/<key1>/<obj1>
root/<key1>/<obj2>
root/<key2>/<obj1>
etc.

I did a test case, and 1000 reads followed by 1000 write/mkdir operations took 0.35 sec.
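Something along these lines, as a rough sketch (paths and counts are illustrative, not the exact test):

Code:
#include <cstdio>
#include <fstream>
#include <sys/stat.h>
#include <sys/time.h>

// Time 1000 mkdir+write operations under a root directory.
int main() {
    mkdir("root", 0755);

    timeval t0, t1;
    gettimeofday(&t0, 0);
    for (int i = 0; i < 1000; ++i) {
        char dir[64], path[96];
        snprintf(dir,  sizeof dir,  "root/key%04d", i);
        snprintf(path, sizeof path, "%s/obj1", dir);
        mkdir(dir, 0755);                  // one directory per key
        std::ofstream out(path);           // one small file per object
        out << "t1 t2 state\n";
    }
    gettimeofday(&t1, 0);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("1000 mkdir+write in %.3f s\n", secs);
    return 0;
}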

This will be fast enough for my purposes, and it will be indexable by taking advantage of the directory structure. I will also have the ability to get the current state at any time t.

A database is the ideal solution, but read/write will take too long. My experience has been with MySQL/Postgres, and I have access to an Oracle license. How is SQLite in terms of read/write speed?

Essentially it comes down to whether I can do ~1000 reads/writes per second consecutively... I don't think MySQL will perform, but I have not done any testing on a database.
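If I do end up benchmarking SQLite, a minimal sketch would be something like this (schema and counts are made up; it needs -lsqlite3, and run it under time(1)):

Code:
#include <cstdio>
#include <sqlite3.h>

// Perform 10,000 keyed upserts inside a single transaction.
int main() {
    sqlite3 *db;
    if (sqlite3_open("bench.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS rec("
        "  key TEXT PRIMARY KEY, t1 INTEGER, t2 INTEGER, state INTEGER);",
        0, 0, 0);

    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO rec(key, t1, t2, state) VALUES(?, ?, ?, ?);",
        -1, &stmt, 0);

    sqlite3_exec(db, "BEGIN;", 0, 0, 0);
    for (int i = 0; i < 10000; ++i) {
        char key[32];
        snprintf(key, sizeof key, "%012x", i);
        sqlite3_bind_text(stmt, 1, key, -1, SQLITE_TRANSIENT);
        sqlite3_bind_int(stmt, 2, i);
        sqlite3_bind_int(stmt, 3, i);
        sqlite3_bind_int(stmt, 4, i % 3);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_exec(db, "COMMIT;", 0, 0, 0);

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}

Batching everything in one transaction (versus one transaction per statement) makes an enormous difference to the numbers.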
 
Old 09-05-2008, 12:10 AM   #10
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141
Quote:
Originally Posted by true_atlantis View Post
I was thinking of using directories and files, i.e. using the filesystem itself:

root/<key1>/<obj1>
root/<key1>/<obj2>
root/<key2>/<obj1>
etc.

I did a test case, and 1000 reads followed by 1000 write/mkdir operations took 0.35 sec.

This will be fast enough for my purposes, and it will be indexable by taking advantage of the directory structure. I will also have the ability to get the current state at any time t.
Looks like a mess to me, TBH. You're talking about opening a million files to do it this way. The system overhead for this won't be pretty, and you'll lose much of what you hope to gain from it. To be blunt, I'd hate to be on a team constrained to this design. You're only talking about a few tens or maybe a hundred megabytes on the disk. The OS will cache that without even breaking a sweat.

Quote:
A database is the ideal solution, but read/write will take too long. My experience has been with MySQL/Postgres, and I have access to an Oracle license. How is SQLite in terms of read/write speed?

Essentially it comes down to whether I can do ~1000 reads/writes per second consecutively... I don't think MySQL will perform, but I have not done any testing on a database.
You should do so. At this point you seem to be prejudiced against the proper solution. Have you considered the possibility of using parallel processing (threading) in both the application and in the database engine? Multiple processors and lots of memory would make this application loaf along. Or are you limited to hardware on hand?

If you have to write your own rollback processor, you won't be happy, or is this a case where you're collecting sensor data and you don't mind having to restart a run? If it is, then why not just record the sensor run and process it at your leisure? When I worked for the airlines we collected reservation data for one of our customers throughout the day, and did a file transfer in the wee hours of the night. I wrote a processing application capable of restarts that had the data ready for them in Oracle when they got to work in the morning. Rollback/restart capability was a godsend.

Thus far, you haven't even considered the idea of using a tree structure within a filesystem, or at least not here. There are numerous possibilities to consider for logical data storage, even after you make the decision as to how to physically store it on disk.

Last edited by Quakeboy02; 09-05-2008 at 12:17 AM.
 
Old 09-05-2008, 12:11 AM   #11
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,360

Rep: Reputation: 2751
You may also run out of inodes that way...
Benchmark a DB before you say it can't be done.
 
Old 09-05-2008, 06:54 AM   #12
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by true_atlantis View Post
I was thinking of using directories and files, i.e. using the filesystem itself:

root/<key1>/<obj1>
root/<key1>/<obj2>
root/<key2>/<obj1>
etc.

I did a test case, and 1000 reads followed by 1000 write/mkdir operations took 0.35 sec.

This will be fast enough for my purposes, and it will be indexable by taking advantage of the directory structure. I will also have the ability to get the current state at any time t.

A database is the ideal solution, but read/write will take too long. My experience has been with MySQL/Postgres, and I have access to an Oracle license. How is SQLite in terms of read/write speed?

Essentially it comes down to whether I can do ~1000 reads/writes per second consecutively... I don't think MySQL will perform, but I have not done any testing on a database.
Try mounting with "sync" and see if it still works; the writes have to catch up eventually. That sort of seems like a DB structure, anyway.
ta0kira
 
Old 09-11-2008, 10:11 AM   #13
true_atlantis
Member
 
Registered: Oct 2003
Distribution: fedora core 5 x86_64
Posts: 639

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by Quakeboy02 View Post
Looks like a mess to me, TBH. You're talking about opening a million files to do it this way. The system overhead for this won't be pretty, and you'll lose much of what you hope to gain from it. To be blunt, I'd hate to be on a team constrained to this design. You're only talking about a few tens or maybe a hundred megabytes on the disk. The OS will cache that without even breaking a sweat.



You should do so. At this point you seem to be prejudiced against the proper solution. Have you considered the possibility of using parallel processing (threading) in both the application and in the database engine? Multiple processors and lots of memory would make this application loaf along. Or are you limited to hardware on hand?

If you have to write your own rollback processor, you won't be happy, or is this a case where you're collecting sensor data and you don't mind having to restart a run? If it is, then why not just record the sensor run and process it at your leisure? When I worked for the airlines we collected reservation data for one of our customers throughout the day, and did a file transfer in the wee hours of the night. I wrote a processing application capable of restarts that had the data ready for them in Oracle when they got to work in the morning. Rollback/restart capability was a godsend.

Thus far, you haven't even considered the idea of using a tree structure within a filesystem, or at least not here. There are numerous possibilities to consider for logical data storage, even after you make the decision as to how to physically store it on disk.

I am definitely going to be threading this application. One of the ideas was to keep the whole data structure in memory, and have one thread periodically write it out to a DB/File/etc.
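A minimal sketch of that pattern (C++11 threads here as an illustration, so it needs -std=c++11 -pthread; the value type and the flush interval are placeholders):

Code:
#include <atomic>
#include <chrono>
#include <fstream>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Shared in-memory store; a background thread snapshots it to disk periodically.
std::mutex                 mtx;
std::map<std::string, int> store;            // int stands in for the real record
std::atomic<bool>          running(true);

void flusher() {
    while (running) {
        std::this_thread::sleep_for(std::chrono::seconds(2));

        std::map<std::string, int> snapshot;
        {
            std::lock_guard<std::mutex> lock(mtx);   // copy under the lock...
            snapshot = store;
        }                                            // ...then write without holding it
        std::ofstream out("store.dump", std::ios::trunc);
        for (std::map<std::string, int>::const_iterator it = snapshot.begin();
             it != snapshot.end(); ++it)
            out << it->first << ' ' << it->second << '\n';
    }
}

int main() {
    std::thread writer(flusher);

    {   // worker threads update the map under the same lock
        std::lock_guard<std::mutex> lock(mtx);
        store["001a2b3c4d5e"] = 42;
    }

    std::this_thread::sleep_for(std::chrono::seconds(5));
    running = false;
    writer.join();
    return 0;
}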

Seeing that this may be more like 10k updates/sec and more like 9 million records of data, I may be SOL no matter what I do.

I cannot process the data after the fact, because the whole process is detecting whether or not a subscriber should have service and, if they shouldn't, denying them...

Ugh - this is starting to become a difficult problem. Thanks.
 
Old 09-11-2008, 10:57 AM   #14
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141
Quote:
Originally Posted by true_atlantis View Post
I am definitely going to be threading this application. One of the ideas was to keep the whole data structure in memory, and have one thread periodically write it out to a DB/File/etc.

Seeing that this may be more like 10k updates/sec and more like 9 million records of data, I may be SOL no matter what I do.

I cannot process the data after the fact, because the whole process is detecting whether or not a subscriber should have service and, if they shouldn't, denying them...

Ugh - this is starting to become a difficult problem. Thanks.
Have you tried a test case of just opening a file and reading/writing to it?
 
Old 09-11-2008, 11:23 PM   #15
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,665
Blog Entries: 4

Rep: Reputation: 3945
You definitely cannot use the filesystem for this... not "millions of files."

Now... is there room for a "trick?" If you've got hundreds of updates coming in every second, how-often and how-rapidly do you actually need to use the updated information? And, how often is the same key updated more-than-once in a short time? Is the next query of a key's value likely to refer to a key that had recently been updated?

This is exactly the sort of "trickery" that makes mechanisms like (say...) "virtual memory" work so very well: the seemingly-impossible requirement "to allow many programs to each believe that they possess a private-memory space far in excess of that of the real hardware" can be met, thanks to a few simple "tricks."

Speaking of memory... the recognition that "it is 'only' 64 megabytes" might be both prescient and perfectly-valid if we can (in essence...) lock that memory in such a way that a page-fault is "very unlikely." If we can arrange things so that any reference to this area is unlikely to result in an expensive page-fault, then, yes, "memory might be the perfect tool."
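A minimal sketch of pinning such a region with mlock() (the 64 MB figure comes from the discussion above; locking that much memory typically needs a raised "ulimit -l" or root):

Code:
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>

// Allocate the record area and lock it into RAM so access never page-faults to disk.
int main() {
    const size_t len = 64UL * 1024 * 1024;     // ~64 MB of records
    void *area = malloc(len);
    if (!area) { perror("malloc"); return 1; }
    memset(area, 0, len);                      // touch the pages

    if (mlock(area, len) != 0) {               // may require CAP_IPC_LOCK / ulimit -l
        perror("mlock");
        return 1;
    }

    // ... build and update the in-memory structure inside 'area' ...

    munlock(area, len);
    free(area);
    return 0;
}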
 
  

