Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I need to load thousands of records from a text file into memory. Each record consists of two fields: one is a number and the other is a string.
Eg:
8914,BHR
9918,ORS
Which is the best data structure to hold this data? The operations performed on this data would be:
1) Exhaustive Searching
2) Limited addition, addition of 5-10 records a day.
Also, what will be the best technique to insert a new record dynamically in this scenario without restarting the program? Currently I am using IPC (Linux Message Queues) and Signals to notify the C binary that new records are available.
What type of searching are you doing? I suggest a balanced binary tree sorted based on either of the columns, depending on which is more important. Alternatively, you could represent each as a pair (e.g. struct pair { int val; char name[32]; };) and have two balanced binary trees, each sorted based on one of the columns, but holding a pointer to a struct pair (both trees representing a different ordering of the same data.) This is really only helpful if sorting order will help with your search, though.
Kevin Barry
I need to load thousands of records in a text file into memory.
Thousands is actually pretty trivial for a computer. So it might not matter much how you store the data.
Quote:
1) Exhaustive Searching
Searching for which field? Or is it both? Searching for exact match of a field? Or for partial match?
For exact match searching, a hash is best.
If ordering matters or for partial matches on the beginning of a field, a tree structure such as red black or AVL is likely best.
To search on either key, you might want a hash or tree for each.
For more complicated fast partial match searching you might want a more complicated index.
Quote:
Originally Posted by meenakshi.khurana
2) Limited addition, addition of 5-10 records a day.
Also, what will be the best technique to insert new record dynamically
Why do you think a good technique for that matters? 5 to 10 insertions per day into a data structure of thousands! If you needed to totally rebuild a complicated index across ten thousand records ten times a day, that still shouldn't be a performance problem worth worrying about.
So if "Exhaustive Searching" was really true, that could justify some serious performance consideration. But I think you might be talking about a problem so small that performance doesn't matter.
Maybe you should consider the use of an RDBMS such as SQLite, which will do efficient searching for you and allow you to focus on the business logic of your application rather than the minutiae of small data structures. This can potentially add value to your data by making it accessible through other applications, and provide relatively easy methods for complex search criteria. Creation, retrieval, update and deletion (CRUD) of data comes pretty much for free with an SQL database tool.
What you want to minimize, in this "in memory" (sic...) data structure, is exactly one thing: virtual memory page-faults. Don't try to "save memory." Minimize the number of storage accesses required to locate a particular value.
But the previous suggestion, to throw the whole thing into an SQLite database file, is an excellent and superior suggestion.
Thousands is actually pretty trivial for a computer. So it might not matter much how you store the data.
I know thousands is pretty trivial. That's why I have been using arrays for this so far, with no performance issues. But I want to optimize the code.
Quote:
Originally Posted by johnsfine
Searching for which field? Or is it both? Searching for exact match of a field? Or for partial match?
I am searching on the first field and retrieving the corresponding second field. The search should be an exact match.
Quote:
Originally Posted by johnsfine
Why do you think a good technique for that matters? 5 to 10 insertions per day into a data structure of thousands! If you needed to totally rebuild a complicated index across ten thousand records ten times a day, that still shouldn't be a performance problem worth worrying about.
So if "Exhaustive Searching" was really true, that could justify some serious performance consideration. But I think you might be talking about a problem so small that performance doesn't matter.
I asked for a good technique to inform a running C binary about the availability of new records, not for inserting new records. As I said earlier, I am currently using signals and message queues for this. So I just want to confirm whether this method is OK or whether something better can be used.
And yes, you are right, this problem is not that big. Currently I have thousands of records, and this may extend to lakhs in time. So, should I continue using arrays, or should I use a dynamic data structure like a hash? Please suggest.
Hmmm, well according to https://secure.wikimedia.org/wikipedia/en/wiki/Lakh, a lakh = 100,000 in Western notation, so for data as per the OP, that should all fit in memory; they're not very big fields.
Assuming that the 1st field is ALWAYS unique, a hash would be good for fast searching as described; OTOH, SQLite would be easier to interface with generally.
If it's always just a few new recs per day, then your current method of notification would work, but again, using SQL would enable you to use other tools and easily export the data as well.