any ideas to improve search performance in a tree structure

George2 · 06-08-2006, 12:50 AM

Hello everyone,

I am developing an application which contains a tree structure. The tree structure reflects the management structure of a big company -- child node represents the employees managed by direct manager (parent node), one parent node may have multiple (various) number of child nodes (employees he/she managed in his/her department).

I need to find all the employees a manager managed in the tree structure, the direct child node, the child node's child node, ... and so on (for example, if he is a senior manager, there are multiple levels of child nodes).

Currently, I am using brute force tree traverse approach to print all (direct and indirect) child nodes -- layer by layer. The tree is very large. I am wondering whether there are any good ideas about how to improve the search performance?

thanks in advance,
George

paulsm4 · 06-08-2006, 01:03 AM

There's no reason you can't have multiple data structures referring to the same underlying data.

Let's say, for example, that all of your employee records were organized in an array (yes, I know they're not - but please bear with me). You simply put the first record into "a[0]", the second into "a[1]", and so on. In this example, you could still have your tree (to describe the organization's structure) ... but now the tree would contain array indexes (0, 1, ...) instead of the data records themselves.

Now generalize one step further. It doesn't really matter how the records are stored. Nor does it necessarily matter how you access them: traverse a tree, loop through an array, iterate a linked list ... whatever. You can build as many different data structures around your data as you have ways of looking up that data. The only thing that REALLY matters is that you make sure all of your "indexes" and "pointers" remain in synch with the underlying data.

Having said that:
1. Yes, having a tree is definitely the best way to represent your management structure.

2. Using a hash, however, would be a much quicker way to look up a specific name.

3. You might consider using both a tree and a hash, and writing a "batch mode" script
that periodically regenerates the hash (say, once every hour) to accomodate any new
records that might have been added, or old records that might have been deleted.

4. Generalizing further, you might create different subtrees for each department, and have
an array containing the root node for each subtree corresponding to each department.

5. The possibilities are really limitless.

Just a thought .. PSM

PS:
Both are fairly difficult reading, but I'd highly recommend either Sedgewick and/or Knuth for their classic books on algorithms and data structures!

chrism01 · 06-08-2006, 02:30 AM

Sounds like a parent-child classic arrangement in SQL ...
Make the RDBMS do all the heavy lifting ie data retrieval.
Just define the tables and their foreign keys.

George2 · 06-08-2006, 03:14 AM

Hi chrism01,

Quote:

Originally Posted by chrism01

Sounds like a parent-child classic arrangement in SQL ...
Make the RDBMS do all the heavy lifting ie data retrieval.
Just define the tables and their foreign keys.

The data are stored in a database. But searching database in slower than search in memory. So, I read all the data from database to memory and then search them in memory. Any ideas to improve it?

regards,
George

George2 · 06-08-2006, 03:26 AM

Thank you paulsm4,

Seems you have a lot of good points. :-)

Quote:

Originally Posted by paulsm4

2. Using a hash, however, would be a much quicker way to look up a specific name.

If there are thousands of nodes, how to find a good hash function to reduce conflicts? Any comments?

Quote:

Originally Posted by paulsm4

3. You might consider using both a tree and a hash, and writing a "batch mode" script
that periodically regenerates the hash (say, once every hour) to accomodate any new
records that might have been added, or old records that might have been deleted.

I can not imagine how to use a tree and a hash together. But it seems that in this way, we can combine both the advantages of tree and hash table. So, could you talk a little more about this ideas?

Quote:

Originally Posted by paulsm4

4. Generalizing further, you might create different subtrees for each department, and have
an array containing the root node for each subtree corresponding to each department.

Do you mean setup different trees with different root node? If it is, I do not think it can significantly improve the performance. Could you point out why this approach can improve performance?

regards,
George

graemef · 06-08-2006, 07:44 AM

Looking at you specific question, "How can you improve the speed of finding all employees a manager is responsible for, including different levels?" Well if memory is not a problem then you can adjust you tree as follows:

I am assuming that your tree is as follows:

Code:

ManagerNode
{
   struct ManagerDetails
   vector of SubordinateNodes
}

The tree nodes consist of the personal details plus a list of the subordinate nodes, this is probably a vector so that it can grow efficiently. The SubordinateNode is of course just a ManagerNode but using different names is helpful here. To traverse the tree all the nodes in the SubordinateNode vector will need to be visited.

Now the problem is that to find all the subordinates, including nested levels takes more time. By developing a slight modification to the structure you can easily achieve this:

Code:

ManagerNode
{
   struct ManagerDetails
   vector of SubordinateNodes
   vector of AllSubordinateNodes
}

I'd add a second vector which essentially includes all the subordinates at all levels, infact there is no need to keep the immediate subordinates in that vector, because you have them in the first vector. This makes it really easy to find out the immediate subordinates plus all the subordinates.

The real problem is in setting this structure up. One way to do this would be when adding an entry to the SubordinateNodes also add it to every supervisors AllSubordinateNodes, to do this you need to be aware of who the Supervisor is so it is necessary to add a upward traveral in the tree, it will make you tree like a double linked list:

Code:

ManagerNode
{
   struct ManagerDetails
   vector of SubordinateNodes
   vector of AllSubordinateNodes
   pointer to SupervisorNode
}

ps. It probably goes without saying but: all references to *Node or *Nodes wil be to the ManagerNode structure, thus SupervisorNode is a ManagerNode, SubordianteNodes are ManagerNodes.

George2 · 06-08-2006, 10:27 PM

Thank you graemef!

Quote:

Originally Posted by graemef

The real problem is in setting this structure up. One way to do this would be when adding an entry to the SubordinateNodes also add it to every supervisors AllSubordinateNodes, to do this you need to be aware of who the Supervisor is so it is necessary to add a upward traveral in the tree, it will make you tree like a double linked list:

Code:

ManagerNode
{
   struct ManagerDetails
   vector of SubordinateNodes
   vector of AllSubordinateNodes
   pointer to SupervisorNode
}

ps. It probably goes without saying but: all references to *Node or *Nodes wil be to the ManagerNode structure, thus SupervisorNode is a ManagerNode, SubordianteNodes are ManagerNodes.

You have presented a good idea. But I am still confused about how to add the reference of current node to all its upper level manager nodes. For example, for a low level employee, we need to add the reference node to all his/her upper level managers, so I think we need to store a pointer to the parent node to the current node in order to traverse upward, right? What I am confused is about whether the pointer "SupervisorNode" means the pointer to parent node in your points?

regards,
George

graemef · 06-08-2006, 10:38 PM

Yes I was thinking that the supervisorNode would be the parent.

So you would have a loop that would visit the supervisorNode add the subordinate to the allSubordinate vector and then repeat until the supervisor Node was null (presumably being the god node herself.)

George2 · 06-09-2006, 01:14 AM

Thank you graemef! I understand your idea now -- we are using different terminology before.

Quote:

Originally Posted by graemef

Yes I was thinking that the supervisorNode would be the parent.

So you would have a loop that would visit the supervisorNode add the subordinate to the allSubordinate vector and then repeat until the supervisor Node was null (presumably being the god node herself.)

Anyway, do you think using hash table together with the tree is better, as suggested by paulsm4 in the second post of this thread?

regards,
George

primo · 06-09-2006, 04:49 AM

There are natural trees implemented in filesystems and they most likely have efficient lookup mechanisms. You just name each directory after every individual and files inside would contain data about each one. There's no need at all to "move" nodes as you can easily have the same guy in more than one place. Also, you may use hard links on the files contained in these directories and benefit from the inode abstraction that substracts one everytime a file is unlinked that gets deleted when it reaches zero. Then making a backup of this filesystem database is as easy as doing a tar.bz2. It's mostly portable to any platform provided you use ASCII or some sort of portable format instead of floating-point or integers values.

P.S: This database may even be encrypted transparently with encrypted filesystems. Locking mechanisms may be easier too.