Iv been working on this, and here is what I have so far. Does the theory sound good? I have gone over this multiple times, and each time I found small variations that improved on efficiency a little. It has to search through the block cache often so the disk will have a definite impact on its speed, but I figure by grouping blocks by a 32-bit hash will reduce the number of blocks that must be searched to find a match.
Block Based Data Reduction
Definitions:
Accelerator - A host running the acceleration software.
Accelerator ID - The numerically highest IP address of the host running the acceleration software.
Worker - A thread that is responsible for processing the packets of a particular TCP session. This thread does all the optimizations on the application data in the TCP segment.
Worker Queue - A linked list containing the IP packets of the TCP sessions assigned to a particular worker.
Fetcher - The thread that receives IP packets from the kernel netfilter queue, and places them into the worker queues. Its also responsible for tracking new sessions, and determining what worker the new session will be assigned to.
Block - A 64,128, or 256 byte unique pattern of 0's and 1's. Have not determined what size will be used, or if it will be configurable.
Block Cache - The collection of Blocks usable by the software separated into folders based on their 32-bit hash for faster searching/look-up.
Block Manager - The thread responsible for creating new blocks in the block cache.
Block Request - A copy of a block that does not exists sent from the Worker to the Block Manager.
Block Signature - A unique 40-bit ID used to represent a single block of data created by the Block Manager.
Communication Service - Client/Server communication mechanism between accelerators.
Description:
This system for block based data reduction is capable of a meshed or any-to-any multi-system environment. A single copy of each block can be used between any number of accelerators provided a local-to-remote mapping exists between the source, and destination accelerators. A unique block cache is not needed for each new accelerator in the network. The block cache is entirely managed by the block manager. No other process writes, or erases blocks from the block cache. Any other process only searches, and reads blocks in the block cache.
Request for Blocks Creation:
- When a packet enters the source accelerator to be accelerated, but the worker does not find a matching block in the current block cache. In this case the worker makes a copy of the block, and submits it to the block manager with an accelerator ID of "0" to add this block to the block cache.
- When a packet enters the destination accelerator to undo any acceleration, but the worker finds a block signature was not used. The worker makes a copy of this block, and sends it to the block manager with the source accelerator ID.
Block Manager:
This is the working thread that is responsible for determining if a new block can, or should be created from a block request. It processes block requests from a linked list queue that the workers submit their block requests to. The block manager checks the accelerator ID first. If its "0" this is just a request to create a new block. The block manager checks the available disk space of the block cache. If the free space is below the "FULL" threshold it will not create a signature. It frees this block request, and continues to the next block request. It also sets a flag for the block maintenance thread to run. If there is enough free space it first searches the block cache for an existing match. This is because multiple workers might have requested this block before it is actually created the block could also be requested in multiple packets from a single worker. If a match is found the block request is freed, and the block manager moves on to the next block request. If no match is found then the manager creates a unique signature for this block, saves the block into the block cache, and moves on to the next block request. If the accelerator ID was something other than "0" its a request to also create a source-to-destination mapping. The accelerator tries to locate a matching block in the block cache. If no match is found it will do the same checks to create a new one. If a match is found, or its able to create a new block signature the block is locked so it cannot be deleted, and the block signature is sent to the communication service with the source accelerator ID. Before sending the mapping request to the communication service it first searches the queue of current communication requests to see if a request for this block already exists. If there is it just continues on to the next block request, and does not create a duplicate mapping requests.
Mapping Blocks to Between Accelerators:
When a packet enters the destination accelerator to undo any acceleration. The worker finds a block signature was not used it would make a copy of the block then send the block, and the source accelerator ID to block manager as a block request. When the block manager receives this block request it would notice the source accelerator ID, and try to locate this block in the block cache. If a block was found it or created it would then place a lock on that block. Then the block manager would send the block signature to the communication service along with the source accelerator ID to request for a signature mapping. The communication service will then send the destination accelerator block signature, and the block to the source accelerator. The source accelerator will try to locate the block in its block cache. If a match is found a lock is placed on that block, and the source accelerator will respond with the source block signature, and the destination block signature to indicate that a block exists in the source accelerator, if no match is found it will respond with the signature block empty or 0. This notifies the destination accelerator that there was not a signature for the block in the source accelerator. The destination accelerator will receive the notice, and respond "OK". At that point both accelerators can release the lock on that particular block having made the source-to-destination mapping or not.
Sending Block Signatures:
When the source accelerator receives a packet to accelerate, and is able to find a matching block in the block cache it checks to see if a source-to-destination maps exists for that block to the destination accelerator. If a source to destination block signature map does exist then the local block signature is placed into the packet for that block.
When the destination accelerator receives a packet, and finds a signature was used it searches its source-to-destination table for the block signature, and accelerator ID. If a map is found it replaces the block signature with the data of that local block in the packet. If no map is found then the worker sends an emergency alert to the communication service that no source-to-destination map was found for that block signature. The source accelerator will then remove the source-to-destination map for that signature to that accelerator ID. The worker will continue doing this for all the block in the TCP segment, but if even one block signature mapping is missing it will drop the entire packet. TCP will retransmit the packet, and the source accelerator should now have removed the bad local-to-remote maps.