LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Block based data reduction (https://www.linuxquestions.org/questions/programming-9/block-based-data-reduction-804372/)

yaplej 04-26-2010 05:15 PM

Block based data reduction
 
Hey everyone,

I am working on a method to reduce the amount of data sent in a TCP segment, and wanted to talk about it to see how plausible it is. I know this isn't perfect, but I want to get something working.

My idea so far is to break the TCP data into 100-200 byte blocks. The software would search a local repository for a match to this block, and depending on a match or not perform the appropriate action.

For a match the software would remove the original block insert a single byte prefix of "2", and then add the matching block 40-bit signature.

If no match were found the single byte prefix would be set to "0". The software would then try to compress the data. If successful the "0" prefix would be changed to "1". Another header would be added with the length of the compressed data. This is so the data can be decompressed later.

Then the next block would be checked using the same method. I also thought it might be a good idea to use the inverse of each block in the local repository also. This would increase the number of usable blocks, and not increase storage usage. It could be easy to use a "3" in the prefix to indicate an inverse block. The same signature would be used.

I am by no means an expert at this stuff so I wanted to get some help to refine the idea. I still need to figure out how the hosts will create the signatures, and keep track of what blocks other hosts have in their local repository. There is also having to deal with purging old blocks, or how to handle an emergency purge should the repository volume fill past a thresh hold.

kbp 04-27-2010 09:53 AM

If your application sends the same or similar data often you may be better off just sending the changed blocks, performing a lookup/replace may not be as efficient. There are several products that do this already... are you attempting to create a commercial product or just doing this for fun ?

yaplej 04-27-2010 10:50 AM

This is for an open source network accelerator I have been working on. The traffic does not originate from my application, and is not destined to it either. It does sit in the traffic path thats how its able to modify the traffic. I think doing a lookup/replace is the only option. http://packetsqueezer.portal.codespaces.com/

yaplej 04-29-2010 03:01 PM

Iv been working on this, and here is what I have so far. Does the theory sound good? I have gone over this multiple times, and each time I found small variations that improved on efficiency a little. It has to search through the block cache often so the disk will have a definite impact on its speed, but I figure by grouping blocks by a 32-bit hash will reduce the number of blocks that must be searched to find a match.


Block Based Data Reduction

Definitions:
Accelerator - A host running the acceleration software.
Accelerator ID - The numerically highest IP address of the host running the acceleration software.
Worker - A thread that is responsible for processing the packets of a particular TCP session. This thread does all the optimizations on the application data in the TCP segment.
Worker Queue - A linked list containing the IP packets of the TCP sessions assigned to a particular worker.
Fetcher - The thread that receives IP packets from the kernel netfilter queue, and places them into the worker queues. Its also responsible for tracking new sessions, and determining what worker the new session will be assigned to.
Block - A 64,128, or 256 byte unique pattern of 0's and 1's. Have not determined what size will be used, or if it will be configurable.
Block Cache - The collection of Blocks usable by the software separated into folders based on their 32-bit hash for faster searching/look-up.
Block Manager - The thread responsible for creating new blocks in the block cache.
Block Request - A copy of a block that does not exists sent from the Worker to the Block Manager.
Block Signature - A unique 40-bit ID used to represent a single block of data created by the Block Manager.
Communication Service - Client/Server communication mechanism between accelerators.

Description:
This system for block based data reduction is capable of a meshed or any-to-any multi-system environment. A single copy of each block can be used between any number of accelerators provided a local-to-remote mapping exists between the source, and destination accelerators. A unique block cache is not needed for each new accelerator in the network. The block cache is entirely managed by the block manager. No other process writes, or erases blocks from the block cache. Any other process only searches, and reads blocks in the block cache.

Request for Blocks Creation:
  • When a packet enters the source accelerator to be accelerated, but the worker does not find a matching block in the current block cache. In this case the worker makes a copy of the block, and submits it to the block manager with an accelerator ID of "0" to add this block to the block cache.
  • When a packet enters the destination accelerator to undo any acceleration, but the worker finds a block signature was not used. The worker makes a copy of this block, and sends it to the block manager with the source accelerator ID.

Block Manager:
This is the working thread that is responsible for determining if a new block can, or should be created from a block request. It processes block requests from a linked list queue that the workers submit their block requests to. The block manager checks the accelerator ID first. If its "0" this is just a request to create a new block. The block manager checks the available disk space of the block cache. If the free space is below the "FULL" threshold it will not create a signature. It frees this block request, and continues to the next block request. It also sets a flag for the block maintenance thread to run. If there is enough free space it first searches the block cache for an existing match. This is because multiple workers might have requested this block before it is actually created the block could also be requested in multiple packets from a single worker. If a match is found the block request is freed, and the block manager moves on to the next block request. If no match is found then the manager creates a unique signature for this block, saves the block into the block cache, and moves on to the next block request. If the accelerator ID was something other than "0" its a request to also create a source-to-destination mapping. The accelerator tries to locate a matching block in the block cache. If no match is found it will do the same checks to create a new one. If a match is found, or its able to create a new block signature the block is locked so it cannot be deleted, and the block signature is sent to the communication service with the source accelerator ID. Before sending the mapping request to the communication service it first searches the queue of current communication requests to see if a request for this block already exists. If there is it just continues on to the next block request, and does not create a duplicate mapping requests.

Mapping Blocks to Between Accelerators:
When a packet enters the destination accelerator to undo any acceleration. The worker finds a block signature was not used it would make a copy of the block then send the block, and the source accelerator ID to block manager as a block request. When the block manager receives this block request it would notice the source accelerator ID, and try to locate this block in the block cache. If a block was found it or created it would then place a lock on that block. Then the block manager would send the block signature to the communication service along with the source accelerator ID to request for a signature mapping. The communication service will then send the destination accelerator block signature, and the block to the source accelerator. The source accelerator will try to locate the block in its block cache. If a match is found a lock is placed on that block, and the source accelerator will respond with the source block signature, and the destination block signature to indicate that a block exists in the source accelerator, if no match is found it will respond with the signature block empty or 0. This notifies the destination accelerator that there was not a signature for the block in the source accelerator. The destination accelerator will receive the notice, and respond "OK". At that point both accelerators can release the lock on that particular block having made the source-to-destination mapping or not.

Sending Block Signatures:
When the source accelerator receives a packet to accelerate, and is able to find a matching block in the block cache it checks to see if a source-to-destination maps exists for that block to the destination accelerator. If a source to destination block signature map does exist then the local block signature is placed into the packet for that block.

When the destination accelerator receives a packet, and finds a signature was used it searches its source-to-destination table for the block signature, and accelerator ID. If a map is found it replaces the block signature with the data of that local block in the packet. If no map is found then the worker sends an emergency alert to the communication service that no source-to-destination map was found for that block signature. The source accelerator will then remove the source-to-destination map for that signature to that accelerator ID. The worker will continue doing this for all the block in the TCP segment, but if even one block signature mapping is missing it will drop the entire packet. TCP will retransmit the packet, and the source accelerator should now have removed the bad local-to-remote maps.

kbp 04-29-2010 05:16 PM

I'm just wondering about your approach... maybe creating protocol specific compression algorithms would be a cleaner approach:

- check connection against cache / add connection details to cache
- read packet headers to determine upper layer protocols in use
- check for matches against algorithm list
- compress application data and repackage

yaplej 04-29-2010 05:33 PM

I already have a compress mechanism in the accelerator that compresses the application data in the TCP segment. This is another layer of acceleration I am working on. So it would apply multiple methods of acceleration in layers. First data reduction by swapping out blocks of data for signatures. Second compressing the entire data section of the TCP segment after the block based acceleration has been done. The remote/destination accelerator would then just undo everything in starting with compression first.

Later I will be adding application specific acceleration also such as SMB/CIFS acceleration by using differences when files are saved so less traffic is transmitted back over the slower WAN link. This feature will greatly help async network circuits with slower upload speeds. Also help reduce the amount of time to save a file after the user modifies it. This would work in conjuction with the other two features too. Getting ahead of myself though.


All times are GMT -5. The time now is 09:46 PM.