Splitting a real-time data stream into multiple files
Long story short, I have a system that acquires data from an acquisition board at a rate of approximately 2.7MB/s. The data rate itself is not the problem; the problem is that it's continuous, 24/7. This comes out to over 200GB per day.
My problem is that the data comes out in binary format from a closed-source program that prints to stdout. I need a way to split the constant data stream out to a new file every ~10 minutes to keep the file sizes manageable, without losing anything. I also need to be able to re-combine the data in some way for processing afterward. Typically, the two codes (acquisition and processing) are run as a pair, piping the output of acquisition straight into processing, ie:
Code:
./acquisition | ./process -i /dev/stdin
They can also be run separately, going through an intermediate file, ie:
Code:
./acquisition > outfile
Code:
./process -i outfile
On the processing front I'm envisioning some sort of timed cat with a fifo, ie:
Code:
mkfifo myfifo
I have a decent amount of experience in BASH scripting and FORTRAN programming, I'm just not sure what the best approach is for this situation.
hi,
i assume you know when to cut the file for processing, given it's a binary file. I think an approach, using text files as a sample, would be something like the perl script at the end of this post. Run the pair as:
Code:
./acquisition | ./splitter.pl
and you'll get a list of files like this:
Code:
outfile_0
outfile_1
...
ready.outfile_0
ready.outfile_1
...
Your process program would then need to check whenever a new ready.* file appears, process it, and remove the ready.* file. The variable $size in the script below will split every 10 lines, but its value is up to you depending on how much data you want to process at a time.
Code:
#!/usr/bin/perl
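# [Only the first line of the original script survived the archive copy.
#  The rest below is a minimal sketch reconstructed from the description
#  above: split stdin into outfile_N every $size lines, and drop a
#  ready.outfile_N marker once each chunk is complete.]
use strict;
use warnings;

my $size  = 10;    # lines per output file -- tune to taste
my $count = 0;     # lines written to the current chunk
my $n     = 0;     # current chunk index

open( my $out, '>', "outfile_$n" ) or die "outfile_$n: $!";

while ( my $line = <STDIN> ) {
    print {$out} $line;
    next unless ++$count >= $size;

    # chunk complete: close it and create its ready.* marker
    close $out;
    open( my $flag, '>', "ready.outfile_$n" ) or die $!;
    close $flag;

    # start the next chunk
    $n++;
    $count = 0;
    open( $out, '>', "outfile_$n" ) or die "outfile_$n: $!";
}

# mark the final (possibly short) chunk as ready too
close $out;
open( my $flag, '>', "ready.outfile_$n" ) or die $!;
close $flag;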
Thanks, this works great.
The only issue is that the processor usage is a bit high, since this is running on a small 2GHz Atom. The acquisition code alone is around 80% CPU, and this perl script is another 10-15%. Since I can barely tell my hand from my foot in perl, is there any way to reduce the overhead? Possibly by grabbing a block of lines at once? Ideally I'd like the splitting routine's overhead to be below 5%, since if the processor becomes saturated, it can start missing samples from the acquisition board.
Two thoughts come to mind.
1. There may be a way to combine tee and split (or some other similar utilities) to accomplish what you need. Though, to be honest, I cannot think of a combination using those two that doesn't have a problem.

2. A "simple" C program. In fact, I've already got something coded up, but I haven't tested it.

EDIT: Forgot to give a usage example earlier...
Code:
./acquisition | ./simple_c_program | ./process -i /dev/stdin

a. Save it as whatever.c
b. Build it with "gcc -o whatever whatever.c -lm"
c. Run it with "./whatever [PATTERN]" or place it somewhere in your PATH to get rid of the './'

Replace whatever with what you want the program to be named, and whatever.c with whatever you save the source code as.

PATTERN is any text to use as a filename. PATTERN can include one and only one instance of '%d'. That '%d' will have a numeric value substituted for it as the program runs. It starts at 0 and increments by one for each file split. The numeric value is also zero-padded and 5 digits wide. Need a bigger pad? Modify the "#define SUBSTITUTION_WIDTH" line to something greater. A default pattern of "acquisition-%d.dat" will be used if no PATTERN is supplied.

The numeric value will eventually cycle back to 0. You will need to take steps to move any files (to save them from being overwritten) if and when that happens.

The program attempts to read 4KiB chunks at a time. The files will break at 1.5GiB; the actual file could be up to 4KiB larger, depending on how the fread() calls return. Again, if you want bigger or smaller files, modify the "#define DEFAULT_BREAK_SIZE" line to whatever you prefer. The value is measured in bytes and probably shouldn't exceed 32 bits.

Most of the code deals with reading the filename pattern, checking the pattern for errors, and modifying the pattern (to provide the zero pad and fixed width). The actual read-write loop is relatively short and straightforward.

Lastly, I don't know how this will fare performance-wise, but aside from the 4KiB buffer, I don't think the overhead will be significant.
Code:
#include <math.h>
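/* [Only the first line of the original listing survived the archive copy.
   What follows is a minimal sketch reconstructed from the description
   above, with the two fixes discussed in the later posts folded in; the
   original's more thorough pattern-validation code is omitted.] */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SUBSTITUTION_WIDTH 5            /* zero-pad width for the %d          */
#define DEFAULT_BREAK_SIZE 1610612736UL /* 1.5GiB -- roughly 10 min at 2.7MB/s */
#define CHUNK_SIZE 4096                 /* read 4KiB at a time                */

int main( int argc, char *argv[] )
{
    const char *pattern = ( argc > 1 ) ? argv[1] : "acquisition-%d.dat";
    char paddedPattern[4096];
    char filename[4096];
    char buffer[CHUNK_SIZE];
    unsigned long byteCounter = 0;
    unsigned int filenameCounter = 0;
    size_t bytesRead;
    FILE *outputFile;

    /* Rewrite the pattern's single %d as a zero-padded, fixed-width %05d */
    const char *p = strstr( pattern, "%d" );
    if ( p == NULL || strstr( p + 2, "%d" ) != NULL ) {
        fprintf( stderr, "PATTERN must contain exactly one '%%d'\n" );
        return EXIT_FAILURE;
    }
    snprintf( paddedPattern, sizeof paddedPattern, "%.*s%%0%dd%s",
              (int)( p - pattern ), pattern, SUBSTITUTION_WIDTH, p + 2 );

    /* Open the first output file before entering the loop (bug fix #1) */
    snprintf( filename, sizeof filename, paddedPattern, filenameCounter );
    if ( ( outputFile = fopen( filename, "wb" ) ) == NULL )
        return EXIT_FAILURE;

    /* Plain assignment from fread(), not '+=' (bug fix #2) */
    while ( ( bytesRead = fread( buffer, sizeof( char ), CHUNK_SIZE, stdin ) ) > 0 ) {
        fwrite( buffer, sizeof( char ), bytesRead, outputFile );
        byteCounter += bytesRead;

        /* Break to a new file once the current one reaches the limit;  */
        /* the last fread may overshoot by up to CHUNK_SIZE bytes.      */
        if ( byteCounter >= DEFAULT_BREAK_SIZE ) {
            fclose( outputFile );
            filenameCounter++;  /* will eventually cycle back to 0 */
            snprintf( filename, sizeof filename, paddedPattern, filenameCounter );
            if ( ( outputFile = fopen( filename, "wb" ) ) == NULL )
                return EXIT_FAILURE;
            byteCounter = 0;
        }
    }

    fclose( outputFile );
    return EXIT_SUCCESS;
}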
Wow! Thanks
There were a few bugs with it at first, but I have it running now. I'm going to let it cycle through a few files to make sure the data can be recombined properly afterward. It's only using about 5% of the processor, less than half of what the perl script used.

If you're interested, the bugs that I found were:

1) outputFile is opened for writing (not append) on every iteration of the loop, so it just overwrites the data every time; at any given moment the file only contains the most recent 4096 bytes of data. I moved the block of code that opens outputFile inside the "byteCounter >= DEFAULT_BREAK_SIZE" if statement, after filenameCounter is incremented, and also copied it to before the while loop to initialize the first file.

2) bytesRead += fread( buffer, sizeof( char ), 4096, stdin ); should be bytesRead = fread( buffer, sizeof( char ), 4096, stdin ); otherwise bytesRead grows by another 4096 on every iteration of the loop, and that inflated value then gets added to byteCounter each time through. The result is that byteCounter grows quadratically with time, eg: (0, 4096, 12288, 24576, 40960, etc) rather than linearly (0, 4096, 8192, 12288, 16384, 20480, etc).

Thanks again. Assuming the files recombine properly (and I see no reason why they shouldn't), this should work just fine.
EDIT:
Geez... I originally scanned over your comments and thought I knew what you were talking about. I guess I was reading what I wanted/expected to read, not what was actually there. I changed the code to address what you mentioned... not what I thought you were saying originally :)

EDIT2: Cleaned up the code a little more. Then again, maybe I added more bugs. That would be my luck.

Anyway, glad it's working.
Lol, I guess I missed your original reply. Sorry if my wording was ambiguous/confusing. Your modified code looks just like what I have running now.
It's on the 5th file now, which should be enough to test the recombination and processing.
Quote:
Only after I looked at your response and the code a couple of times did it click that "wait... something isn't right."
The recombination and processing went smoothly. It turns out that once the fifo fills up, it blocks the cat from appending more data until the processing command pulls data out, which means I can do something as simple as:
Code:
mkfifo myfifo
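# [The remaining commands were cut off in the archive copy; presumably
#  something along these lines, assuming the splitter was run with its
#  default acquisition-%d.dat pattern (the 5-digit zero pad keeps the
#  glob in the right order)]
cat acquisition-*.dat > myfifo &
./process -i myfifo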