LinuxQuestions.org
Old 02-15-2012, 11:29 PM   #1
suicidaleggroll
LQ Guru
 
Splitting a real-time data stream into multiple files


Long story short, I have a system that acquires data from an acquisition board at a rate of approx 2.7MB/s. The data rate is not the problem; the problem is that it's continuous, 24/7. That comes out to over 200GB per day.

My problem is that the data comes out in binary format from a closed-source program that prints to stdout. I need a way to split the constant data stream out to a new file every ~10 minutes (roughly 1.6GB per file at this rate) to keep the file sizes manageable, without losing anything. I also need to be able to then re-combine the data in some way for processing.

Typically, the two codes (acquisition and processing) are run as a pair, piping the output of acquisition straight into processing, ie:

Code:
./acquisition | ./process -i /dev/stdin
I need a way to split these two processes, so the output of acquisition is saved for post-processing, ie:

Code:
./acquisition > outfile
Then sometime later on...
Code:
./process -i outfile
The way I have described works fine, but the "outfile" becomes unusably large after just a couple of hours (at 2.7MB/s it grows by nearly 10GB per hour).

On the processing front, I'm envisioning some sort of timed cat into a fifo, ie:
Code:
mkfifo myfifo
./process -i myfifo &   # reader runs in the background (or in another terminal)
cat file1 >> myfifo
sleep 30
cat file2 >> myfifo
sleep 30
# ...and so on
But I'm not sure what to do on the acquisition side of things.

I have a decent amount of experience in BASH scripting and FORTRAN programming, I'm just not sure what the best approach is for this situation.

Last edited by suicidaleggroll; 02-15-2012 at 11:31 PM.
 
Old 02-16-2012, 12:03 PM   #2
rugdog
LQ Newbie
 
Hi,

I assume you know where it's safe to cut the file for processing, given that it's a binary stream. An approach, using text files as a sample, would be something like this in perl:

Code:
#!/usr/bin/perl
# split stdin into numbered chunk files, dropping an empty ready.* marker
# file once each chunk is complete
$file_count = 0;
$size = 0;
open(FH, "> outfile_$file_count");
while (<>) {
	if ( $size == 10 ) {
		close(FH);
		# marker file signals that outfile_$file_count is ready to process
		open(D, "> ready.outfile_$file_count");
		close(D);
		++$file_count;
		open(FH, "> outfile_$file_count");
		$size = 0;
	}
	print FH $_;
	++$size;
}
close(FH);
Say the above script is named splitter.pl; you'd call it like:

Code:
./acquisition | ./splitter.pl

and you'll get a list of files like this:

outfile_0
outfile_1
...
ready.outfile_0
ready.outfile_1
...

Then your process program would need to check whether a new ready.* file has appeared, process the corresponding chunk, and remove the ready.* file.

Here the $size threshold in the script splits every 10 lines, but the value is up to you, depending on how much data you want to process at a time.
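
A minimal sketch of such a watcher in the shell (hypothetical; it assumes the marker scheme above and that your process command can take a chunk file via -i):

Code:
#!/bin/bash
# hypothetical watcher: handle each chunk once its ready.* marker shows up
while true; do
    for marker in ready.outfile_*; do
        [ -e "$marker" ] || continue    # glob matched nothing yet
        chunk=${marker#ready.}          # e.g. ready.outfile_3 -> outfile_3
        ./process -i "$chunk" && rm -f "$marker" "$chunk"
    done
    sleep 5
done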
 
Old 02-22-2012, 12:15 PM   #3
suicidaleggroll
LQ Guru
 
Original Poster
Thanks, this works great.
The only issue is that the processor usage is a bit high, since this is running on a small 2GHz Atom. The acquisition code alone is around 80% CPU, and this perl script adds another 10-15%.

Since I can barely tell my hand from my foot in perl, is there any way to reduce the overhead? Possibly by grabbing a block of lines at once? Ideally I'd like the splitting routine's overhead to be below 5%, since if the processor becomes saturated, it can start missing samples from the acquisition board.
 
Old 02-22-2012, 02:49 PM   #4
Dark_Helmet
Senior Member
 
Two thoughts come to mind.

1. There may be a way to combine tee and split (or some other similar utilities) to accomplish what you need (see the sketch after this list). Though, to be honest, I cannot think of a combination using those two that doesn't have a problem.

2. A "simple" C program. In fact, I've already got something coded up, but I haven't tested it.

EDIT:
Forgot to give a usage example earlier...
Code:
./acquisition | ./simple_c_program | ./process -i /dev/stdin
/EDIT

a. Save it as whatever.c
b. Build it with "gcc -o whatever -lm whatever.c"
c. Run it with "./whatever [PATTERN]" or place it somewhere in your PATH to get rid of the './'

Replace whatever with what you want the program to be named and whatever.c with whatever you save the source code as.

PATTERN is any text to use as a filename. PATTERN can include one and only one instance of '%d'. That '%d' will have a numeric value substituted for it as the program runs. It starts at 0 and increments by one for each file split. The numeric value is also zero-padded and 5-digits wide. Need a bigger pad? Modify the "#define SUBSTITUTION_WIDTH" line to be something greater. A default pattern of "acquisition-%d.dat" will be used if no PATTERN is supplied.
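
For example, a run with a hypothetical custom pattern:

Code:
# produces capture-00000.bin, capture-00001.bin, ... as each break size is hit
./acquisition | ./whatever "capture-%d.bin" | ./process -i /dev/stdin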

The numeric value will cycle back to 0 eventually. You will need to take steps to move any files (to save them from overwriting) if and when the numeric value cycles back to 0.
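
One way to do that (a sketch; assumes GNU mv/xargs, and note the default pattern produces no whitespace in filenames, so parsing ls output is safe here) is a periodic sweep, e.g. from cron:

Code:
# move every completed file into a dated archive directory;
# "ls -t | tail -n +2" skips the newest file, which is still being written
d=$(date +%Y%m%d)
mkdir -p "archive/$d"
ls -t acquisition-*.dat | tail -n +2 | xargs -r mv -t "archive/$d"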

The program attempts to read 4KiB chunks at a time. The files will break at 1.5GiB. The actual file could be up to 4KiB greater--depending on how the fread() calls return. Again, if you want bigger or smaller files, modify the "#define DEFAULT_BREAK_SIZE" to be whatever you prefer. The value is measured in bytes and probably shouldn't exceed 32 bits.

Most of the code deals with reading the filename pattern, checking the pattern for errors, and modifying the pattern (to provide the zero pad and fixed-width). The actual read-write loop is relatively short and straightforward.

Lastly, I don't know how this will fare performance-wise, but aside from the 4KiB buffering I don't think the overhead will be significant.

Code:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SUBSTITUTION_PATTERN                                        "%d"
#define SUBSTITUTION_WIDTH                                            5

#define DEFAULT_FILENAME_PATTERN                    "acquisition-%d.dat"

/* default file break size is 1.5GiB */
#define DEFAULT_BREAK_SIZE                                   0x60000000

int main( int argc, char *argv[] )
{
  FILE *outputFile;
  long unsigned byteCounter;
  size_t bytesRead;
  char buffer[4096];

  char *sourcePattern;
  char *filenamePattern;
  char *filename;
  char *literalPercent;
  char *nextPercent;
  unsigned substitutionCount;
  unsigned filenameLength;
  unsigned filenameCounter;
  unsigned counterMax = pow( 10, SUBSTITUTION_WIDTH );

  /* Given a filename pattern on the command line or use default? */
  if( argc > 1 )
    sourcePattern = argv[1];
  else
    sourcePattern = DEFAULT_FILENAME_PATTERN;

  /* Save a copy of the pattern to work with */
  filenameLength = strlen( sourcePattern );
  filenamePattern = malloc( sizeof( char ) * ( filenameLength + 1 ) );
  strncpy( filenamePattern, sourcePattern, filenameLength );
  filenamePattern[filenameLength] = '\0';

  /* Count the number of file number substitutions requested */
  substitutionCount = 0;
  sourcePattern = filenamePattern;
  do
  {
    /* Find first % -- used for error checking */
    nextPercent = strstr( sourcePattern, "%" );

    /* Find first %% -- ok because printf() will replace with a single
       '%' and that will not risk overflowing memory reserved for filename */
    literalPercent = strstr( sourcePattern, "%%" );
    sourcePattern = strstr( sourcePattern, SUBSTITUTION_PATTERN );
    if( nextPercent != sourcePattern &&
	nextPercent != literalPercent )
    {
      fprintf( stderr,
	       "ERROR: Filename pattern uses a printf style substitution\n"
	       "       other than %%d. This is not supported.\n" );
      return 1;
    }
    if( sourcePattern != NULL )
    {
      substitutionCount++;
      sourcePattern += 1;
    }
  } while( sourcePattern != NULL );

  /* Only a single substitution is supported */
  if( substitutionCount > 1 )
  {
    fprintf( stderr,
	     "ERROR: Filename pattern uses %%d substitution more than\n"
	     "       once. This is not supported.\n" );
      return 1;
  }

  /* 
   * Now make a slight modification to the original pattern for zero
   * padded, fixed-width filenameCounter substitutions.
   * The net effect is to change "%d" to "%05d" (or the equivalent, based
   * on the previous #defines)
   */
  sourcePattern = filenamePattern;
  filenameLength = strlen( sourcePattern ) + 2;
  filenamePattern = malloc( sizeof( char ) * ( filenameLength + 1 ) );
  nextPercent = strstr( sourcePattern, SUBSTITUTION_PATTERN );
  if( nextPercent == NULL )
  {
    fprintf( stderr,
	     "ERROR: Filename pattern must contain one %%d substitution.\n" );
    return 1;
  }
  *nextPercent = '\0';
  snprintf( filenamePattern, filenameLength + 1, "%s%%0%dd%s", sourcePattern,
	    SUBSTITUTION_WIDTH, nextPercent + strlen( SUBSTITUTION_PATTERN ) );
  filenamePattern[filenameLength] = '\0';

  /* Can now free the original pattern because we have the zero padded,
     fixed-width version in filenamePattern */
  free( sourcePattern );

  /* Reserve memory large enough for the pattern and all requested
     substitutions */
  filenameLength = strlen( filenamePattern ) + SUBSTITUTION_WIDTH -
    ( strlen( SUBSTITUTION_PATTERN ) + 2 );
  filename = malloc( sizeof( char ) * ( filenameLength + 1 ) );
  filename[filenameLength] = '\0';


  /* start the main read-write loop */
  byteCounter = 0;
  filenameCounter = 0;
  outputFile = NULL;
  while( ( bytesRead = fread( buffer, sizeof( char ), 4096, stdin ) ) > 0 )
  {
    /* Open the next output file at the start of each chunk */
    if( outputFile == NULL )
    {
      snprintf( filename,
		filenameLength + 1,
		filenamePattern,
		filenameCounter );

      outputFile = fopen( filename, "w" );
      if( outputFile == NULL )
      {
	fprintf( stderr,
		 "ERROR: Unable to open \"%s\" for writing.\n",
		 filename );
	return 1;
      }
    }

    /* Pass the data through to stdout and save a copy to the file */
    fwrite( buffer, sizeof( char ), bytesRead, stdout );
    fwrite( buffer, sizeof( char ), bytesRead, outputFile );

    byteCounter += bytesRead;
    if( byteCounter >= DEFAULT_BREAK_SIZE )
    {
      fclose( outputFile );
      outputFile = NULL;
      byteCounter = 0;
      filenameCounter = ( filenameCounter + 1 ) % counterMax;
    }
  }

  if( outputFile != NULL )
    fclose( outputFile );

  free( filenamePattern );
  free( filename );

  return 0;
}

Last edited by Dark_Helmet; 02-22-2012 at 05:01 PM.
 
Old 02-22-2012, 04:36 PM   #5
suicidaleggroll
LQ Guru
 
Original Poster
Wow! Thanks

There were a few bugs with it at first, but I have it running now. I'm going to let it cycle through a few files to make sure the data can be recombined properly afterward. It's only using about 5% of the proc though, less than half of what the perl version used.

If you're interested, the bugs that I found were:
1) outputFile is opened for writing (not append) on every iteration of the loop, so it just overwrites the data each time; at any given moment it only contains the most recent 4096 bytes of data. I moved the block of code that opens outputFile inside the "byteCounter >= DEFAULT_BREAK_SIZE" if statement, after filenameCounter is incremented. I also copied it to before the while loop to initialize the first file.

2) bytesRead += fread( buffer, sizeof( char ), 4096, stdin );
should be
bytesRead = fread( buffer, sizeof( char ), 4096, stdin );

otherwise bytesRead grows by 4096 on every iteration of the loop, and that ever-growing value gets added to byteCounter each time through. The result is that byteCounter increases quadratically with time, eg: (0, 4096, 12288, 24576, 40960, etc) rather than linearly (0, 4096, 8192, 12288, 16384, 20480, etc).


Thanks again. Assuming the files recombine properly (which I see no reason why they shouldn't), this should work just fine.

Last edited by suicidaleggroll; 02-22-2012 at 04:37 PM.
 
Old 02-22-2012, 04:41 PM   #6
Dark_Helmet
Senior Member
 
EDIT:
Geez... I originally scanned over your comments and thought I knew what you were talking about. I guess I was reading what I wanted/expected to read, but not what was actually there.

I changed the code to address what you mentioned... not what I thought you were saying originally.

EDIT2:
Cleaned up the code a little more. Then again, maybe I added more bugs. That would be my luck. Anyway, glad it's working.

Last edited by Dark_Helmet; 02-22-2012 at 05:02 PM.
 
Old 02-22-2012, 05:03 PM   #7
suicidaleggroll
LQ Guru
 
Original Poster
Lol, I guess I missed your original reply. Sorry if my wording was ambiguous/confusing. Your modified code looks just like what I have running now.

It's on the 5th file now, which should be enough to test the recombination and processing.
 
Old 02-22-2012, 05:10 PM   #8
Dark_Helmet
Senior Member
 
Quote:
Originally Posted by suicidaleggroll
Sorry if my wording was ambiguous/confusing.
Not at all. You were quite clear. The areas of the code you were referencing had other bugs in them when I originally posted the code. I fixed them in a later edit. So when you started talking about those areas, I immediately assumed you were talking about the pre-edit bugs.

Only after I looked at your response and the code a couple of times did it click that "wait... something isn't right."
 
Old 02-24-2012, 11:04 AM   #9
suicidaleggroll
LQ Guru
 
Original Poster
The recombination and processing went smoothly. It turns out that once the fifo fills up, it blocks the cat from appending more data until the processing command pulls some out, which means I can do something as simple as:

Code:
mkfifo myfifo
./process -i myfifo

then in another terminal:

Code:
for i in acquisition*; do
   echo "Processing $i"
   cat "$i" >> myfifo
done
and it works beautifully
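
For anyone finding this later: since the zero-padded filenames sort chronologically and the shell expands globs in sorted order, the recombination can even be a single cat, which also keeps the fifo open as one continuous stream (a sketch; assumes the default acquisition-%d.dat pattern from the C program):

Code:
mkfifo myfifo
./process -i myfifo &             # reader in the background (or another terminal)
cat acquisition-*.dat > myfifo    # one writer; files concatenated in sorted order
wait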
 
  

