How to split a file into more sub files

michaelyu33 · 04-13-2006, 03:48 PM

Hello All,

I have a report file in plain text format, which my Web team created by concatenating 5 different BASE files. In the report file, the first 20 lines are the first base file, followed by 2 blank lines, then the second base file, then 2 blank lines, and so on up to the 5th base file.

It looks like the following (In my report file):
12, test1
....
....
20, test20

get, report2, name
fiad, dfdfd, dfdfd
....
....
....
dff, fdfd, fdfd

get, report3, file, time
dfd, fdfd, fdfd, rrf
...
...
fdfd, hhg, ere, erer

What I want is to split the one report file back into the five base files. In the report file, the 5 portions are separated by 2 blank lines. I have tried csplit, awk, and cut, but none of them worked. Please help...

toreric · 04-13-2006, 05:34 PM

Try the line editor ed. I haven't used it recently, but I would apply it like this:

First read the file, then
repeat until the file is empty:
find the double empty lines
write line (1,.) to a new file
delete line (1,.)
endrepeat
done!

Read "man ed" and work it out!

michaelyu33 · 04-14-2006, 08:30 AM

Thank you so much for the reply. I will try that. If possible, would you please provide a sample code?

Thanks
Michael

michaelyu33 · 04-14-2006, 09:14 AM

I have found the exact post on this forum at http://www.linuxquestions.org/questions/showthread.php?t=182909. But when I tried the Perl script in that post, it didn't work. I have modified the Perl script as follows:
#!/usr/bin/perl
#
use strict;
use IO::Handle;

my ($line, $nr);

my $thebigfile = "/home/oracle/projects/Achaya/test/wbreports.txt"; # input file location
my $logfile = "/home/oracle/projects/Achaya/test/newwb"; # output files basename

my ($previousFileTimeSize, $currentFileTimeSize);
$previousFileTimeSize = 1;

print "START\n";
open(LOGFILE, ">$logfile");
LOGFILE->autoflush(1);
while (1) {
    $currentFileTimeSize = (stat($thebigfile))[7]; # size
    print $currentFileTimeSize;
    if ($currentFileTimeSize != $previousFileTimeSize) {
        print LOGFILE scalar localtime;
        print LOGFILE ": sent-mail MODIFIED\n";
        $previousFileTimeSize = $currentFileTimeSize;
    } else {
        print LOGFILE scalar localtime;
        print LOGFILE ": sent-mail no modification\n";
    }
    sleep 30;
}
close LOGFILE;

Unfortunately, no output file was generated.

Michael

david_ross · 04-14-2006, 09:30 AM

Using head and tail would probably be quicker:
#!/bin/bash

head -n 20 /tmp/report.txt > /tmp/part.1
head -n 42 /tmp/report.txt | tail -n 20 > /tmp/part.2
head -n 64 /tmp/report.txt | tail -n 20 > /tmp/part.3
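head -n 86 /tmp/report.txt | tail -n 20 > /tmp/part.4
# fifth part, assuming it is also 20 lines long like the others
head -n 108 /tmp/report.txt | tail -n 20 > /tmp/part.5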
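
The same offsets can be computed in a loop instead of being typed by hand. This is only a sketch, and it assumes every section is exactly 20 lines with exactly 2 blank lines in between, which the original post only states for the first section. The function name split_fixed and its arguments are made up for illustration.

```shell
#!/bin/bash
# Sketch: cut a report into fixed-size sections with sed.
# Assumed layout: each section is 20 lines, followed by 2 blank lines.
split_fixed() {
    local input=$1 prefix=$2 count=$3
    local size=20 gap=2
    local i start end
    for ((i = 1; i <= count; i++)); do
        # Section i starts after (i - 1) full sections plus their gaps.
        start=$(( (i - 1) * (size + gap) + 1 ))
        end=$(( start + size - 1 ))
        # Print only lines start..end into the i-th output file.
        sed -n "${start},${end}p" "$input" > "${prefix}.${i}"
    done
}
```

Calling it as `split_fixed /tmp/report.txt /tmp/part 5` would write /tmp/part.1 through /tmp/part.5. If the sections are not all the same length, this approach does not apply and a pattern-based split is needed.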

Dogmatix · 04-14-2006, 10:12 AM

Csplit is a contextual splitter, so you can split files depending on matching lines. Head and tail would work, but you don't need to know how many lines there are with csplit.

For your file, run something like this:

csplit -z infile /"get,"/ '{*}'

You'll get files xx00, xx01, etc. xx00 contains the part of infile from the start up to the first matching line. xx01 contains the matching line and up to the next matching line. Etc.

Check the man page for more stuff that it'll do.
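
Here is a small run you can try in a scratch directory, just as a sketch. The sample data is a shortened, made-up version of the report from the first post, and it assumes GNU csplit and that every report after the first starts with a line beginning with "get,". The -f and -b options set the output prefix and suffix format, and -s suppresses the byte counts.

```shell
#!/bin/bash
# Sketch: split a sample report at each line that starts with "get,".
# The sample content below is hypothetical and shortened.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > infile <<'EOF'
12, test1
20, test20


get, report2, name
fiad, dfdfd, dfdfd


get, report3, file, time
dfd, fdfd, fdfd, rrf
EOF

# -s: quiet, -z: drop empty output files,
# -f: output prefix, -b: printf-style suffix (part.0, part.1, ...)
csplit -s -z -f part. -b '%d' infile '/^get,/' '{*}'

ls part.*
```

Note that the two blank lines stay at the end of each piece except the last, because csplit cuts just before the matching line.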

Dogmatix

edit: fixed argument order...

toreric · 04-14-2006, 10:14 AM

Or simply use ed. If 'text5.txt' is the file with the five sections of arbitrary length subdivided by double empty lines, and if you prepare the content of the file 'edinp' like this:

Code:

e text5.txt
/^$/
/^$/
1,.w part1
1,.d
/^$/
/^$/
1,.w part2
1,.d
/^$/
/^$/
1,.w part3
1,.d
/^$/
/^$/
1,.w part4
1,.d
/^$/
/^$/
1,.w part5
q

Then run the command 'ed < edinp' to produce the five part# files.

P.S. You may extend edinp with more part#s than are actually present in the input file with no harm. Then ed will gracefully exit with ?. And, of course, this approach lets you change the section-locating regexp(s) to something more relevant for each section in cases where two empty lines wouldn't suffice. Nice old line editor!

michaelyu33 · 04-14-2006, 01:09 PM

Thank you all for your help. Because I will get the big report file daily, I need a program to split it into 5 small files based on the blank lines. I tried csplit, but it just won't take the blank lines as a pattern to split the file. I would like to stick with the solution posted in the previous post at: http://www.linuxquestions.org/questi...d.php?t=182909.

It looks like that's the reasonable solution for my case. Unfortunately, I am not a Perl guy. I got stuck on writing the output to the file.

Thanks and have a great weekend
Michael

Dogmatix · 04-15-2006, 09:09 AM

Most *nix utilities are line-based, so you can't search for two blank lines, only one. I thought each report had a similar header that you could search for ("get," in your example). If not, csplit won't work. Another thought: you could use sed to look for a blank line and replace it with a unique string, then use csplit to find it, then use sed again to change the unique string back to a blank line.

Searching for a blank line in csplit is easy. Use "/^$/" as the regexp. You'll end up with 9 files, though, since 4 of them will be just one blank line. If you just ignore them and use xx00, xx02, xx04, xx06, and xx08, you'll have your 5 reports.
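
One tool that does treat a run of blank lines as a single separator is awk in "paragraph mode" (RS set to the empty string). The sketch below relies on that, and it assumes that no blank line ever appears inside a report, since any blank line would start a new piece. The function name split_on_blank_runs and the output naming are invented for the example.

```shell
#!/bin/bash
# Sketch: split a file into pieces separated by one or more blank lines.
# Assumption: the reports themselves contain no blank lines.
split_on_blank_runs() {
    local input=$1 prefix=$2
    # RS= (empty) makes awk read blank-line-separated blocks as records;
    # NR is then the block number, used as the output file suffix.
    awk -v RS= -v prefix="$prefix" '{
        out = prefix "." NR
        print > out
        close(out)
    }' "$input"
}
```

For the daily report this could be called as `split_on_blank_runs wbreports.txt newwb`, which would give newwb.1 to newwb.5 with no leftover blank-line files to skip.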

Or, you could whip up a perl script. I'd be tempted to write a short C program since I don't know perl either.

Dogmatix

toreric · 04-15-2006, 09:23 AM

Or, if you know neither Perl nor C/C++ very well but do know some Bash: just write a small Bash script that uses the ed line editor. It's a straightforward way to get the desired functionality in five minutes...
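
For anyone who would rather not depend on ed, here is a rough pure-Bash sketch of the same idea. It is not the ed approach itself but a plain read loop, and it starts a new output file only after two or more consecutive blank lines, so a single blank line inside a report is kept. The function name and the part1, part2, ... naming are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: start a new output file after 2 or more consecutive blank lines.
# A single blank line is treated as part of the current section.
split_on_double_blank() {
    local input=$1 prefix=$2
    local n=1 pending=0 line
    : > "${prefix}${n}"                      # create/truncate the first part
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            pending=$((pending + 1))         # count blank lines, decide later
            continue
        fi
        if [ "$pending" -ge 2 ]; then
            n=$((n + 1))                     # separator seen: next part
            : > "${prefix}${n}"
        elif [ "$pending" -eq 1 ]; then
            echo >> "${prefix}${n}"          # keep a lone blank line
        fi
        pending=0
        printf '%s\n' "$line" >> "${prefix}${n}"
    done < "$input"
}
```

Run as `split_on_double_blank text5.txt part` to produce part1, part2, and so on. Trailing blank lines at the end of the input are dropped, and the loop is slow on very large files compared with ed or awk.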