How to split a file into several sub-files
Hello All,
I have a report file in plain text format, which my Web team created by concatenating 5 different BASE files. In the report file, the first 20 lines are the first base file. Then come 2 blank lines, then the second base file, then 2 blank lines, and so on until the 5th base file. It looks like the following (in my report file):

12, test1
....
....
20, test20


get, report2, name
fiad, dfdfd, dfdfd
....
....
....
dff, fdfd, fdfd


get, report3, file, time
dfd, fdfd, fdfd, rrf
...
...
fdfd, hhg, ere, erer

What I want is to split the one report file back into the five base files. In the report file, the 5 portions are separated by 2 blank lines. I have tried csplit, awk, and cut. None of them worked. Please help...
Try the line editor ed. I haven't used it recently but would apply it like
First read the file, then repeat until the file is empty:
  find the double empty lines
  write lines (1,.) to a new file
  delete lines (1,.)
endrepeat
Done! Read "man ed" and work it out!
Split one big file to 5 files
Thank you so much for the reply. I will try that. If possible, would you please provide a sample code?
Thanks,
Michael
I have found the exact post on this forum at http://www.linuxquestions.org/questions/showthread.php?t=182909. But when I tried the Perl script in that post, it didn't work. I have modified the Perl script as follows:
Code:
#!/usr/bin/perl
#
use strict;
use IO::Handle;

my ($line, $nr);
my $thebigfile = "/home/oracle/projects/Achaya/test/wbreports.txt"; # input file location
my $logfile = "/home/oracle/projects/Achaya/test/newwb"; # output files basename
my ($previousFileTimeSize, $currentFileTimeSize);
$previousFileTimeSize = 1;

print "START\n";
open(LOGFILE, ">$logfile");
LOGFILE->autoflush(1);

# Poll the input file size every 30 seconds and log whether it changed
while (1) {
    $currentFileTimeSize = (stat($thebigfile))[7]; # size
    print $currentFileTimeSize;
    if ($currentFileTimeSize != $previousFileTimeSize) {
        print LOGFILE scalar localtime;
        print LOGFILE ": sent-mail MODIFIED\n";
        $previousFileTimeSize = $currentFileTimeSize;
    } else {
        print LOGFILE scalar localtime;
        print LOGFILE ": sent-mail no modification\n";
    }
    sleep 30;
}
close LOGFILE;

Unfortunately, no outfile was generated.
Michael
Using head and tail would probably be quicker:
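Code:
#!/bin/bash
# Assumes every section is 20 lines long and is followed by 2 blank lines
head -n 20 /tmp/report.txt > /tmp/part.1
head -n 42 /tmp/report.txt | tail -n 20 > /tmp/part.2
head -n 64 /tmp/report.txt | tail -n 20 > /tmp/part.3
head -n 86 /tmp/report.txt | tail -n 20 > /tmp/part.4
head -n 108 /tmp/report.txt | tail -n 20 > /tmp/part.5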
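The head/tail arithmetic can also be folded into a loop. What follows is only a minimal sketch, assuming that every section has the same length (20 lines) and that the gap is always exactly 2 blank lines; the thread only states the 20-line length for the first section. The function name `split_fixed` and the `prefix.N` output names are made up for illustration.

```shell
#!/bin/bash
# Sketch: cut a report into fixed-size sections with sed.
# Assumptions (not confirmed by the thread): every section has the same
# number of lines and the gap between sections has a fixed size.
# split_fixed is a hypothetical helper name.
split_fixed() {
    local infile=$1      # report to split
    local prefix=$2      # output files become prefix.1, prefix.2, ...
    local count=${3:-5}  # number of sections
    local len=${4:-20}   # lines per section
    local gap=${5:-2}    # blank lines between sections
    local i=1 start end

    while [ "$i" -le "$count" ]; do
        # First and last line number of section i.
        start=$(( (i - 1) * (len + gap) + 1 ))
        end=$(( start + len - 1 ))
        # Print only lines start..end of the input.
        sed -n "${start},${end}p" "$infile" > "${prefix}.${i}"
        i=$(( i + 1 ))
    done
}
```

Something like `split_fixed /tmp/report.txt /tmp/part` would then write /tmp/part.1 through /tmp/part.5. If the sections turn out to have different lengths, this approach will cut in the wrong places.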
Csplit is a contextual splitter, so you can split files depending on matching lines. Head and tail would work, but you don't need to know how many lines there are with csplit.
For your file, run something like this:
csplit -z infile /"get,"/ '{*}'
You'll get files xx00, xx01, etc. xx00 contains the part of infile from the start up to the first matching line. xx01 contains the matching line and up to the next matching line. Etc. Check the man page for more stuff that it'll do.
Dogmatix
edit: fixed argument order...
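To try the header-based split on a small sample before running it on the real report, the csplit call can be wrapped as below. This is a sketch only: it assumes GNU csplit (the `-z` option and the `'{*}'` repeat count are GNU extensions) and it assumes every section after the first really starts with a line beginning with "get,". The function name `split_on_header` and the output prefix are illustrative.

```shell
#!/bin/bash
# Sketch: split a report at every line that starts with "get,".
# Assumes GNU csplit; split_on_header is a hypothetical helper name.
split_on_header() {
    local infile=$1   # report to split
    local prefix=$2   # output files become <prefix>00, <prefix>01, ...
    # -s: do not print the size of each piece
    # -z: do not keep empty pieces
    # -f: use this prefix instead of the default "xx"
    csplit -s -z -f "$prefix" "$infile" '/^get,/' '{*}'
}
```

For example, `split_on_header report.txt part.` should leave part.00, part.01 and so on in the current directory. Because csplit cuts just before each matching line, the blank lines that separate the sections stay at the end of the preceding piece.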
Or simply use ed. If 'text5.txt' is the file with the five sections of arbitrary length subdivided by double empty lines, and if you prepare the content of the file 'edinp' like this:
Code:
e text5.txt
P.S. You may extend edinp with more part#s than are actually present in the input file with no harm. Then ed will gracefully exit with ?. And, of course, this approach lets you change the section location regexp(s) to something more relevant for each section in cases where two empty lines wouldn't suffice. Nice old line editor!
Thank you all for your help. Because I will get the big report file daily, I need a program to split it into 5 small files based on the blank lines. I tried csplit, but it just won't take the blank lines as a pattern to split the file on. I would like to stick with the solution posted in the previous thread at http://www.linuxquestions.org/questi...d.php?t=182909.
It looks like that's the reasonable solution for my case. Unfortunately, I am not a Perl guy. I got stuck on writing the output to the file.
Thanks and have a great weekend,
Michael
Most *nix utilities are line-based, so you can't search for two blank lines, only one. I thought each report had a similar header that you could search for ("get," in your example). If not, csplit won't work on the header. Another thought: you could use sed to look for a blank line and replace it with a unique string, then use csplit to find that string, then use sed again to change the unique string back to a blank line.
Searching for a blank line in csplit is easy. Use "/^$/" as the regexp. You'll end up with 9 files, though, since 4 of them will be just one blank line. If you just ignore them and use xx00, xx02, xx04, xx06, and xx08, you'll have your 5 reports. Or, you could whip up a perl script. I'd be tempted to write a short C program since I don't know perl either.
Dogmatix
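Besides Perl or C, a short awk script is another way to split on the blank lines themselves, since awk can count consecutive blank lines as it reads. The following is a minimal sketch, not a tested production script: it assumes a section boundary is any run of two or more blank (empty or whitespace-only) lines, and that a single blank line belongs to the section it sits in. The function name `split_on_gap` and the `prefix.N` output names are made up for illustration.

```shell
#!/bin/bash
# Sketch: start a new output file whenever two or more consecutive
# blank lines are seen. split_on_gap is a hypothetical helper name.
split_on_gap() {
    local infile=$1   # report to split
    local prefix=$2   # output files become prefix.1, prefix.2, ...
    awk -v prefix="$prefix" '
        /^[ \t]*$/ { blanks++; next }        # count consecutive blank lines
        {
            if (n == 0 || blanks >= 2) {     # first section, or a gap of 2+ blank lines
                if (n > 0) close(out)
                n++
                out = prefix "." n
            } else {
                # a single blank line inside a section is written back out
                for (i = 0; i < blanks; i++) print "" > out
            }
            blanks = 0
            print > out
        }
    ' "$infile"
}
```

Calling `split_on_gap wbreports.txt newwb` would then produce newwb.1, newwb.2 and so on, one file per section, without the separating blank lines. Since the report arrives daily, a call like this could go into a small wrapper script or a cron job.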
Or, if you know neither Perl nor C/C++ very well but some Bash: Just make a nice Bash script where the Ed line editor is utilized: a straightforward way to obtain the desired functionality in five minutes...