Split a large file and get the names of output files using Perl

Sherlock · 01-30-2007, 02:08 AM

Hi,

I have to take a large file (its name comes from the command line), check its line count, and then split it into smaller files.

I want to capture the names of all the small files that have been created and then parse through them to check for a condition in the date.

All of this has to be accomplished in a Perl script.

Thanks in advance

matthewg42 · 01-30-2007, 02:40 AM

Does it have to split the files at a line break?

Do you want to specify a prefix for the split file names?

When you say "check for a condition in the date", what do you mean? It sounds like these are log files, so maybe each line starts with a date? Can you provide an example line?

Sherlock · 01-30-2007, 02:55 AM

Hi,

I was able to parse through the data file and check the condition but the file name was hardcoded in the perl script.

Now my trouble is with the first part of the problem: splitting the file and getting the names of the files created so that I can parse them.

With

$count=`wc -l Temp.dat`;

I am able to get the count of the lines. Now, to split the large file, I want to use split, but how can I capture the names of the files created?

Thanks

Sherlock · 01-30-2007, 09:28 AM

Does anyone know?

matthewg42 · 01-30-2007, 09:55 AM

If you answer my questions, I can help.

Sherlock · 01-30-2007, 11:42 PM

I am splitting the file using the split -l option.
It is working fine, but I want to capture the names of the generated files programmatically.

For the present, I am hardcoding the names.

I parse each generated file and check whether the length of a particular column is beyond a value.

Regards

matthewg42 · 01-31-2007, 04:45 AM

With the --verbose option, GNU split will print the names of the files it creates to standard output:

Code:

$ split --verbose -l 10 input_file
creating file `xaa'
creating file `xab'
creating file `xac'
creating file `xad'

You can capture it like this:

Code:

open(SPLIT, "split -l 10 --verbose input_file 2>&1 |") || die "couldn't run split : $!\n";

Then all you have to do is read the input and extract the file names. Be careful to escape meta-characters in the regular expression used to extract the filenames:

Code:

my @files;
while (<SPLIT>) {
        if ( /^creating file \`(.*)'$/ ) {
                push(@files, $1);
        }
        else {
                warn "Oh dear, a line of input we can't parse: $_;";
        }
}
close(SPLIT);

# now you have extracted them, you can do what you like...
foreach my $file (@files) {
        print "got a file name: $file\n";
}


Sherlock · 01-31-2007, 08:23 AM

Code:

open(SPLIT, "split -l 10 --verbose input_file 2>&1 |") || die "couldn't run split : $!\n";

Is there any difference between executing this code and running split from the command prompt?

And what is this for in the while loop:

" if ( /^creating file \`(.*)'$/ ) {"

I don't have much idea about Perl... it's an urgent requirement.

matthewg42 · 01-31-2007, 09:41 AM

Quote:

Originally Posted by Sherlock

Code:

open(SPLIT, "split -l 10 --verbose input_file 2>&1 |") || die "couldn't run split : $!\n";

Is there any difference between executing this code and running split from the command prompt?

The program will be invoked the same, except the output will not go to the terminal - it will be readable from the SPLIT filehandle.
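
As a rough illustration of reading a command's output through a filehandle, here is a minimal sketch. It uses the three-argument form of open with a lexical filehandle rather than the bareword SPLIT, and it runs the perl interpreter itself ($^X) as a stand-in command so that it does not depend on which split is installed. Both choices are assumptions made for the example, not part of the code above. Checking the return value of close is one way to notice that the command itself failed.

```perl
use strict;
use warnings;

# Stand-in for an external command such as split: run this same perl
# interpreter ($^X) and have it print two lines. (Assumption for the demo.)
my @cmd = ($^X, '-e', 'print "first line\n", "second line\n"');

# '-|' means "run the command and let me read its standard output".
open(my $pipe, '-|', @cmd) or die "couldn't run $cmd[0]: $!\n";

my @lines;
while (my $line = <$pipe>) {
    chomp($line);
    push(@lines, $line);
}

# close() on a pipe waits for the command to finish; it returns false if
# the command exited with a non-zero status (the status is then in $?).
close($pipe) or die "command failed, wait status $?\n";

print "read: $_\n" for @lines;
```

Nothing appears on the terminal until the script prints it; the command's output arrives only through the filehandle.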

Quote:

And what is this for in the while loop:

" if ( /^creating file \`(.*)'$/ ) {"

I don't have much idea about Perl... it's an urgent requirement.

The m/PATTERN/ operator (see the perlop manual page for full documentation) tests to see if some value (by default $_ - the current input line) matches a regular expression pattern, PATTERN.

/PATTERN/ is an abbreviation of m/PATTERN/.

Regular expressions are the most amazing things, but I'm not going to describe them fully here. You should read the perlre manual page. Parts of a regular expression in (parentheses), if found, are assigned to the variables $1, $2, $3 etc. So the whole if block says: "if the current line of input matches this pattern, push the bit between the parentheses onto the array @files, otherwise print a warning".
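
To make the capture behaviour concrete, here is a small sketch that applies the same pattern to one sample line, copied from the split --verbose output shown earlier. It matches against a named variable with =~ instead of the default $_, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample line in the format printed by split --verbose earlier in the thread.
my $line = "creating file `xaa'\n";

my $name;
if ($line =~ /^creating file `(.*)'$/) {
    # $1 holds whatever the first pair of parentheses matched.
    $name = $1;
    print "captured: $name\n";
}
else {
    print "no match\n";
}
```

The $ at the end of the pattern still matches here because it is allowed to match just before a trailing newline, so there is no need to chomp the line first.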

Sherlock · 02-01-2007, 12:40 AM

Thanks matthew!!!!!

I have a question about push.

for ($i=0;$i < 5;$i++) {
push(@file_array,"a");
}
# shift(@file_array);

When I print $file_array[0]

the output is nothing,
but for index [1] it is a.

If I use shift it is fine: I am able to remove the first non-existent value.

Why does push behave like this?

matthewg42 · 02-01-2007, 03:58 AM

Works for me.

Code:

#!/usr/bin/perl -w

use strict;

my @a;
for(my $i=0; $i<3; $i++) {
        push(@a, "a");
}

foreach my $val (@a) {
        print "got: $val\n";
}

Output is:

Code:

got: a
got: a
got: a

Post your full code and output. Use [code] tags to make it more readable.

Sherlock · 02-01-2007, 07:28 AM

Code:

#!/usr/bin/perl

@file_array=undef;

for ($i=0;$i < 5;$i++) {
    push(@file_array,"a");
}
 # shift(@file_array);

print $file_array[0]; # nothing prints
print $file_array[1]; # a
print $file_array[2]; # a

matthewg42 · 02-01-2007, 08:03 AM

It is because you are initialising the array @file_array with one member, undef, and then pushing "a"s onto it in the for loop. After the for loop, it looks like this: (undef, "a", "a", "a", "a", "a").

There is a difference between doing this:

Code:

@array = undef;

and this:

Code:

@array = ();

The first creates @array with one member, undef. The second creates an array with no members.
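
The difference is easy to see by counting the elements after a single push. This is a minimal sketch; the array names are made up for the example.

```perl
use strict;
use warnings;

my @one_member = undef;   # list assignment of a single value: (undef)
my @no_members = ();      # empty list, no elements at all

push(@one_member, "a");
push(@no_members, "a");

# scalar(@array) gives the number of elements in the array.
print "one_member has ", scalar(@one_member), " elements\n";   # 2
print "no_members has ", scalar(@no_members), " elements\n";   # 1

# The first slot of @one_member is still the original undef.
print "first element defined? ",
      (defined $one_member[0] ? "yes" : "no"), "\n";           # no
```

A freshly declared array (my @array;) is already empty, so the explicit = () is optional.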

By the way, if you started your script like this:

Code:

#!/usr/bin/perl -w

use strict;
...

You would have been warned when you print the first member of the array that you are trying to print an undef. It's good practice to use strict and the -w flag whenever possible. There are a few exceptions when it's not worth it, but they are few and far between.

Sherlock · 02-01-2007, 08:28 AM

Hi matthew,

Thanks for the input!!!

I executed your code and this is the output I got:

bash-2.03$ perl TestSpli.pl
Oh dear, a line of input we can't parse: split: illegal option -- -
; at TestSpli.pl line 10, <SPLIT> chunk 1.
Oh dear, a line of input we can't parse: Usage: split [-l #] [-a #] [file [name]]
; at TestSpli.pl line 10, <SPLIT> chunk 2.
Oh dear, a line of input we can't parse: split [-b #[k|m]] [-a #] [file [name]]
; at TestSpli.pl line 10, <SPLIT> chunk 3.
Oh dear, a line of input we can't parse: split [-#] [-a #] [file [name]]
; at TestSpli.pl line 10, <SPLIT> chunk 4.

I also tried it directly from the shell:

bash-2.03$ split --verbose -l 10 Employees.txt
split: illegal option -- -
Usage: split [-l #] [-a #] [file [name]]
split [-b #[k|m]] [-a #] [file [name]]
split [-#] [-a #] [file [name]]

Regards

matthewg42 · 02-01-2007, 08:41 AM

Looks like you don't have the same version of split that I do - are you using the GNU implementation? What OS are you doing this on?
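
If the installed split has no --verbose option, one way to sidestep the difference is to do the splitting in Perl itself, so the script knows every file name because it created them. The following is only a sketch under stated assumptions: the helper name split_file, its arguments, and the "prefix.0001" naming scheme are all invented for this example and are not something split provides.

```perl
use strict;
use warnings;

# Hypothetical helper: split $input into pieces of at most $lines_per_file
# lines, named "$prefix.0001", "$prefix.0002", ..., and return the names.
sub split_file {
    my ($input, $lines_per_file, $prefix) = @_;
    my @files;
    my $out;

    open(my $in, '<', $input) or die "couldn't open $input: $!\n";
    while (my $line = <$in>) {
        # $. is the line number of the line just read from $in;
        # start a new piece at lines 1, N+1, 2N+1, ...
        if (($. - 1) % $lines_per_file == 0) {
            if ($out) {
                close($out) or die "couldn't close $files[-1]: $!\n";
            }
            my $name = sprintf("%s.%04d", $prefix, scalar(@files) + 1);
            open($out, '>', $name) or die "couldn't create $name: $!\n";
            push(@files, $name);
        }
        print {$out} $line;
    }
    close($in);
    if ($out) {
        close($out) or die "couldn't close $files[-1]: $!\n";
    }
    return @files;
}
```

It could be called as my @files = split_file($ARGV[0], 10, "piece"); and the returned list then takes the place of the names parsed from the split output. An empty input file produces no pieces and an empty list.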