LinuxQuestions.org
Linux - Software This forum is for Software issues.
Old 01-23-2013, 01:09 PM   #1
d072330
Member
 
Registered: Nov 2007
Location: USA
Distribution: CentOS 5/6
Posts: 186

Rep: Reputation: 6
Perl Array to remove duplicates


I have tried several different methods and cannot figure this out. I have removed duplicate entries from arrays before, but for some reason this time the duplicates are not being removed.

I am trying to get the file size of every file in a directory and then remove all duplicate entries, which should leave me with four or five file sizes that I will use later in my script.

Here are parts of the code:

Quote:
@files = `ls $indir`; # get files in the directory into an array

foreach $files (@files)
{
    chomp $files;
    if ($files)
    {
        $filesize = -s $files;

        @na = $filesize;
        @uniq = uniq @na;
        print "@uniq\n";
When I print the @uniq array I get the same output as I do by just printing the @na array. Any thoughts as to why the uniq is not stripping out the duplicate entries? I would expect the output below to just be three numbers.

Quote:
Output:
4807088
4807088
4807088
4807088
4807088
4807088
4807088
4807088
4807088
57683648
57683648
57683648
57683648
57683648
57683648
57683648
20
 
Old 01-23-2013, 02:06 PM   #2
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
For all Perl-related questions, I suggest you pay a visit to http://www.perlmonks.org, where you will almost instantly get answers to questions like these.

Since I haunt both places, I can say that by far the easiest way to do this is with a hash ... an associative array. Use a statement such as:
%myhash->{$filesize} = 1;

The only purpose of this statement is to define a key corresponding to the $filesize ... you don't care at all about the value assigned (which happens to be "1"). What you do know, however, is that every key in a hash is unique. Therefore, after you have processed all your files, you can now iterate through this hash with something like:
foreach my $size (keys (%myhash)) { ... }

The loop will iterate through the list of keys that exist ... each one of these keys corresponds to a file-size that was encountered, and occurs only once. Q.E.D.

Perl is a very rich and expressive language (despite its warts, which everybody knows to tolerate), with an enormous library of tested packages in its so-called CPAN library ... including packages to iterate through file-directories and so on.

You will in time discover why Perl is referred to as "the Swiss Army® Knife of practical programming." Take the time necessary to really get to know this tool in particular.

Last edited by sundialsvcs; 01-23-2013 at 02:10 PM.
 
Old 01-23-2013, 02:24 PM   #3
fl0
Member
 
Registered: May 2010
Location: Germany
Distribution: Slackware
Posts: 105

Rep: Reputation: 34
Hi,

uniq is not a built-in Perl function. Where do you get it from? From List::MoreUtils?

Hint: try
Code:
perldoc -q duplicate
regards fl0
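For reference, the FAQ entry that perldoc -q duplicate points at (perlfaq4) boils down to the classic %seen idiom. A minimal sketch, using sample sizes borrowed from the output quoted earlier in this thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# perlfaq4's classic de-dup: %seen counts occurrences, and grep keeps
# only the first appearance of each value.
my @sizes = (4807088, 4807088, 57683648, 57683648, 20);
my %seen;
my @uniq = grep { !$seen{$_}++ } @sizes;

print "@uniq\n";   # 4807088 57683648 20
```

This needs no extra module and preserves first-seen order.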

Last edited by fl0; 01-23-2013 at 02:28 PM.
 
1 member found this post helpful.
Old 01-23-2013, 02:34 PM   #4
d072330
Member
 
Registered: Nov 2007
Location: USA
Distribution: CentOS 5/6
Posts: 186

Original Poster
Rep: Reputation: 6
@sundialsvcs - thanks for the reply. I have not had the need to use keys yet, so bear with me if I have more questions. Please and thank you!

@fl0 - Yes I am using the following:

Quote:
use List::MoreUtils qw(uniq);
use Data::Dumper qw(Dumper);
 
Old 01-23-2013, 02:41 PM   #5
fl0
Member
 
Registered: May 2010
Location: Germany
Distribution: Slackware
Posts: 105

Rep: Reputation: 34
OK, when I run this test code it works. Can you post more of your code? Maybe you need to move the uniq out of the foreach loop.

Code:
 
#!/usr/bin/perl

use strict;
use warnings;
use List::MoreUtils qw( uniq );

my @files = qw(a a a b b m j h g);

foreach my $file (@files){

    print "$file\n";

}

my @uniq_file = uniq @files;

print "@uniq_file\n";

Last edited by fl0; 01-23-2013 at 02:52 PM.
 
1 member found this post helpful.
Old 01-23-2013, 02:54 PM   #6
d072330
Member
 
Registered: Nov 2007
Location: USA
Distribution: CentOS 5/6
Posts: 186

Original Poster
Rep: Reputation: 6
Plugging away at this, but the first run gave me this message after adding my %hash->{$filesize} = 1;

Quote:
Using a hash as a reference is deprecated at /usr/local/bin/./segdcat.pl line 52 (#1)
    (D deprecated) You tried to use a hash as a reference, as in
    %foo->{"bar"} or %$ref->{"hello"}. Versions of perl <= 5.6.1
    used to allow this syntax, but shouldn't have. It is now deprecated,
    and will be removed in a future version.
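The warning is about the arrow: %hash->{...} treats the hash as a reference. A single element of a plain hash is written with a $ sigil and no arrow. A minimal sketch with sample sizes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# %sizes_seen is a plain hash; one element of it is written
# $sizes_seen{...} with no arrow. The %hash->{...} form is what
# triggers the deprecation warning above.
my %sizes_seen;
for my $filesize (4807088, 4807088, 57683648, 20) {
    $sizes_seen{$filesize} = 1;
}

my @unique_sizes = sort { $a <=> $b } keys %sizes_seen;
print "@unique_sizes\n";   # 20 4807088 57683648
```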
 
Old 01-23-2013, 02:56 PM   #7
d072330
Member
 
Registered: Nov 2007
Location: USA
Distribution: CentOS 5/6
Posts: 186

Original Poster
Rep: Reputation: 6
Code:
#!/usr/bin/perl

#######################
### Script Settings ###
#######################
use diagnostics;
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use Data::Dumper qw(Dumper);

########################
### Variables ###
########################
my $filebeg = "cont";
my $filext = ".sgd";
my ($indir, $filename, $startf, $endf, $start, $end, $files);
my ($zeros, $outdir, $outdisk, $filesize, $filesizechk, $filenum);
my (@files, @na, @uniq);
my $size1 = "size1.txt";
my $size2 = "size2.txt";
my $oddballs = "oddballs.txt";

####################
### Main Script ###
####################

### Get Folder name that the SGD files are in ###
$indir = "/home/prouser/concatdir/indir";

### Get the Folder name that you want to output Concats to ###
$outdir = "/home/prouser/concatdir/outdir";

### Get the starting file number without zeros, cont and .sgd ###
#print "Enter Starting File # (ex: cont000001.sgd): ";
#chomp ($filename = <stdin>);
$filename = "cont000001.sgd";

### Check the size of the first file in the range ###
$filesizechk = -s "$indir/$filename";

### Put all the files in the range into an array to work with ###
@files = `ls $indir`;

foreach $files (@files)
{
    chomp $files; ### Chomp off the carriage return ###
    if ($files)
    {
        $filesize = -s $files; ### Get the filesize of each file in the range ###

        if ($filesize == $filesizechk) ### Check if the filesize matches first filesize in range ###
        {
            ### Print file names to a file for use in ProMAX ###
            #print "size1 ::: $files\n";
            #`ls $files >> $outdir/$size1`;
        }
        elsif ($filesize != $filesizechk)
        {
            ### Print file names to a file for use in ProMAX ###
            #print "size2 ::: $files\n";
            #`ls $files >> $outdir/$size2`;
        }
        else
        {
            ### Print oddball sized files to oddball file ###
            #print "oddballs ::: $files\n";
            #`ls $files >> $outdir/$oddballs`;
        }
    }
}
#####################
### End of Script ###
#####################

Last edited by d072330; 01-23-2013 at 03:01 PM.
 
Old 01-23-2013, 02:56 PM   #8
fl0
Member
 
Registered: May 2010
Location: Germany
Distribution: Slackware
Posts: 105

Rep: Reputation: 34
Can you please post the complete code?

EDIT: OK, I was too slow.

Last edited by fl0; 01-23-2013 at 02:58 PM.
 
Old 01-23-2013, 02:57 PM   #9
d072330
Member
 
Registered: Nov 2007
Location: USA
Distribution: CentOS 5/6
Posts: 186

Original Poster
Rep: Reputation: 6
@fl0 - I did try that in a separate script already and it worked, so I have been thinking the same as you: it may need to be moved outside of the loop or something.
 
Old 01-23-2013, 03:10 PM   #10
fl0
Member
 
Registered: May 2010
Location: Germany
Distribution: Slackware
Posts: 105

Rep: Reputation: 34
OK, so is your problem solved?
 
Old 01-23-2013, 03:16 PM   #11
d072330
Member
 
Registered: Nov 2007
Location: USA
Distribution: CentOS 5/6
Posts: 186

Original Poster
Rep: Reputation: 6
nope!
 
Old 01-23-2013, 03:21 PM   #12
fl0
Member
 
Registered: May 2010
Location: Germany
Distribution: Slackware
Posts: 105

Rep: Reputation: 34
Can you post the current script? I cannot find the uniq section in the script you posted. Also, can you describe what exactly you want to do?

Last edited by fl0; 01-23-2013 at 03:26 PM.
 
Old 01-23-2013, 03:36 PM   #13
d072330
Member
 
Registered: Nov 2007
Location: USA
Distribution: CentOS 5/6
Posts: 186

Original Poster
Rep: Reputation: 6
add this under $filesize = -s $files;

Quote:
@na = $filesize;
@uniq = uniq @na;
print "@uniq\n";
I had taken it out because it reproduced the same results as print $filesize.
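Not spelled out in the thread, but the likely culprit in that snippet: assigning a scalar to an array replaces the array's entire contents with that one element on every pass of the loop, so uniq never has more than one value to look at. A sketch of the difference, using the same variable names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "@na = $filesize;" inside the loop REPLACES the whole array with a
# single element each iteration, so uniq has no duplicates to remove.
# push accumulates across iterations instead:
my @sizes = (4807088, 4807088, 57683648);
my @na;
for my $filesize (@sizes) {
    push @na, $filesize;    # was: @na = $filesize;
}

# de-dup once, after the loop (uniq from List::MoreUtils does the same)
my %seen;
my @uniq = grep { !$seen{$_}++ } @na;
print "@uniq\n";   # 4807088 57683648
```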
 
Old 01-23-2013, 03:40 PM   #14
d072330
Member
 
Registered: Nov 2007
Location: USA
Distribution: CentOS 5/6
Posts: 186

Original Poster
Rep: Reputation: 6
I need to sort a 1 TB drive by file size and then output the file names into text files, 9000 at a time. The program we use will only accept 9000 files at one time.

Usually there are only about three different file sizes. If I can get the file sizes correctly output to a variable, I can then format my if statements to sort 9000 files at a time into the output text files.

The current disk I am working on has 32,000 files on it and 4 different file sizes.

Clear as mud? LOL
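The 9000-at-a-time step described above can be sketched with splice, which pulls at most a fixed number of names off the front of the array per pass. This is an editor's illustration only; the file names and the "filelist_N.txt" output paths are invented, not from the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical batching sketch: each pass takes up to $batch_size names
# and writes them to their own list file.
my $batch_size = 9000;
my @files      = map { sprintf "cont%06d.sgd", $_ } 1 .. 20000;

my $batch = 0;
while ( my @chunk = splice @files, 0, $batch_size ) {
    $batch++;
    open my $fh, '>', "filelist_$batch.txt" or die "open: $!";
    print {$fh} map { "$_\n" } @chunk;
    close $fh or die "close: $!";
}
print "$batch\n";   # 3 list files for 20000 names
```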
 
Old 01-23-2013, 03:41 PM   #15
fl0
Member
 
Registered: May 2010
Location: Germany
Distribution: Slackware
Posts: 105

Rep: Reputation: 34
OK, to understand what you are doing:

Why do you need to uniq the array? Are there files with the same size in the directory?


Here is a quick version for the first part of your script. Not tested, but much simpler, and it should do what you want (hopefully).



Code:
#!/usr/bin/perl

use strict;
use warnings;

use List::MoreUtils qw( uniq );
use Data::Dumper;

my $input_dir = '/home/prouser/concatdir/indir';

# get all entries in the directory
my @files = glob "$input_dir/*.sgd";

my %file_sizes;

foreach my $file ( @files ){

    # add the filename as key and the size as value
    $file_sizes{ $file } = -s $file if not -z $file;

}

print Dumper( \%file_sizes );
With this code you can sort/uniq the hash by its values, and then iterate over the hash in batches of 9000.
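One way to get the distinct sizes out of such a name => size hash (a sketch continuing the idea above; the sample data is invented, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);

# %file_sizes maps file name => size; uniq over the values gives the
# distinct sizes, and a numeric sort makes the order deterministic.
my %file_sizes = (
    'cont000001.sgd' => 4807088,
    'cont000002.sgd' => 4807088,
    'cont000003.sgd' => 57683648,
);

my @distinct_sizes = sort { $a <=> $b } uniq values %file_sizes;
print "@distinct_sizes\n";   # 4807088 57683648
```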

EDIT: OK, I removed the first_file_check; it was not working.

Last edited by fl0; 01-23-2013 at 04:02 PM.
 
1 member found this post helpful.
  

