[SOLVED] sequential : how to find the missing numbers within a sequence of files that have sequential numbers attached to them?

scasey · 07-08-2017, 09:59 AM

That

Code:

$this_file =~ /^(.*?)-(\d+?)-(.*?)\.ext$/

is from Laserbeak, and it's being used to test if the file in $this_file matches the pattern required to process it. You have (I think) correctly modified the regex to match the actual pattern [1 hyphen, mp4 extension].

If I understand what Laserbeak is doing (and I could be wrong about this), s/he is populating a hash with the values of the existing numbers in the file names, using the number as the key, and setting the value to 1.

Then sort the hash by the key values. Then iterate over the length of the hash and output the numbers that don't exist in the hash. I haven't tried it yet, but it should work. I admit to having some difficulty understanding/using hash processing. I'm going to play with this script for my own edification. I think there's also something for me to learn about capturing data from a regexp match. Fun!
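For anyone following along in another language, here is a rough Python sketch of the hash idea as described above. It is not Laserbeak's script (which isn't reproduced in this post); the file names, the `NAME_PATTERN` regex and the function name `missing_numbers` are made up for illustration, assuming names like clip-123.mp4.

```python
import re

# Hypothetical pattern for names like "clip-123.mp4"; adjust to the real naming scheme.
NAME_PATTERN = re.compile(r"^.*-(\d+)\.mp4$")


def missing_numbers(filenames, max_number):
    """Return the numbers in 1..max_number that no file name carries."""
    seen = {}
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match:
            # The captured number is the key; the value 1 is just a marker.
            seen[int(match.group(1))] = 1
    # Walk the full range and keep whatever is not a key in the dict.
    return [n for n in range(1, max_number + 1) if n not in seen]


if __name__ == "__main__":
    sample = ["clip-1.mp4", "clip-2.mp4", "clip-4.mp4", "notes.txt"]
    print(missing_numbers(sample, 5))  # prints [3, 5]
```

Walking the range 1..max in order means the keys never need sorting, and the lookup in the dict is what makes it quick.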

My script captures the list of existing numbers in an array [@existnums], and captures the list of all numbers from 1 to entered value (270) in another array [@allnums]. Then iterate over the allnums array and set the entries that match existnums to 0...then print the numbers that are not 0. This is an adaptation of an application I wrote to draw cards from a tarot deck...as each card was drawn, it was saved into "drawn" array, then before the next card was drawn the drawn cards are removed from the "all cards" array -- so the app wouldn't draw the same card twice. My customer said the application gave her as random a draw as using the actual cards.
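The two-array approach in the paragraph above could be sketched in Python roughly like this. The function name `missing_by_zeroing` and the sample numbers are invented for the example; the range check is an extra guard that is an assumption on my part, not something the original script does.

```python
def missing_by_zeroing(existing_numbers, max_number):
    """Zero out the numbers that exist, then report whatever is left."""
    # Index i holds the value i; index 0 is only a placeholder so indices line up.
    all_numbers = list(range(max_number + 1))
    for number in existing_numbers:
        # Assumed guard: ignore anything outside 1..max_number.
        if 1 <= number <= max_number:
            all_numbers[number] = 0
    # No sorting needed: the list was filled in sequence.
    return [n for n in all_numbers if n != 0]


if __name__ == "__main__":
    print(missing_by_zeroing([1, 2, 4], 5))  # prints [3, 5]
```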

I'm glad you have what you need now. I hope we've demonstrated the value of learning regexp. Those same patterns could be applied using sed in a bash script, but I'm even fuzzier about bash arrays and hashes.

Laserbeak · 07-08-2017, 10:05 AM

Quote:

Originally Posted by scasey

If I understand what Laserbeak is doing (and I could be wrong about this), s/he is populating a hash with the values of the existing numbers in the file names, using the number as the key, and setting the value to 1.

Then sort the hash by the key values. Then iterate over the length of the hash and output the numbers that don't exist in the hash. I haven't tried it yet, but it should work. I admit to having some difficulty understanding/using hash processing. I'm going to play with this script for my own edification. I think there's also something for me to learn about capturing data from a regexp match. Fun!

I'm a "he" and you hit the nail on the head as far as the algorithm. In fact, that algorithm is used all the time in real-world Perl applications, I can't imagine how many times I've used something like that. You would be wise to learn it.

scasey · 07-08-2017, 10:23 AM

Quote:

Originally Posted by Laserbeak

I'm a "he" and you hit the nail on the head as far as the algorithm. In fact, that algorithm is used all the time in real-world Perl applications, I can't imagine how many times I've used something like that. You would be wise to learn it.

Just went over it. Very elegant. I'll add it to my library for sure. Thank you.

BW-userx · 07-08-2017, 10:30 AM

@Laserbeak @scasey

well, with this exposure you two have given me to perl, I am now reading up on it. I am up to this page as of right now.
https://www.tutorialspoint.com/perl/perl_hashes.htm

it is not much different than bash per se, just the syntax for declaring vars and populating arrays is different, and they've got some funny things one can do with it - don't know if BASH can do the same with arrays - I only use arrays sparingly.

as soon as I get a little more into how to's I think I might rewrite one of my bash scripts in perl to see what I can do with it.

when I get done reading I'll go back over these two scripts to absorb their contents better.
and that algorithm

though I still have not found an explanation of this
"my this" and "my that" when declaring variables, as in

Code:

my @sortedarray

which I find strange
but
big thanks for the help!

Laserbeak · 07-08-2017, 11:15 AM

Quote:

Originally Posted by BW-userx

though I still have not found an explanation of this
"my this" and "my that" when declaring variables, as in

Code:

my @sortedarray

which I find strange
but
big thanks for the help!

It is basically declaring the variable in that scope. Without "my" all variables become global variables, and there can be conflicts, especially in large projects and those that use a lot of libraries.

When you put in "use strict;" Perl demands it or another way to specify variable scopes. That's really recommended for all code, but especially code that could be used for production.

BW-userx · 07-08-2017, 11:19 AM

Quote:

Originally Posted by Laserbeak

It is basically declaring the variable in that scope. Without "my" all variables become global variables, and there can be conflicts, especially in large projects and those that use a lot of libraries.

When you put in "use strict;" Perl demands it or another way to specify variable scopes. That's really recommended for all code, but especially code that could be used for production.

so it keeps vars local - I got to this page on subroutines, talking about sub and my for declaring functions.
https://www.tutorialspoint.com/perl/...ubroutines.htm

and looking at yours and the others script and watching Dr. Who -- multi tasking

BW-userx · 07-08-2017, 11:47 AM

Quote:

Originally Posted by scasey

Mayhaps, here's what I just ran...you'll need to change the $working_dir value again.

Code:

#!/usr/bin/perl
## ^^ set to location of your perl

$working_dir="/run/media/userx/250GB/NumberedFiles";
##~ $working_dir=".";   # testing

if ($ARGV[0]) {
	$max = $ARGV[0];   ## get max number from the command line
	$max++;
}
else {
	print "usage is $0 maxvalue\n";
	exit 1;   ## nothing to do without a max value
}

## get list of files name in array  Names are in format of FileName-nnn-xxxxxxx.ext
@files=`find "$working_dir" -type f`;

## remove leading and trailing parts
foreach $file (@files) {
	$file =~ s/^.*?-//;    #remove from beginning to first hyphen
	$file =~ s/-.*$//;	#remove from second hyphen to end
	$existnums[$file]=$file;  #save what's left in array
}

## populate array with all the numbers: 1-input value 
for ($i = 1; $i < $max; $i++) {
    $allnums[$i] = $i;
}

## remove existing numbers from full list
foreach $nbr (@existnums) {
    $allnums[$nbr] = 0
}

## print out the remaining (i.e. missing) numbers
## note, no sorting required because the allnums array is populated in sequence
foreach $nbr  (@allnums) {
	if ($allnums[$nbr] ne 0) {      ## only print the entries that are not -0-
##~ 		print "$allnums[$nbr] ";   ## or 
		print "$allnums[$nbr]\n";  ## to do one per line
	}
}

print "\n";  ## when printing all on one line.

ok, that wasn't working. I fixed it by changing the first stripping of the string

Code:

$file =~ s/.*-//; #removes everything up to and including the last hyphen

because it was only taking out the start of the string, up to the first hyphen, and leaving the rest of the path attached to the filename. There is a hyphen within the path too.
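The difference between the two substitutions comes down to lazy versus greedy matching. A small Python sketch shows it; the path is a made-up example with hyphens in the directory names, and the two function names are only for illustration.

```python
import re

# Hypothetical path with hyphens in the directory names as well as the file name.
EXAMPLE_PATH = "/mnt/My-Disk/Files-Resampled/clip-123.mp4"


def strip_lazy(path):
    """Like s/^.*?-//: the lazy .*? stops at the FIRST hyphen."""
    return re.sub(r"^.*?-", "", path)


def strip_greedy(path):
    """Like s/.*-//: the greedy .* runs on to the LAST hyphen."""
    return re.sub(r"^.*-", "", path)


if __name__ == "__main__":
    print(strip_lazy(EXAMPLE_PATH))    # prints Disk/Files-Resampled/clip-123.mp4
    print(strip_greedy(EXAMPLE_PATH))  # prints 123.mp4
```

With only one hyphen in the file name itself, the greedy version leaves the number at the front of what remains, which is why the changed line works here.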

now both yours and Laserbeak's work
yours

Code:

userx%slackwhere ⚡ scripts ⚡> ./perl-find-missing-numbers.pl 270
162
170
172
173
174
175
181
186
195
196
197
198
245

Laserbeak

Code:

userx%slackwhere ⚡ scripts ⚡> ./perl-number-list
162 is missing!
170 is missing!
172 is missing!
173 is missing!
174 is missing!
175 is missing!
181 is missing!
186 is missing!
195 is missing!
196 is missing!
197 is missing!
198 is missing!
245 is missing!

they both match! woo hoo!

BW-userx · 07-08-2017, 12:20 PM

Quote:

Originally Posted by Laserbeak

It is basically declaring the variable in that scope. Without "my" all variables become global variables, and there can be conflicts, especially in large projects and those that use a lot of libraries.

When you put in "use strict;" Perl demands it or another way to specify variable scopes. That's really recommended for all code, but especially code that could be used for production.

Code:

#!/usr/bin/perl

use strict;
use warnings;

my $working_dir="/run/media/userx/3TB-External/Files-Resampled";

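my @files;   # declared once, outside the loop, so it keeps every entry

opendir(DIR, $working_dir) || die "Can't open $working_dir: $!\n";
  while( defined(my $filename = readdir(DIR)) ){
    push(@files, $filename);   # append this entry to the array
    }
closedir(DIR);

print ("@files\n");   # print the whole list once it is filled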

there probably is a better way to populate an array in perl but this worked and it is interesting nonetheless. push and pop - linked list?

allend · 07-08-2017, 09:38 PM

As schneidz and danielbmartin have already hinted

Code:

seq 1 270 | sort > numbers.txt; ls -1 /run/media/userx/3TB-External/Files-Resampled/*.mp4 | rev | cut -d "." -f2 | cut -d "-" -f1 | rev | sort | comm -3 numbers.txt -
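For readers who find the pipeline dense, here is a rough Python sketch of what its stages do, assuming names with a single dot and the number after the last hyphen (e.g. clip-123.mp4). The paths and function names are invented for the example.

```python
def number_field(path):
    """Mimic rev | cut -d. -f2 | cut -d- -f1 | rev for single-dot names."""
    stem = path.rsplit(".", 1)[0]   # drop the extension
    return stem.rsplit("-", 1)[-1]  # keep what follows the last hyphen


def unique_to_either(expected, found):
    """Mimic comm -3: keep lines that appear in only one of the two lists."""
    return sorted(set(expected) ^ set(found))


if __name__ == "__main__":
    # Hypothetical file list standing in for the ls output.
    paths = [
        "/data/Files-Resampled/clip-1.mp4",
        "/data/Files-Resampled/clip-2.mp4",
        "/data/Files-Resampled/clip-4.mp4",
    ]
    expected = [str(n) for n in range(1, 6)]   # stands in for seq 1 5
    found = [number_field(p) for p in paths]
    print(unique_to_either(expected, found))   # prints ['3', '5']
```

The comparison is on text, as it is with comm, so zero-padded names such as clip-004.mp4 would not match the unpadded output of seq.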

BW-userx · 07-08-2017, 10:25 PM

Quote:

Originally Posted by allend

As schneidz and danielbmartin have already hinted

Code:

seq 1 270 | sort > numbers.txt; ls -1 /run/media/userx/3TB-External/Files-Resampled/*.mp4 | rev | cut -d "." -f2 | cut -d "-" -f1 | rev | sort | comm -3 numbers.txt -

I don't think I could fit all of that on my terminal.

Laserbeak · 07-08-2017, 11:47 PM

Quote:

Originally Posted by BW-userx

there probably is a better way to populate an array in perl but this worked and it is interesting nonetheless. push and pop - linked list?

The only four ways I know are using push and doing one of these:

Code:

# TO ADD TO THE BACK

@thearray = (@thearray, $newelement);

# OR TO PUT AT THE FRONT

@thearray = ($newelement, @thearray);

#OR

unshift @thearray,  $newelement; # places $newelement as the first element of the array

I think push and unshift are faster than the equivalent examples that manually create a new array.

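Actually there is another way, assigning by index. Note that it gives a warning here: the loop starts at 1, so $array[0] is never set, and join complains about the uninitialized value: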

Code:

#!/usr/bin/perl

use strict;
use warnings;

our @array = ();
for (1..10) {
  $array[$_] = $_;
}
print join("\n", @array) , "\n";

Also see that I used "our" instead of "my"; that is how to explicitly declare a global (package) variable. Declare it with "our" at the top of each file in a project (in the same package) and they will share that variable.

Laserbeak · 07-09-2017, 12:33 AM

I meant to edit this into the last post, but it ended up as a new post.
This is a very simple example of how "our" works.

Code:

#!/usr/bin/perl

use strict;
use warnings;

our @array = ();
for (1..10) {
  $array[$_] = $_;
}

require "printest.pl";
exit(0);

#-------------------
#PRINTEST.PL TEXT:

#!/usr/bin/perl

use strict;
use warnings;

our @array;

print join ("\n", @array), "\n";

# OUTPUT

1
2
3
4
5
6
7
8
9
10

scasey · 07-09-2017, 04:48 PM

There are also excellent tutorials at http://learn.perl.org/

BW-userx · 07-09-2017, 06:20 PM

Quote:

Originally Posted by scasey

There are also excellent tutorials at http://learn.perl.org/

bookmarked
thanks

Sefyir · 07-10-2017, 07:28 PM

This appears to have been [SOLVED] and the focus has been on perl.. but for good fun I implemented it in python3.
At the heart is the set.difference(set) comparison, that quickly identifies what is the difference between the two sets (the range from 1 to max files found matching regex and all available found files).
It also auto-detects the maximum range needed. This is not a perfect design since what happens if the topmost file is deleted? It won't be detected. However, this now allows you to check n directories with n files.
I tested it on 4690 directories with 999 files of each (max allowed). Some ended up being empty (pushed some kind of limit).

Code:

for i in {1..4690}; do mkdir "${i}_a"; done
for i in */; do touch "$i"/file-{000..999}-adfg324.ext; done
for i in */; do rm "$i"/file-$(echo $RANDOM | cut -b1-3)-adfg324.ext; done # Randomly remove a file from each directory

Code:

$ time ./numbersequencer.py number_testing/*/ # total 4690 directories
Missing Files in /home/user/number_testing/1000_a:
239
Missing Files in /home/user/number_testing/1001_a:
223
Missing Files in /home/user/number_testing/1002_a:
882
...
real	0m8.508s

Code:

$ rm 2407_a/file-{497,987,999}-adfg324.ext
$ ./numbersequencer.py number_testing/2407_a/
Missing Files in /home/user/number_testing/2407_a:
497
987

Code:

#!/usr/bin/env python3

from __future__ import print_function, division

import argparse
import os
import re
import sys

class SequencyConsistency():
    def __init__(self, sequence, regex_def_grp=None):
        if regex_def_grp:
            self.regex_compiled, self.regex_group_num = regex_def_grp
        else:
            self.regex_compiled, self.regex_group_num = (re.compile(r'(\d+)'), 1)

        self.sequence = sequence
        self.missing_sequencies = self.__missing_number_sequence()

    def __missing_number_sequence(self):
        self.matches = {str(self.regex_compiled.search(sequence).group(self.regex_group_num))
                        for sequence in self.sequence
                        if self.regex_compiled.search(sequence) is not None}

        if self.matches:
            # Generate set of numbers from 1 to highest regex detected
            try:
                max_range = max(int(match) for match in self.matches)
            except ValueError:
                print('Regex must match an integer.', end='\n\n')
                raise

            full_range = set(str(num).rjust(len(str(max_range)), '0') 
                             for num in range(1, max_range + 1))
            return full_range.difference(self.matches)
        
    def print_missing_sequencies(self, reverse=False):
        if getattr(self, 'missing_sequencies', None):
            return sorted(self.__missing_number_sequence(), reverse=reverse)


class DirectoryConsistency(SequencyConsistency):
    def __init__(self, directory, regex_def_grp=None):
        try:
            self.sequence = os.listdir(directory)
        except (FileNotFoundError, NotADirectoryError, PermissionError) as err:
            # Exit class if directory indicated is incorrect
            print(err, file=sys.stderr)
            return None

        # Inherit from SequencyConsistency
        SequencyConsistency.__init__(self, self.sequence, regex_def_grp)

def main():
    # Create commandline flags
    parser = argparse.ArgumentParser(description='Detect missing portion of sequency in given sequency')
    parser.add_argument('-e', '--regexp',
            type=str,
            required=False,
            default=r'(\d+)',
            help=r'Set regex. Defaults to (\d+)')
    parser.add_argument('-g', '--group',
            type=int,
            required=False,
            default=1,
            help=r'Set regex matching group to identify sequence. --group 3 will match (\d+) in regexp of (\w)(-)(\d+). Each () identifies a group')
    args, other_args = parser.parse_known_args()

    directories = other_args if other_args else sys.stdin
    for directory in directories:
        current_dir = DirectoryConsistency(directory.strip(),
                                           regex_def_grp=(re.compile(args.regexp), args.group))

        missing_in_dir = current_dir.print_missing_sequencies()

        if missing_in_dir:
            print('Missing from {directory}:'.format(directory=repr(os.path.abspath(directory.strip()))))
            print('{sequences}'.format(sequences='\n'.join(seq for seq in missing_in_dir)))

if __name__ == '__main__':
    main()