[SOLVED] sequential : how to find the missing numbers within a sequence of files that have sequential numbers attached to them?
is from Laserbeak, and it's being used to test if the file in $this_file matches the pattern required to process it. You have (I think) correctly modified the regex to match the actual pattern [1 hyphen, mp4 extension].
If I understand what Laserbeak is doing (and I could be wrong about this), s/he is populating a hash with the values of the existing numbers in the file names, using the number as the key, and setting the value to 1.
Then sort the hash by the key values. Then iterate over the length of the hash and output the numbers that don't exist in the hash. I haven't tried it yet, but it should work. I admit to having some difficulty understanding/using hash processing. I'm going to play with this script for my own edification. I think there's also something for me to learn about capturing data from a regexp match. Fun!
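For anyone following along outside Perl, the hash approach described above can be sketched roughly like this in Python. This is only an illustration, not Laserbeak's script: the function name `missing_from_hash`, the sample file names, and the pattern (one hyphen, mp4 extension) are assumptions made up for the example, and it walks from 1 to the highest number found rather than over the length of the hash.

```python
import re

def missing_from_hash(filenames, pattern=r"-(\d+)\.mp4$"):
    """Return the numbers from 1 to the highest number seen that have no file."""
    seen = {}  # plays the role of the Perl hash: number -> 1
    for name in filenames:
        match = re.search(pattern, name)
        if match:  # skip names that do not fit the pattern
            seen[int(match.group(1))] = 1
    if not seen:
        return []
    # Walk 1..highest key and report every number that is not a key.
    return [n for n in range(1, max(seen) + 1) if n not in seen]

# Hypothetical file names, for illustration only.
names = ["clip-1.mp4", "clip-2.mp4", "clip-4.mp4", "clip-7.mp4", "notes.txt"]
for n in missing_from_hash(names):
    print(f"{n} is missing!")
```

The dictionary lookup (`n not in seen`) is what makes this cheap: each check is a single hash probe, so there is no need to sort or scan the list of existing numbers.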
My script captures the list of existing numbers in an array [@existnums], and captures the list of all numbers from 1 to entered value (270) in another array [@allnums]. Then iterate over the allnums array and set the entries that match existnums to 0...then print the numbers that are not 0. This is an adaptation of an application I wrote to draw cards from a tarot deck...as each card was drawn, it was saved into "drawn" array, then before the next card was drawn the drawn cards are removed from the "all cards" array -- so the app wouldn't draw the same card twice. My customer said the application gave her as random a draw as using the actual cards.
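The two-array idea described above could look something like this in Python. It is a rough sketch under assumptions, not the actual script: the function name `missing_by_zeroing` and the sample numbers are invented for the example.

```python
def missing_by_zeroing(existnums, max_value):
    """Zero out every existing number in a 1..max_value list and return what is left."""
    # Index i holds the value i; index 0 is unused and already 0.
    allnums = list(range(max_value + 1))
    for n in existnums:
        if 1 <= n <= max_value:  # ignore anything outside the requested range
            allnums[n] = 0
    # No sorting needed: allnums was populated in sequence.
    return [n for n in allnums if n != 0]

# Hypothetical input: files 1, 2, 4 and 7 exist, and we expect numbers up to 8.
print(missing_by_zeroing([1, 2, 4, 7], 8))
```

Because the caller supplies the maximum, this version also reports a missing number at the very end of the range.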
I'm glad you have what you need now. I hope we've demonstrated the value of learning regexp. Those same patterns could be applied using sed in a bash script, but I'm even fuzzier about bash arrays and hashes.
I'm a "he", and you hit the nail on the head as far as the algorithm goes. In fact, that algorithm is used all the time in real-world Perl applications; I can't imagine how many times I've used something like that. You would be wise to learn it.
Just went over it. Very elegant. I'll add it to my library for sure. Thank you.
It is not much different from bash per se; just the syntax for declaring vars and populating arrays is different, and there are some funny things one can do with it. I don't know if bash can do the same with arrays, I only use arrays sparingly.
As soon as I get a little more into the how-tos I think I might rewrite one of my bash scripts in Perl to see what I can do with it.
When I get done reading I'll go back over these two scripts, and that algorithm, to absorb their contents better.
Though I still have not found an explanation for this "my this" and "my that" when declaring or calling variables, as in
Code:
my @sortedarray
which I find strange, but big thanks for the help!
It is basically declaring the variable in that scope. Without "my" all variables become global variables, and there can be conflicts, especially in large projects and those that use a lot of libraries.
When you put in "use strict;" Perl demands it or another way to specify variable scopes. That's really recommended for all code, but especially code that could be used for production.
Mayhaps, here's what I just ran...you'll need to change the $working_dir value again.
Code:
#!/usr/bin/perl
## ^^ set to location of your perl
$working_dir = "/run/media/userx/250GB/NumberedFiles";
##~ $working_dir = "."; # testing
if ($ARGV[0]) {
    $max = $ARGV[0];    ## get max number from the command line
    $max++;
}
else {
    print "usage is $0 maxvalue\n";
    exit 1;             ## nothing to do without a max value
}
## get list of file names in array. Names are in format of FileName-nnn-xxxxxxx.ext
@files = `find "$working_dir" -type f`;
## remove leading and trailing parts
foreach $file (@files) {
    chomp $file;                # drop the trailing newline from find's output
    $file =~ s/^.*?-//;         # remove from beginning to first hyphen
    $file =~ s/-.*$//;          # remove from second hyphen to end
    $existnums[$file] = $file;  # save what's left in array
}
## populate array with all the numbers: 1-input value
for ($i = 1; $i < $max; $i++) {
    $allnums[$i] = $i;
}
## remove existing numbers from full list
foreach $nbr (@existnums) {
    $allnums[$nbr] = 0;
}
## print out the remaining (i.e. missing) numbers
## note, no sorting required because the allnums array is populated in sequence
foreach $nbr (@allnums) {
    if ($nbr) {                 ## only print the entries that are not 0 (or unset)
        ##~ print "$nbr ";      ## to print all on one line, or
        print "$nbr\n";         ## to do one per line
    }
}
print "\n"; ## when printing all on one line.
OK, that wasn't working. I fixed it by changing the first stripping of the string to
Code:
$file =~ s/.*-//; # removes everything up to and including the last hyphen
because it was still just taking out part of the beginning of the string and leaving part of the path attached to the filename; there is a hyphen within the path too.
userx%slackwhere ⚡ scripts ⚡> ./perl-number-list
162 is missing!
170 is missing!
172 is missing!
173 is missing!
174 is missing!
175 is missing!
181 is missing!
186 is missing!
195 is missing!
196 is missing!
197 is missing!
198 is missing!
245 is missing!
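The difference between stripping up to the first hyphen and up to the last one can be seen in a small Python sketch. The directory is the one used in this thread; the file name `Clip-162.mp4` is hypothetical, and the final `re.search` line is just one alternative, not something from the scripts above.

```python
import re

# Directory taken from the thread; the file name "Clip-162.mp4" is hypothetical.
path = "/run/media/userx/3TB-External/Files-Resampled/Clip-162.mp4"

# Non-greedy .*? stops at the FIRST hyphen, which sits inside "3TB-External".
to_first_hyphen = re.sub(r"^.*?-", "", path, count=1)

# Greedy .* runs on to the LAST hyphen, leaving only "number.extension".
to_last_hyphen = re.sub(r"^.*-", "", path, count=1)

# Capturing the digits directly avoids a second stripping step.
match = re.search(r"-(\d+)\.[^./]+$", path)
number = int(match.group(1)) if match else None

print(to_first_hyphen)
print(to_last_hyphen)
print(number)
```

The greedy version only works when the number sits after the last hyphen in the name (one hyphen, then the number, then the extension). For names with a second hyphen after the number, stripping the directory part first would be the safer route.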
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $working_dir = "/run/media/userx/3TB-External/Files-Resampled";
my @files;    # declared outside the loop so it keeps every entry
opendir(my $dh, $working_dir) || die "Can't open $working_dir: $!\n";
while (defined(my $filename = readdir($dh))) {
    push(@files, $filename);    # append each directory entry
}
closedir($dh);
print "$_\n" for @files;
There probably is a better way to populate an array in Perl, but this worked and it is interesting nonetheless. push and pop - linked list?
The only four ways I know are using push and doing this:
Code:
# TO ADD TO THE BACK
@thearray = (@thearray, $newelement);
# OR TO PUT AT THE FRONT
@thearray = ($newelement, @thearray);
#OR
unshift @thearray, $newelement; # places $newelement as the first element of the array
I think push and unshift are faster than the equivalent examples that manually create a new array.
Actually there is another way, assigning by index, but it gives a warning here: the loop starts at index 1, so $array[0] is left undefined when join runs:
Code:
#!/usr/bin/perl
use strict;
use warnings;
our @array = ();
for (1..10) {
$array[$_] = $_;
}
print join("\n", @array) , "\n";
Also see that I used "our" instead of "my"; that is how to explicitly declare a global variable. Put it at the top of each file in a project and they will share that variable.
Last edited by Laserbeak; 07-09-2017 at 12:26 AM.
Reason: Better print
This appears to have been [SOLVED] and the focus has been on Perl, but for good fun I implemented it in python3.
At the heart is the set.difference(set) comparison, that quickly identifies what is the difference between the two sets (the range from 1 to max files found matching regex and all available found files).
It also auto-detects the maximum range needed. This is not a perfect design since what happens if the topmost file is deleted? It won't be detected. However, this now allows you to check n directories with n files.
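To make the set comparison and its blind spot concrete, here is a stripped-down sketch. It is not the script itself: the function `missing_by_set_difference` and its optional `expected_max` parameter are assumptions added for illustration.

```python
def missing_by_set_difference(found, expected_max=None):
    """Return the sorted numbers in 1..max that are absent from `found`.

    If expected_max is None the upper bound is auto-detected from `found`.
    """
    found = set(found)
    if not found and expected_max is None:
        return []
    upper = expected_max if expected_max is not None else max(found)
    full_range = set(range(1, upper + 1))
    return sorted(full_range.difference(found))

# A gap in the middle is reported.
print(missing_by_set_difference([1, 2, 4, 5]))
# File 5 was deleted: the auto-detected range stops at 4, so nothing is reported.
print(missing_by_set_difference([1, 2, 3, 4]))
# Supplying the expected maximum by hand catches the deleted topmost file.
print(missing_by_set_difference([1, 2, 3, 4], expected_max=5))
```

Passing the maximum explicitly, as the Perl script earlier in the thread does with its command-line argument, trades the convenience of auto-detection for catching a missing last file.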
I tested it on 4690 directories with 999 files each (max allowed). Some ended up being empty (I pushed some kind of limit).
Code:
for i in {1..4690}; do mkdir "${i}_a"; done   # braces needed, otherwise bash looks for a variable named i_a
for i in */; do touch "$i"/file-{000..999}-adfg324.ext; done
for i in */; do rm "$i"/file-$(echo $RANDOM | cut -b1-3)-adfg324.ext; done # Randomly remove a file from each directory
Code:
$ time ./numbersequencer.py number_testing/*/ # total 4690 directories
Missing Files in /home/user/number_testing/1000_a:
239
Missing Files in /home/user/number_testing/1001_a:
223
Missing Files in /home/user/number_testing/1002_a:
882
...
real 0m8.508s
Code:
#!/usr/bin/env python3
from __future__ import print_function, division
import argparse
import os
import re
import sys


class SequencyConsistency():
    def __init__(self, sequence, regex_def_grp=None):
        if regex_def_grp:
            self.regex_compiled, self.regex_group_num = regex_def_grp
        else:
            self.regex_compiled, self.regex_group_num = (re.compile(r'(\d+)'), 1)
        self.sequence = sequence
        self.missing_sequencies = self.__missing_number_sequence()

    def __missing_number_sequence(self):
        # Collect the matched number (as a string) from every entry that matches
        self.matches = {str(self.regex_compiled.search(sequence).group(self.regex_group_num))
                        for sequence in self.sequence
                        if self.regex_compiled.search(sequence) is not None}
        if self.matches:
            # Generate set of numbers from 1 to highest regex detected
            try:
                max_range = max(int(match) for match in self.matches)
            except ValueError:
                print('Regex must match an integer.', end='\n\n')
                raise
            # Zero-pad to the width of the highest number so strings compare equal
            full_range = set(str(num).rjust(len(str(max_range)), '0')
                             for num in range(1, max_range + 1))
            return full_range.difference(self.matches)

    def print_missing_sequencies(self, reverse=False):
        if getattr(self, 'missing_sequencies', None):
            return sorted(self.__missing_number_sequence(), reverse=reverse)


class DirectoryConsistency(SequencyConsistency):
    def __init__(self, directory, regex_def_grp=None):
        try:
            self.sequence = os.listdir(directory)
        except (FileNotFoundError, NotADirectoryError, PermissionError) as err:
            # Exit class if directory indicated is incorrect
            print(err, file=sys.stderr)
            return None
        # Inherit from SequencyConsistency
        SequencyConsistency.__init__(self, self.sequence, regex_def_grp)


def main():
    # Create commandline flags
    parser = argparse.ArgumentParser(description='Detect missing portion of sequence in given sequence')
    parser.add_argument('-e', '--regexp',
                        type=str,
                        required=False,
                        default=r'(\d+)',
                        help=r'Set regex. Defaults to (\d+)')
    parser.add_argument('-g', '--group',
                        type=int,
                        required=False,
                        default=1,
                        help=r'Set regex matching group to identify sequence. --group 3 will match (\d+) in regexp of (\w)(-)(\d+). Each () identifies a group')
    args, other_args = parser.parse_known_args()
    # Directories come from the command line, or from stdin if none were given
    directories = other_args if other_args else sys.stdin
    for directory in directories:
        current_dir = DirectoryConsistency(directory.strip(),
                                           regex_def_grp=(re.compile(args.regexp), args.group))
        missing_in_dir = current_dir.print_missing_sequencies()
        if missing_in_dir:
            print('Missing from {directory}:'.format(directory=repr(os.path.abspath(directory.strip()))))
            print('{sequences}'.format(sequences='\n'.join(seq for seq in missing_in_dir)))


if __name__ == '__main__':
    main()