LinuxQuestions.org - tough one: how do you find patterns/sequences in file names?

tough one: how do you find patterns/sequences in file names?

Say you have a directory with these files:

file.1.foo
file.2.foo
file.3.foo
bar.100.gah
bar.101.gah
bar.102.gah
someFile1
otherThing2

How would you go about finding that there are two sequences there, plus two files that are not part of any sequence? That is:

seq: file.#.foo 1-3
seq: bar.#.gah 100-102
sngl: someFile1
sngl: otherThing2

I have a couple of really complex examples, but it doesn't seem like it should take over 300 lines of code to do this. Does anyone know of a good way to find this info? Some super cool regex or something?

Language doesn't matter much as long as it's nice & tidy. If I had my druthers, the answer would be done in Python, but I can translate if need be.

Have you read the man page for grep?

man grep

If I understand your question correctly, then I believe that grep is what you are looking for.

Regards

What about the case where you have something like this:

Code:

file.1.middle.3.end
file.1.middle.4.end
file.1.middle.5.end
file.2.middle.6.end
file.3.middle.7.end

What desired output do you have for this list?

Also, how would you handle something like this (where a number in a range is missing)?

Code:

file.1.end
file.2.end
file.5.end
file.6.end

I really hate Perl, and I really suck with Perl.
That being said, here's my solution.

Code:

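use strict;
use warnings;

# Maps each pattern (digit runs replaced by '#') to [count, first, last].
my (%files, @single);
while (<>) {
        chomp;
        next unless length;     # skip blank lines
        # With /g, $1 ends up holding the last run of digits in the name.
        my $num = s/([0-9]+)/#/g ? $1 : undef;
        my $info = $files{$_} ||= [0, $num];
        $info->[0]++;
        $info->[2] = $num;      # last number seen; assumes sorted input
}
# Sorted keys give a repeatable output order.
for (sort keys %files) {
        my ($count, $first, $last) = @{ $files{$_} };
        if ($count > 1) {
                print "seq: $_ $first-$last\n";
        } else {
                # Put the number back; names without digits are left alone.
                s/#/$first/ if defined $first;
                push @single, "sngl: $_\n";
        }
}
print for @single;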

I think this is what you wanted, though it is very basic and will have issues with some of the things matthewg42 mentioned (more than one number in a name, missing numbers in a set). Oh, and it will garble the name if '#' is part of it :/. I know this is a very sloppy job, but it gets the job done to some extent. If you need a more complicated solution, provide more complicated examples.

Code:

$ cat list
file.1.foo
file.2.foo
file.3.foo
bar.100.gah
bar.101.gah
bar.102.gah
someFile1
otherThing2

$ perl findpattern.pl list
seq: bar.#.gah 100-102
seq: file.#.foo 1-3
sngl: otherThing2
sngl: someFile1
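
Since the original question asked for Python, here is a minimal sketch of the same idea: find the number in each name and group the names by what surrounds it. It is not a translation of the Perl script. The function name `find_sequences` is made up for illustration, and the sketch assumes that only the last run of digits in a name varies. Keying on a (prefix, suffix) pair instead of a string containing '#' means a name that already has a '#' in it is not garbled.

```python
import re
from collections import defaultdict

# Lazy prefix + "no digits after the number" pins the match to the last digit run.
LAST_NUMBER = re.compile(r"^(.*?)(\d+)(\D*)$")

def find_sequences(names):
    """Group names by the text around their last number (hypothetical helper)."""
    groups = defaultdict(list)            # (prefix, suffix) -> [(number, name), ...]
    singles = []
    for name in names:
        m = LAST_NUMBER.match(name)
        if m:
            prefix, digits, suffix = m.groups()
            groups[(prefix, suffix)].append((int(digits), name))
        else:
            singles.append(name)          # no digits at all
    sequences = []
    for (prefix, suffix), members in groups.items():
        if len(members) > 1:
            numbers = sorted(n for n, _ in members)
            # '#' is only used for display here, never as a lookup key.
            sequences.append((f"{prefix}#{suffix}", numbers[0], numbers[-1]))
        else:
            singles.append(members[0][1])
    return sorted(sequences), sorted(singles)

names = ["file.1.foo", "file.2.foo", "file.3.foo",
         "bar.100.gah", "bar.101.gah", "bar.102.gah",
         "someFile1", "otherThing2"]
seqs, singles = find_sequences(names)
for pattern, lo, hi in seqs:
    print(f"seq: {pattern} {lo}-{hi}")
for name in singles:
    print(f"sngl: {name}")
```

Like the Perl version, this reports only the lowest and highest number of each group, so a gap inside a sequence goes unnoticed.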

Just one way out of many, using GNU awk:

Code:

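ls -1 | awk 'BEGIN{FS="."}
{
    x[$1]++                   # number of files per prefix (text before the first dot)
    y[$1] = y[$1]","$2        # comma-separated list of the second fields
    name[$1] = $0             # full name, printed when the prefix occurs only once
}
END{
    for(i in x) {
          if ( x[i] > 1 ){
              sub(/^,/,"",y[i])
              j=split(y[i],calc,",")
              b=asort(calc,dest)        # asort is a gawk extension
              print "seq: " i,dest[1]"-"dest[b]
          }
          else{
              print "sngl: " name[i]
          }
    }
}
'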

Wow, thanks guys! I wasn't expecting to get answers on this one.

In answer to matthewg42's questions:
All the file.1 names form one group; the file.2 and file.3 names are singles.
If a number is missing, you can treat it as two sequences. If the report is on the smart side, it would say something like:

foo.#.end 1-3,5-6

but

foo.#.end 1-3
foo.#.end 5-6

is also acceptable
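
Either report format comes down to splitting a group's numbers into runs of consecutive values. A minimal Python sketch of that step, assuming the numbers have already been extracted as integers (`to_ranges` and `format_ranges` are names made up for this example):

```python
def to_ranges(numbers):
    """Collapse integers into sorted (start, end) runs of consecutive values."""
    ranges = []
    for n in sorted(set(numbers)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)    # extend the current run
        else:
            ranges.append((n, n))              # gap found: start a new run
    return ranges

def format_ranges(ranges):
    """Render runs as '1-3,5-6'; a run of one value is printed as a bare number."""
    return ",".join(f"{a}-{b}" if a != b else str(a) for a, b in ranges)

runs = to_ranges([1, 2, 3, 5, 6])
print("foo.#.end", format_ranges(runs))        # the one-line report
for a, b in runs:                              # the one-line-per-run report
    print(f"foo.#.end {a}-{b}")
```

Keeping the runs as a list of pairs lets the same data feed either layout.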

'#' is arbitrary - it could be anything.. '#' makes sense. '%04d' makes a lot of sense.
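
A printf-style placeholder such as '%04d' has the advantage that the pattern can be turned back into a file name. A rough Python sketch of choosing one, assuming the digits of each name are still available as strings; `placeholder` and the `shot.*.exr` names are invented for illustration:

```python
def placeholder(digit_strings):
    """Pick '%0Nd' when every number is zero-padded to the same width, else '%d'."""
    widths = {len(d) for d in digit_strings}
    padded = any(len(d) > 1 and d.startswith("0") for d in digit_strings)
    if padded and len(widths) == 1:
        return f"%0{widths.pop()}d"
    return "%d"

# Hypothetical zero-padded sequence: shot.0001.exr, shot.0002.exr, shot.0010.exr
pattern = "shot." + placeholder(["0001", "0002", "0010"]) + ".exr"
print(pattern)        # shot.%04d.exr
print(pattern % 7)    # shot.0007.exr
```

A literal '%' elsewhere in the name would have to be escaped as '%%' before the pattern is used for formatting.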

but yeah... these give me great starting points. Thanks!