LinuxQuestions.org


BrianK 10-17-2007 10:03 PM

tough one: how do you find patterns/sequences in file names?
 
say you have a dir with files:

file.1.foo
file.2.foo
file.3.foo
bar.100.gah
bar.101.gah
bar.102.gah
someFile1
otherThing2


... how would you go about finding that there are 2 sequences there and two files that are not part of any sequence? i.e.:

seq: file.#.foo 1-3
seq: bar.#.gah 100-102
sngl: someFile1
sngl: otherThing2

I have a couple really complex examples, but it doesn't seem like it should take over 300 lines of code to do this. Does anyone know of a good way to find this info? Some super cool regex or something?

Language doesn't matter much as long as it's nice & tidy. If I had my druthers, the answer would be done in Python, but I can translate if need be.

ray_80 10-17-2007 10:20 PM

Have you read the man page for grep?

man grep


If I understand your question correctly, then I believe that grep is what you are looking for.

Regards

matthewg42 10-17-2007 10:36 PM

What about the case where you have something like this:
Code:

file.1.middle.3.end
file.1.middle.4.end
file.1.middle.5.end
file.2.middle.6.end
file.3.middle.7.end

What desired output do you have for this list?

Also, how would you handle something like this (where a number in a range is missing)?
Code:

file.1.end
file.2.end
file.5.end
file.6.end


angrybanana 10-17-2007 10:47 PM

I really hate Perl, and I really suck with Perl.
That being said, here's my solution.

Code:

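use strict;
use warnings;

# %files maps a pattern (each run of digits replaced by '#')
# to [count, lowest number, highest number].
my (%files, @single);
while (<>) {
        chomp;
        # After s///g, $1 holds the last run of digits that was replaced.
        my $num = s/([0-9]+)/#/g ? $1 : undef;
        my $info = $files{$_} ||= [0, $num, $num];
        $info->[0]++;
        next unless defined $num;
        $info->[1] = $num if $num < $info->[1];
        $info->[2] = $num if $num > $info->[2];
}
for (sort keys %files) {
        my $info = $files{$_};
        if ($info->[0] > 1) {
                print "seq: $_ $info->[1]-$info->[2]\n";
        } else {
                # Put the number back to recover the original name.
                my $name = $_;
                $name =~ s/#/$info->[1]/ if defined $info->[1];
                push @single, "sngl: $name\n";
        }
}
print for @single;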

I think this is what you wanted, though it is very basic and will have issues with some of the things matthewg42 mentioned (more than one number, missing numbers in a set). Oh, and it will garble the name if '#' is part of the name :/. I know this is a very sloppy job, but it gets the job done to some extent. If you need a more complicated solution, provide more complicated examples.

Code:

$ cat list
file.1.foo
file.2.foo
file.3.foo
bar.100.gah
bar.101.gah
bar.102.gah
someFile1
otherThing2

$ perl findpattern.pl list
seq: bar.#.gah 100-102
seq: file.#.foo 1-3
sngl: otherThing2
sngl: someFile1
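
Since the original question asked for Python, here is a minimal sketch of the same mask-and-group idea. It is not a translation of the Perl above: it assumes only the last run of digits in a name is the counter, and the function name `find_sequences` and its `placeholder` parameter are made up for illustration. Like the Perl version, it ignores gaps in the numbering.

```python
import re
from collections import defaultdict

def find_sequences(names, placeholder="#"):
    """Group names by pattern, masking the LAST run of digits in each name."""
    groups = defaultdict(list)   # pattern -> [(number, original name), ...]
    singles = []
    for name in names:
        # prefix (lazy), last digit run, trailing non-digits up to the end
        m = re.match(r"(.*?)(\d+)(\D*)$", name)
        if m is None:            # no digits at all: cannot be part of a sequence
            singles.append(name)
            continue
        prefix, digits, suffix = m.groups()
        groups[prefix + placeholder + suffix].append((int(digits), name))

    report = []
    for pattern in sorted(groups):
        members = sorted(groups[pattern])
        if len(members) > 1:
            report.append(f"seq: {pattern} {members[0][0]}-{members[-1][0]}")
        else:
            singles.append(members[0][1])
    report.extend(f"sngl: {name}" for name in sorted(singles))
    return report

if __name__ == "__main__":
    listing = ["file.1.foo", "file.2.foo", "file.3.foo",
               "bar.100.gah", "bar.101.gah", "bar.102.gah",
               "someFile1", "otherThing2"]
    print("\n".join(find_sequences(listing)))
```

Because the original name is stored alongside each number, singles are reported untouched even if they contain the placeholder character.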


ghostdog74 10-18-2007 01:43 AM

Just one way out of many, with GNU awk:
Code:

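ls -1 | awk 'BEGIN{FS="."}
{
    x[$1]++                  # number of files sharing this prefix
    y[$1] = y[$1]","$2       # comma-separated list of their numbers
    z[$1] = $0               # full name, used when the file is a single
}
END{
    for(i in x) {
          if ( x[i] > 1 ){
              sub(/^,/,"",y[i])
              j=split(y[i],calc,",")
              b=asort(calc,dest)    # asort() is a gawk extension
              print "seq: " i,dest[1]"-"dest[b]
          }
          else{
              print "sngl: " z[i]
          }
    }
}
'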


BrianK 10-18-2007 12:33 PM

Wow, thanks guys! I wasn't expecting to get answers on this one.

In answer to matthewg42's questions:
All the file.1's are one group; the file.2 & file.3 are singles.
If a number is missing, you can consider it two sequences. If the report is on the smart side, it would report something like:

foo.#.end 1-3,5-6

but

foo.#.end 1-3
foo.#.end 5-6


is also acceptable
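
Both report styles come down to splitting the sorted numbers into runs of consecutive values. A rough Python sketch, with the helper names `to_ranges` and `format_ranges` invented here for illustration:

```python
def to_ranges(numbers):
    """Collapse integers into (start, end) runs of consecutive values."""
    runs = []
    for n in sorted(set(numbers)):
        if runs and n == runs[-1][1] + 1:
            runs[-1][1] = n          # extends the current run
        else:
            runs.append([n, n])      # gap found: start a new run
    return [(start, end) for start, end in runs]

def format_ranges(numbers):
    """Render runs as '1-3,5-6'; a run of one number is printed alone."""
    return ",".join(f"{a}-{b}" if a != b else str(a)
                    for a, b in to_ranges(numbers))

if __name__ == "__main__":
    # One line per pattern, the "smart" form:
    print("foo.#.end", format_ranges([1, 2, 3, 5, 6]))   # foo.#.end 1-3,5-6
    # One line per run, the other acceptable form:
    for a, b in to_ranges([1, 2, 3, 5, 6]):
        print(f"foo.#.end {a}-{b}")
```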

'#' is arbitrary - it could be anything.. '#' makes sense. '%04d' makes a lot of sense.
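
A printf-style placeholder such as '%04d' only round-trips if the numbers really are zero-padded to a fixed width. One possible way to choose it, sketched in Python; the function name `printf_placeholder` and the fallback to plain '%d' are assumptions made for this example:

```python
def printf_placeholder(digit_strings):
    """Pick '%0Nd' when every number is zero-padded to the same width N, else '%d'."""
    widths = {len(d) for d in digit_strings}
    padded = any(len(d) > 1 and d.startswith("0") for d in digit_strings)
    if len(widths) == 1 and padded:
        return f"%0{widths.pop()}d"
    return "%d"

if __name__ == "__main__":
    placeholder = printf_placeholder(["0001", "0002", "0010"])
    pattern = "file." + placeholder + ".foo"
    print(pattern)        # file.%04d.foo
    print(pattern % 7)    # file.0007.foo
```

The digit strings have to be kept as text for this; converting them to integers first would throw away the padding.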

but yeah... these give me great starting points. Thanks!
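
For names with more than one number, one reading of the answer above is that each numeric field is tried in turn as the counter, and a group only counts as a sequence when at least two names agree on everything else. A Python sketch under that assumption; the name `group_multi` and the "largest group wins" tie-break are choices made for this example, not something settled in the thread:

```python
import re
from collections import defaultdict

def group_multi(names, placeholder="#"):
    """Try every digit run as the counter; return (sequences, singles)."""
    candidates = defaultdict(list)   # masked pattern -> [(number, name), ...]
    for name in names:
        for m in re.finditer(r"\d+", name):
            pattern = name[:m.start()] + placeholder + name[m.end():]
            candidates[pattern].append((int(m.group()), name))

    sequences, claimed = {}, set()
    # Largest candidate groups first; ties broken by pattern for determinism.
    ordered = sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for pattern, members in ordered:
        free = sorted((n, name) for n, name in members if name not in claimed)
        if len(free) > 1:
            sequences[pattern] = [n for n, _ in free]
            claimed.update(name for _, name in free)
    singles = [name for name in names if name not in claimed]
    return sequences, singles

if __name__ == "__main__":
    seqs, singles = group_multi([
        "file.1.middle.3.end", "file.1.middle.4.end", "file.1.middle.5.end",
        "file.2.middle.6.end", "file.3.middle.7.end",
    ])
    print(seqs)      # {'file.1.middle.#.end': [3, 4, 5]}
    print(singles)   # ['file.2.middle.6.end', 'file.3.middle.7.end']
```

Each name is claimed by at most one sequence, so a file that could fit two patterns is reported only under the larger one.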


All times are GMT -5.