Finding matching patterns in 2 files

herveld · 11-19-2008, 08:35 AM

Quote:

Originally Posted by PTrenholme

Output for your test files:

Code:

[tmp]$ gawk -v kmin=2 -v field=2 -f match2.awk file1 file2

Match:  file1   file2   Value
        4       3       bbb
        5       4       aaa
        6       5       ccc

Thanks again, looks pretty good!

Just wondering how come it gives no result for kmin=4 ?
Output should be :
Match: file1 file2 Value
3 2 aaa
4 3 bbb
5 4 aaa
6 5 ccc

PTrenholme · 11-20-2008, 09:04 AM

Ah, my bad!

I started with the first program I wrote which matched single lines, and that logic constrains sequences to have unique values. I didn't change that part when I modified the code, so . . .

Anyhow, I'll have to think about it some more.

Note, please, that you're now seeing that programmers, like computers, do what they understand your problem to be, not what you think they understand. An example of why they're always complaining about clients "changing the specifications" during the course of a project. Most often, what they're really saying is that they and the client failed to understand each other at the start of the project.

Aw well, let me see if I can get it right. More later.

PTrenholme · 11-24-2008, 09:29 PM

Sorry, this took a while.

Here's the output from my latest iteration on trying to guess what you needed. Note that I added a couple lines to your test "file2" so you could see what happens when one sequence on the letf matches several sequences on the right.

One thing that wasn't clear was whether or not you wanted to see only the maximal length matching sequences, or all matches of more than kmin characters. In this code I opted to show all the matching sequences.

First, the test files:

Code:

$ cat file1
1 111
2 222
3 aaa
4 bbb
5 aaa
6 ccc
7 333
$ cat file2
1 444
2 aaa
3 bbb
4 aaa
5 ccc
6 555
7 666
8 aaa
9 ccc

And here's the output produced:

Code:

$ gawk -f match5 -v field=2 -v kmin=2 file1 file2
File "file1" contained 7 records.
File "file2" contained 9 records.

7 sequences matched between "file1" and "file2":

        Sequence 1:
                "aaa"   3       2
                "bbb"   4       3

        Sequence 2:
                "aaa"   3       2
                "bbb"   4       3
                "aaa"   5       4

        Sequence 3:
                "aaa"   3       2
                "bbb"   4       3
                "aaa"   5       4
                "ccc"   6       5

        Sequence 4:
                "bbb"   4       3
                "aaa"   5       4

        Sequence 5:
                "bbb"   4       3
                "aaa"   5       4
                "ccc"   6       5

        Sequence 6:
                "aaa"   5       4
                "ccc"   6       5

        Sequence 7:
                "aaa"   5       8
                "ccc"   6       9

I also tried it on a couple of larger files. What I used were two chapters of one of Anvil's novels. Note the FS= setting in the command line to tell gawk to treat each word as a record. (Perhaps a quick and dirty check for plagiarism, eh? Although a "good" checker would need to be somewhat more sophisticated.)

Code:

$ gawk -f match5 -v RS="[[:punct:] ]" -v kmin=5 t1 t2
File "t1" contained 3153 records.                                        
File "t2" contained 6786 records.                                        

4 sequences matched between "t1" and "t2":

        Sequence 1:
                "of"    2389    4585
                "what"  2390    4586
                "appeared"      2391    4587
                "to"    2392    4588
                "be"    2393    4589

        Sequence 2:
                "opened"        2891    851
                "his"   2892    852
                "mouth" 2893    853
                "and"   2895    854
                "shut"  2896    855

        Sequence 3:
                "opened"        2891    851
                "his"   2892    852
                "mouth" 2893    853
                "and"   2895    854
                "shut"  2896    855
                "it"    2897    856

        Sequence 4:
                "his"   2892    852
                "mouth" 2893    853
                "and"   2895    854
                "shut"  2896    855
                "it"    2897    856

Here's the gawk program to does the above. Note that this code uses GNU awk extensions and, therefore, may not be portable to other awk implementations.

PHP Code:



############################################################################### 
# 
# genmatch - Replacement of match that (optionally) returns the matched 
#            string(s) in an array 
# 
# WARNING: REGEXP must be a string, not a regular expression constant. 
# 
# Function return: Number of matches made 
# 
############################################################################### 
function genmatch(TARGET,    # String in which to search for matches 
          REGEXP,    # Expression to use to identify target strings 
          MATCHED,    # Array to contain the matched strings 
# Local variables 
        target,        # Local copy of TARGET, modified as needed 
        retv,        # "match" return array 
        i,        # Loop index 
        n)        # Number of matches made 
{ 
  # Clean out the return array 
  delete MATCHED; 
  # Create a local copy of the target string 
  target = (TARGET) ? TARGET : $0; 
  n = 0; 
  # For each match in target . . . 
  while (match(target, REGEXP, retv)) { 
    # Set the MATCHED value and increment the count 
    MATCHED[++n] = retv[0]; 
    # And, for any matched sub-expressions, add them to MATCHED 
    i = 0; 
    while (retv[++i]) { 
      MATCHED[n,i]=retv[i]; 
    } 
    # Now remove the matched expression from the target string so we can find 
    # the next match, if any. 
    sub(REGEXP, "", target); 
  } 
  # All done. Return the number of matched strings found 
  return n; 
} 
############################################################################### 
# 
# Function to find matched sequences two files 
# 
# "Global" variables used: 
# 
#    kmin    Minimum number of matched values needed to define a "sequence" 
#    sep    Non-numeric character which does not occur in any match target 
#    records    Number of records in each file 
#    data    Match values extracted from each input file 
# 
# Note: More than one identical sequence may occur in any file, 
# 
# Function return: The number of matched sequences found 
# 
################################################################################ 
function match_seq(left,    # "left-hand" file name 
           right,    # "right-hand" file name 
           ret,        # Matched sequence(s) 
    # Local variables 
           l_values,    # Left file string:line# pairs 
           nv,        # Number of values (work variable - meaning changes) 
           ns,        # Number of matched sequences found 
           test,    # Candidate sequence modified for search 
           source,    # Left file candidate sequence before modification 
           fields,    # match() return array 
           seq,        # Length of current sequence 
           seq_v,    # Values in current sequence 
           rows,    # Right file matched row number array 
           i,        # Loop index 
           j,        # Loop index 
           k,         # Temp integer 
           l,        # Loop index 
           ll,        # Loop index 
           n_l,        # Length of "left file" 
           n_r)        # Length of "right file" 
{ 
  # Get the number of match fields in the two files 
  n_l = records[left]; 
  n_r = records[right]; 
  # Initialize the number of matched sequences to zero 
  seq = 0; 
  delete ret; 
 
  # Sanity check: Return "no match found" if either file is to short to contain ANY match sequence 
  if ((n_l < kmin) || (n_r < kmin)) { 
    return seq; 
  } 
 
  # Get the number of match values in the left file 
  nv = split(data[left], l_values, sep); 
   
  # Find all sequences in "left" matching any sequences in "right" 
  test = ""; 
  source = ""; 
  for (l = 2; l < nv - 1; ++l) { 
    if (!match(l_values[l], /^(.*):([[:digit:]]+)$/, fields)) { # This fills "fields" with the field value and record number 
      printf("match_seq: Fatal error: Record %d of \"%s\" (\"%s\") could not be parsed.\n", l, left, l_values[l]); 
      exit; 
    } 
    source = sep fields[1] ":" fields[2]; 
    k =  1; 
    values[k] = fields[1]; 
    rows[k]   = fields[2]; 
    test = sep fields[1] ":([[:digit:]]+)" 
    for (ll = l + 1; ll < nv; ++ll) { 
      if (!match(l_values[ll], /^(.*):([[:digit:]]+)$/, fields)) { 
    printf("match_seq: Fatal error: Record %d of \"%s\" (\"%s\") could not be parsed.\n", ll, left, gensub(sep, "|","g",l_values[ll])); 
    exit; 
      } 
      test = test sep fields[1] ":([[:digit:]]+)"; 
      # No point in proceeding if the candidate sequence has failed. 
      if (data[right] !~ test sep) { 
    break; 
      } 
      source = source sep fields[1] ":" fields[2]; 
      values[++k] = fields[1]; 
      rows[k]     = fields[2]; 
      if (ll - l >= kmin - 1) { 
    # Add this sequence to the set of matched sequences if we can find it in "right" 
    if (ns = genmatch(data[right], test sep, seq_v)) { 
      for (i = 1; i <=ns; ++i) { 
        ++seq; 
        for (j = 1; seq_v[i,j]; ++j) { 
          ret["value", seq, j] = values[j]; 
          ret["row", seq, j]   = rows[j]; 
          ret[seq,j] = seq_v[i,j]; 
        } 
      } 
    } 
      } 
    } 
  } 
  return seq; 
} 
 
BEGIN { 
  if (!kmin)    kmin =3;    # Minimum number of consecutive matches (Default: 3) 
  if (!field)    field=0;    # Field on which to match (Default: Whole line) 
  sep  = SUBSEP;        # Non-printing subscript separation character 
  if (out) {            # "csv" matched sequence output file (if specified) 
    printf("\"Left_File\",\"Right_File\",\"Left_Row\",\"Right_Row\",\"Value\"\n") > out; 
  } 
} 
 
# Read the match field data from all input files into memory 
{ 
  value = $ field;                    # Value read from input record 
  if (value == "") next;                # Skip empty records; 
  records[FILENAME] = FNR;                # Input file names (and record count) 
  data[FILENAME] = data[FILENAME] sep value ":" FNR;    # Copy of input file match fields with record 
                            # numbers appended 
} 
 
# At this point we've read all the input into memory 
END { 
  # Count the input files and put their names in the "names" array 
  nfiles = asorti(records, name); 
  # Add a terminating separator to each match field copy 
  for (i = 1; i <= nfiles; ++i) { 
    data[name[i]] = data[name[i]] sep; 
    printf("File \"%s\" contained %d records.\n", name[i], records[name[i]]); 
  } 
  # Find the matched sequences in each pair of input files 
  for (i = 1; i < nfiles; ++i) { 
    for (j = i + 1; j <= nfiles; ++j) { 
      if (nm = match_seq(name[i], name[j], matched)) { 
    printf("\n%d sequences matched between \"%s\" and \"%s\":\n", nm, name[i], name[j]); 
    for (k = 1; k <= nm; ++k) { 
      printf("\n\tSequence %d:\n", k); 
      for (l=1; matched[k,l]; ++l) { 
        printf("\t\t\"%s\"\t%d\t%d\n", matched["value",k,l], matched["row",k,l], matched[k,l]); 
      } 
    } 
      } 
    } 
  } 
}

herveld · 11-26-2008, 02:21 PM

Looks great, thanks a lot !

Is there a way to output only the maximal length matching sequence ?

Hence for the "file1 and file2" tests files, only "Sequence 3" would be output.

PTrenholme · 11-26-2008, 04:58 PM

My point was, that if that were done, you'd miss the "Sequence 7," since the rows 8 and 9 I added to the end of your file 2 match a sub-sequence of "Sequence 3." The code could be changed to handle all that, but it would probably involve more "housekeeping" arrays.

On the other hand, if your (unstated so far) requirements would exclude such matches, than it might be fairly simple to change. Of course, there's another problem: What about short sequences and longer sequences (somewhere else in the same file) that match to the same sequence the the other file? (Of course the shorter sequence would only match to part of the sequence to which the longer sequence matched.)

Have you tried the code on your large files? I'm afraid that the execution time may increase geometrically with files size, and it took several second to compare the two chapters from Anvil's book, so 50,000 record files might take several minutes - or longer - to finish.

herveld · 11-27-2008, 05:04 PM

In fact I don't really care about sub-sequences, only the maximal length matching sequence matters.

I tried your latest code on a 400,000 records file and a 300 records subset file and it took about an hour to complete on my macbook pro. I stated kmin=200 and got 7500 matching sequences (instead of only 1 maximal length matching sequence).

PTrenholme · 11-27-2008, 07:58 PM

OK, but can you define what you mean by "match" and "maximal length?"

For example, "a b c d" matches "a b c e" as "a b c," but matches the first and second "a b c" in "a b c e d a b e a b c." So what should be reported as a match in the second case?

And consider "a b c d" against "a b c e x b c d y a b." Is "a b c" or "b d e" to be reported as the "maximal length" match?

A definition should be something like this:

Quote:

Let X be an ordered set of n elements (x[1], x[2], ... , x[n]) and Y another ordered set of m elements, (y[1], y[2], ..., y[m]) from some domain D.
Let S = {{(i,j)} | (x[i], x[i+1],..., x[j]) for i = 1, 2, ..., n-1 and j=i+1, i+2, ..., n}
and T = {{(r,t)} | (y[r], y[r+1],..., y[m]) for l = 1, 2, ..., m-1 and t=r+1, r+2, ..., m}

Denote by S(i,j) and T(r,t) specific elements of the S and T sets.

Then S(i,j) "matches" T(r,t) when j - i = t - r and x[i]=y[r], x[i+1]=y[r+1], ..., x[j] = y[t] and
there does not exist any element, S(u,v) of S such that S(i,j) is a subset of S(u,v) and . . .

The problem I have is finding out what, precisely, you want to use to fill in the ellipsis at the end of that sample definition.

herveld · 11-28-2008, 10:38 AM

OK, let's say in case of equal length matches such as "a b c" in your first example and "a b c" & "b c d" in your second example, then all of these matching sequences should be reported, given that there is no longer matching sequences.

However, "a b", "b c", or "c d" should not be reported even if kmin=2.

Does that answer your question ?

PTrenholme · 11-28-2008, 11:20 AM

OK, I'll give it a shot. (If my wife doesn't protest too much . . .)

PTrenholme · 11-29-2008, 01:49 PM

OK, here's my attempt to read your mind.

First, here's four test files based on your two original files:

Code:

$ cat file1[/B]
1 111
2 222
3 aaa
4 bbb
5 aaa
6 ccc
7 333
$ cat file2
1 444
2 aaa
3 bbb
4 aaa
5 ccc
6 555
7 666
$ cat file3
1 444
2 aaa
3 bbb
4 aaa
5 ccc
6 555
7 666
8 aaa
9 ccc
$ cat file4
1 444
2 aaa
3 bbb
4 aaa
5 ccc
6 333
7 666

And heres the output. Please look closely at each pair of files to see if the results match what you expect.

Code:

$ gawk -f match7 -v field=2 -v kmin=2 file1 file2 file3 file4
File "file1" contained 7 records.
File "file2" contained 7 records.
File "file3" contained 9 records.
File "file4" contained 7 records.

1 sequence matched between "file1" and "file2":

        Sequence 1:
                "aaa"   3       2
                "bbb"   4       3
                "aaa"   5       4
                "ccc"   6       5

1 sequence matched between "file1" and "file3":

        Sequence 1:
                "aaa"   3       2
                "bbb"   4       3
                "aaa"   5       4
                "ccc"   6       5

1 sequence matched between "file1" and "file4":

        Sequence 1:
                "aaa"   3       2
                "bbb"   4       3
                "aaa"   5       4
                "ccc"   6       5
                "333"   7       6

1 sequence matched between "file2" and "file3":

        Sequence 1:
                "444"   1       1
                "aaa"   2       2
                "bbb"   3       3
                "aaa"   4       4
                "ccc"   5       5
                "555"   6       6
                "666"   7       7

1 sequence matched between "file2" and "file4":

        Sequence 1:
                "444"   1       1
                "aaa"   2       2
                "bbb"   3       3
                "aaa"   4       4
                "ccc"   5       5

2 sequences matched between "file3" and "file4":

        Sequence 1:
                "444"   1       1
                "aaa"   2       2
                "bbb"   3       3
                "aaa"   4       4
                "ccc"   5       5

        Sequence 2:
                "aaa"   8       4
                "ccc"   9       5

Here's the program listing:

Note: The CSV output option has NOT been tested.

Code:

$ cat match
###############################################################################
#
# genmatch - Replacement of match that (optionally) returns the matched
#            string(s) in an array
#
# WARNING: REGEXP must be a string, not a regular expression constant.
#
# Function return: Number of matches made
#
###############################################################################
function genmatch(TARGET,       # String in which to search for matches
                  REGEXP,       # Expression to use to identify target strings
                  MATCHED,      # Array to contain the matched strings
# Local variables
                target,         # Local copy of TARGET, modified as needed
                retv,           # "match" return array
                i,              # Loop index
                n)              # Number of matches made
{
  # Clean out the return array
  delete MATCHED;
  # Create a local copy of the target string
  target = (TARGET) ? TARGET : $0;
  n = 0;
  # For each match in target . . .
  while (match(target, REGEXP, retv)) {
    # Set the MATCHED value and increment the count
    MATCHED[++n] = retv[0];
    # And, for any matched sub-expressions, add them to MATCHED
    i = 0;
    while (retv[++i]) {
      MATCHED[n,i]=retv[i];
    }
    # Now remove the matched expression from the target string so we can find
    # the next match, if any.
    sub(REGEXP, "", target);
  }
  # All done. Return the number of matched strings found
  return n;
}
###############################################################################
#
# Function to find matched sequences two files
#
# "Global" variables used:
#
#       kmin    Minimum number of matched values needed to define a "sequence"
#       sep     Non-numeric character which does not occur in any match target
#       records Number of records in each file
#       data    Match values extracted from each input file
#
# Note: More than one identical sequence may occur in any file,
#
# Function return: The number of matched sequences found
#
################################################################################
function match_seq(left,        # "left-hand" file name
                   right,       # "right-hand" file name
                   ret,         # Matched sequence(s) structure
        # Local variables
                   l_values,    # Left file (string:line#) pairs
                   nv,          # Number of values (work variable - meaning changes)
                   ns,          # Number of matched sequences found
                   prior,       # Values in the sequence matched so far
                   source,      # Left file candidate sequence before modification to "test"
                   prior_source,# Value of "source" prior to last candidate sequence value addition
                   test,        # Candidate sequence modified for search
                   prior_test,  # Prior value of "test"
                   fields,      # match() return array
                   seq,         # Length of current candidate sequence
                   seq_v,       # Values in current candidate sequence
                   rows,        # Right file matched row number array
                   i,           # Loop index
                   j,           # Loop index
                   k,           # Temp integer
                   l,           # Loop index
                   ll,          # Loop index
                   n_l,         # Length of "left file"
                   n_r)         # Length of "right file"
{
  # Get the number of match fields in the two files
  n_l = records[left];
  n_r = records[right];
  # Initialize the number of matched sequences to zero
  seq = 0;
  delete ret;

  # Sanity check: Return "no match found" if either file is to short to contain ANY match sequence
  if ((n_l < kmin) || (n_r < kmin)) {
    return seq;
  }

  # Get the number of match values in the left file (plus 2, since there is a "sep" at each end of "data[left]")
  nv = split(data[left], l_values, sep);

  # Find all maximal sequences in "left" matching any sequences in "right"
  for (l = 2; l < nv - 1; ++l) {
    if (!match(l_values[l], /^(.*):([[:digit:]]+)$/, fields)) { # This fills "fields" with the field value and record number
      # This section should be unreachable for correctly prepared input
      printf("match_seq: Fatal error: Record %d of \"%s\" (\"%s\") could not be parsed.\n", l, left, l_values[l]);
      exit;
    }
    source = sep fields[1] ":" fields[2];
    prior_source = "";
    k =  1;
    values[k] = fields[1];
    rows[k]   = fields[2];
    test = sep fields[1] ":([[:digit:]]+)"
    prior_test = test;
    for (ll = l + 1; ll < nv; ++ll) {
      if (!match(l_values[ll], /^(.*):([[:digit:]]+)$/, fields)) {
        # This section should be unreachable if the input is prepared correctly
        printf("match_seq: Fatal error: Record %d of \"%s\" (\"%s\") could not be parsed.\n", ll, left, gensub(sep, "|","g",l_values[ll]));
        exit;
      }
      prior_test = test;
      test = test sep fields[1] ":([[:digit:]]+)";
      # Exit the inner loop if this candidate sequence fails to match in data[right]}
      if (data[right] !~ test sep) {
        test = prior_test;
        source = prior_source;
        --k;
        break;
      }
      prior_source = source;
      source = source sep fields[1] ":" fields[2];
      values[++k] = fields[1];
      rows[k]     = fields[2];
    }
    # Is the candidate long enough?
    if (k + 2 > kmin) {
    # O.K., we've found a live one. Add it to the return structure.
      if (ns = genmatch(data[right], test sep, seq_v)) {
        for (i = 1; i <=ns; ++i) {
          ++seq;
          for (j = 1; seq_v[i,j]; ++j) {
            ret["value", seq, j] = values[j];
            ret["row", seq, j]   = rows[j];
            ret[seq,j] = seq_v[i,j];
          }
        }
        # Don't look for matches from inside a matched sequence.
        l = l + k;
      }
      else {
      # This section should never be reached since we've already verified the existence of at least one match
        printf("match_seq: Fatal error: Confirmed sequence \"%s\" not found in \"%s\".\n",
                gensub(/:\(\[\[:digit:\]\]\+\)\+\|/, "|", "g", gensub(sep, "|", "g", test sep)), right);
        exit;
      }
    }
  }
  return seq;
}

BEGIN {
  if (!kmin)    kmin =3;        # Minimum number of consecutive matches (Default: 3)
  if (!field)   field=0;        # Field on which to match (Default: Whole line)
  sep  = SUBSEP;                # Non-printing subscript separation character
  if (out) {                    # "csv" matched sequence output file (if specified)
    printf("\"Left_File\",\"Right_File\",\"Left_Row\",\"Right_Row\",\"Value\"\n") > out;
  }
}

# Read the match field data from all input files into memory
{
  value = $ field;                                      # Value read from input record
  if (value == "") next;                                # Skip empty records;
  records[FILENAME] = FNR;                              # Input file names (and record count)
  data[FILENAME] = data[FILENAME] sep value ":" FNR;    # Copy of input file match fields with record
                                                        # numbers appended
}

# At this point we've read all the input into memory
END {
  # Count the input files and put their names in the "names" array
  nfiles = asorti(records, name);
  # Add a terminating separator to each match field copy
  for (i = 1; i <= nfiles; ++i) {
    data[name[i]] = data[name[i]] sep;
    printf("File \"%s\" contained %d records.\n", name[i], records[name[i]]);
  }
  # Find the matched sequences in each pair of input files
  for (i = 1; i < nfiles; ++i) {
    for (j = i + 1; j <= nfiles; ++j) {
      if (nm = match_seq(name[i], name[j], matched)) {
        printf("\n%d sequence%s matched between \"%s\" and \"%s\":\n", nm, (nm==1)?"":"s", name[i], name[j]);
        for (k = 1; k <= nm; ++k) {
          printf("\n\tSequence %d:\n", k);
          for (l=1; matched[k,l]; ++l) {
            printf("\t\t\"%s\"\t%d\t%d\n", matched["value",k,l], matched["row",k,l], matched[k,l]);
          }
        }
      }
    }
  }
}

herveld · 12-01-2008, 03:35 PM

Looks like a winner, thank you so much!

You are the king of gawk!