ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
I started with the first program I wrote which matched single lines, and that logic constrains sequences to have unique values. I didn't change that part when I modified the code, so . . .
Anyhow, I'll have to think about it some more.
Note, please, that you're now seeing that programmers, like computers, do what they understand your problem to be, not what you think they understand. An example of why they're always complaining about clients "changing the specifications" during the course of a project. Most often, what they're really saying is that they and the client failed to understand each other at the start of the project.
Aw well, let me see if I can get it right. More later.
Here's the output from my latest iteration on trying to guess what you needed. Note that I added a couple lines to your test "file2" so you could see what happens when one sequence on the letf matches several sequences on the right.
One thing that wasn't clear was whether or not you wanted to see only the maximal length matching sequences, or all matches of more than kmin characters. In this code I opted to show all the matching sequences.
I also tried it on a couple of larger files. What I used were two chapters of one of Anvil's novels. Note the FS= setting in the command line to tell gawk to treat each word as a record. (Perhaps a quick and dirty check for plagiarism, eh? Although a "good" checker would need to be somewhat more sophisticated.)
Here's the gawk program to does the above. Note that this code uses GNU awk extensions and, therefore, may not be portable to other awk implementations.
PHP Code:
############################################################################### # # genmatch - Replacement of match that (optionally) returns the matched # string(s) in an array # # WARNING: REGEXP must be a string, not a regular expression constant. # # Function return: Number of matches made # ############################################################################### function genmatch(TARGET, # String in which to search for matches REGEXP, # Expression to use to identify target strings MATCHED, # Array to contain the matched strings # Local variables target, # Local copy of TARGET, modified as needed retv, # "match" return array i, # Loop index n) # Number of matches made { # Clean out the return array delete MATCHED; # Create a local copy of the target string target = (TARGET) ? TARGET : $0; n = 0; # For each match in target . . . while (match(target, REGEXP, retv)) { # Set the MATCHED value and increment the count MATCHED[++n] = retv[0]; # And, for any matched sub-expressions, add them to MATCHED i = 0; while (retv[++i]) { MATCHED[n,i]=retv[i]; } # Now remove the matched expression from the target string so we can find # the next match, if any. sub(REGEXP, "", target); } # All done. Return the number of matched strings found return n; } ############################################################################### # # Function to find matched sequences two files # # "Global" variables used: # # kmin Minimum number of matched values needed to define a "sequence" # sep Non-numeric character which does not occur in any match target # records Number of records in each file # data Match values extracted from each input file # # Note: More than one identical sequence may occur in any file, # # Function return: The number of matched sequences found # ################################################################################ function match_seq(left, # "left-hand" file name right, # "right-hand" file name ret, # Matched sequence(s) # Local variables l_values, # Left file string:line# pairs nv, # Number of values (work variable - meaning changes) ns, # Number of matched sequences found test, # Candidate sequence modified for search source, # Left file candidate sequence before modification fields, # match() return array seq, # Length of current sequence seq_v, # Values in current sequence rows, # Right file matched row number array i, # Loop index j, # Loop index k, # Temp integer l, # Loop index ll, # Loop index n_l, # Length of "left file" n_r) # Length of "right file" { # Get the number of match fields in the two files n_l = records[left]; n_r = records[right]; # Initialize the number of matched sequences to zero seq = 0; delete ret;
# Sanity check: Return "no match found" if either file is to short to contain ANY match sequence if ((n_l < kmin) || (n_r < kmin)) { return seq; }
# Get the number of match values in the left file nv = split(data[left], l_values, sep);
# Find all sequences in "left" matching any sequences in "right" test = ""; source = ""; for (l = 2; l < nv - 1; ++l) { if (!match(l_values[l], /^(.*):([[:digit:]]+)$/, fields)) { # This fills "fields" with the field value and record number printf("match_seq: Fatal error: Record %d of \"%s\" (\"%s\") could not be parsed.\n", l, left, l_values[l]); exit; } source = sep fields[1] ":" fields[2]; k = 1; values[k] = fields[1]; rows[k] = fields[2]; test = sep fields[1] ":([[:digit:]]+)" for (ll = l + 1; ll < nv; ++ll) { if (!match(l_values[ll], /^(.*):([[:digit:]]+)$/, fields)) { printf("match_seq: Fatal error: Record %d of \"%s\" (\"%s\") could not be parsed.\n", ll, left, gensub(sep, "|","g",l_values[ll])); exit; } test = test sep fields[1] ":([[:digit:]]+)"; # No point in proceeding if the candidate sequence has failed. if (data[right] !~ test sep) { break; } source = source sep fields[1] ":" fields[2]; values[++k] = fields[1]; rows[k] = fields[2]; if (ll - l >= kmin - 1) { # Add this sequence to the set of matched sequences if we can find it in "right" if (ns = genmatch(data[right], test sep, seq_v)) { for (i = 1; i <=ns; ++i) { ++seq; for (j = 1; seq_v[i,j]; ++j) { ret["value", seq, j] = values[j]; ret["row", seq, j] = rows[j]; ret[seq,j] = seq_v[i,j]; } } } } } } return seq; }
BEGIN { if (!kmin) kmin =3; # Minimum number of consecutive matches (Default: 3) if (!field) field=0; # Field on which to match (Default: Whole line) sep = SUBSEP; # Non-printing subscript separation character if (out) { # "csv" matched sequence output file (if specified) printf("\"Left_File\",\"Right_File\",\"Left_Row\",\"Right_Row\",\"Value\"\n") > out; } }
# Read the match field data from all input files into memory { value = $ field; # Value read from input record if (value == "") next; # Skip empty records; records[FILENAME] = FNR; # Input file names (and record count) data[FILENAME] = data[FILENAME] sep value ":" FNR; # Copy of input file match fields with record # numbers appended }
# At this point we've read all the input into memory END { # Count the input files and put their names in the "names" array nfiles = asorti(records, name); # Add a terminating separator to each match field copy for (i = 1; i <= nfiles; ++i) { data[name[i]] = data[name[i]] sep; printf("File \"%s\" contained %d records.\n", name[i], records[name[i]]); } # Find the matched sequences in each pair of input files for (i = 1; i < nfiles; ++i) { for (j = i + 1; j <= nfiles; ++j) { if (nm = match_seq(name[i], name[j], matched)) { printf("\n%d sequences matched between \"%s\" and \"%s\":\n", nm, name[i], name[j]); for (k = 1; k <= nm; ++k) { printf("\n\tSequence %d:\n", k); for (l=1; matched[k,l]; ++l) { printf("\t\t\"%s\"\t%d\t%d\n", matched["value",k,l], matched["row",k,l], matched[k,l]); } } } } } }
My point was, that if that were done, you'd miss the "Sequence 7," since the rows 8 and 9 I added to the end of your file 2 match a sub-sequence of "Sequence 3." The code could be changed to handle all that, but it would probably involve more "housekeeping" arrays.
On the other hand, if your (unstated so far) requirements would exclude such matches, than it might be fairly simple to change. Of course, there's another problem: What about short sequences and longer sequences (somewhere else in the same file) that match to the same sequence the the other file? (Of course the shorter sequence would only match to part of the sequence to which the longer sequence matched.)
Have you tried the code on your large files? I'm afraid that the execution time may increase geometrically with files size, and it took several second to compare the two chapters from Anvil's book, so 50,000 record files might take several minutes - or longer - to finish.
In fact I don't really care about sub-sequences, only the maximal length matching sequence matters.
I tried your latest code on a 400,000 records file and a 300 records subset file and it took about an hour to complete on my macbook pro. I stated kmin=200 and got 7500 matching sequences (instead of only 1 maximal length matching sequence).
OK, but can you define what you mean by "match" and "maximal length?"
For example, "a b c d" matches "a b c e" as "a b c," but matches the first and second "a b c" in "a b c e d a b e a b c." So what should be reported as a match in the second case?
And consider "a b c d" against "a b c e x b c d y a b." Is "a b c" or "b d e" to be reported as the "maximal length" match?
A definition should be something like this:
Quote:
Let X be an ordered set of n elements (x[1], x[2], ... , x[n]) and Y another ordered set of m elements, (y[1], y[2], ..., y[m]) from some domain D.
Let S = {{(i,j)} | (x[i], x[i+1],..., x[j]) for i = 1, 2, ..., n-1 and j=i+1, i+2, ..., n}
and T = {{(r,t)} | (y[r], y[r+1],..., y[m]) for l = 1, 2, ..., m-1 and t=r+1, r+2, ..., m}
Denote by S(i,j) and T(r,t) specific elements of the S and T sets.
Then S(i,j) "matches" T(r,t) when j - i = t - r and x[i]=y[r], x[i+1]=y[r+1], ..., x[j] = y[t] and
there does not exist any element, S(u,v) of S such that S(i,j) is a subset of S(u,v) and . . .
The problem I have is finding out what, precisely, you want to use to fill in the ellipsis at the end of that sample definition.
OK, let's say in case of equal length matches such as "a b c" in your first example and "a b c" & "b c d" in your second example, then all of these matching sequences should be reported, given that there is no longer matching sequences.
However, "a b", "b c", or "c d" should not be reported even if kmin=2.
$ cat match
###############################################################################
#
# genmatch - Replacement of match that (optionally) returns the matched
# string(s) in an array
#
# WARNING: REGEXP must be a string, not a regular expression constant.
#
# Function return: Number of matches made
#
###############################################################################
function genmatch(TARGET, # String in which to search for matches
REGEXP, # Expression to use to identify target strings
MATCHED, # Array to contain the matched strings
# Local variables
target, # Local copy of TARGET, modified as needed
retv, # "match" return array
i, # Loop index
n) # Number of matches made
{
# Clean out the return array
delete MATCHED;
# Create a local copy of the target string
target = (TARGET) ? TARGET : $0;
n = 0;
# For each match in target . . .
while (match(target, REGEXP, retv)) {
# Set the MATCHED value and increment the count
MATCHED[++n] = retv[0];
# And, for any matched sub-expressions, add them to MATCHED
i = 0;
while (retv[++i]) {
MATCHED[n,i]=retv[i];
}
# Now remove the matched expression from the target string so we can find
# the next match, if any.
sub(REGEXP, "", target);
}
# All done. Return the number of matched strings found
return n;
}
###############################################################################
#
# Function to find matched sequences two files
#
# "Global" variables used:
#
# kmin Minimum number of matched values needed to define a "sequence"
# sep Non-numeric character which does not occur in any match target
# records Number of records in each file
# data Match values extracted from each input file
#
# Note: More than one identical sequence may occur in any file,
#
# Function return: The number of matched sequences found
#
################################################################################
function match_seq(left, # "left-hand" file name
right, # "right-hand" file name
ret, # Matched sequence(s) structure
# Local variables
l_values, # Left file (string:line#) pairs
nv, # Number of values (work variable - meaning changes)
ns, # Number of matched sequences found
prior, # Values in the sequence matched so far
source, # Left file candidate sequence before modification to "test"
prior_source,# Value of "source" prior to last candidate sequence value addition
test, # Candidate sequence modified for search
prior_test, # Prior value of "test"
fields, # match() return array
seq, # Length of current candidate sequence
seq_v, # Values in current candidate sequence
rows, # Right file matched row number array
i, # Loop index
j, # Loop index
k, # Temp integer
l, # Loop index
ll, # Loop index
n_l, # Length of "left file"
n_r) # Length of "right file"
{
# Get the number of match fields in the two files
n_l = records[left];
n_r = records[right];
# Initialize the number of matched sequences to zero
seq = 0;
delete ret;
# Sanity check: Return "no match found" if either file is to short to contain ANY match sequence
if ((n_l < kmin) || (n_r < kmin)) {
return seq;
}
# Get the number of match values in the left file (plus 2, since there is a "sep" at each end of "data[left]")
nv = split(data[left], l_values, sep);
# Find all maximal sequences in "left" matching any sequences in "right"
for (l = 2; l < nv - 1; ++l) {
if (!match(l_values[l], /^(.*):([[:digit:]]+)$/, fields)) { # This fills "fields" with the field value and record number
# This section should be unreachable for correctly prepared input
printf("match_seq: Fatal error: Record %d of \"%s\" (\"%s\") could not be parsed.\n", l, left, l_values[l]);
exit;
}
source = sep fields[1] ":" fields[2];
prior_source = "";
k = 1;
values[k] = fields[1];
rows[k] = fields[2];
test = sep fields[1] ":([[:digit:]]+)"
prior_test = test;
for (ll = l + 1; ll < nv; ++ll) {
if (!match(l_values[ll], /^(.*):([[:digit:]]+)$/, fields)) {
# This section should be unreachable if the input is prepared correctly
printf("match_seq: Fatal error: Record %d of \"%s\" (\"%s\") could not be parsed.\n", ll, left, gensub(sep, "|","g",l_values[ll]));
exit;
}
prior_test = test;
test = test sep fields[1] ":([[:digit:]]+)";
# Exit the inner loop if this candidate sequence fails to match in data[right]}
if (data[right] !~ test sep) {
test = prior_test;
source = prior_source;
--k;
break;
}
prior_source = source;
source = source sep fields[1] ":" fields[2];
values[++k] = fields[1];
rows[k] = fields[2];
}
# Is the candidate long enough?
if (k + 2 > kmin) {
# O.K., we've found a live one. Add it to the return structure.
if (ns = genmatch(data[right], test sep, seq_v)) {
for (i = 1; i <=ns; ++i) {
++seq;
for (j = 1; seq_v[i,j]; ++j) {
ret["value", seq, j] = values[j];
ret["row", seq, j] = rows[j];
ret[seq,j] = seq_v[i,j];
}
}
# Don't look for matches from inside a matched sequence.
l = l + k;
}
else {
# This section should never be reached since we've already verified the existence of at least one match
printf("match_seq: Fatal error: Confirmed sequence \"%s\" not found in \"%s\".\n",
gensub(/:\(\[\[:digit:\]\]\+\)\+\|/, "|", "g", gensub(sep, "|", "g", test sep)), right);
exit;
}
}
}
return seq;
}
BEGIN {
if (!kmin) kmin =3; # Minimum number of consecutive matches (Default: 3)
if (!field) field=0; # Field on which to match (Default: Whole line)
sep = SUBSEP; # Non-printing subscript separation character
if (out) { # "csv" matched sequence output file (if specified)
printf("\"Left_File\",\"Right_File\",\"Left_Row\",\"Right_Row\",\"Value\"\n") > out;
}
}
# Read the match field data from all input files into memory
{
value = $ field; # Value read from input record
if (value == "") next; # Skip empty records;
records[FILENAME] = FNR; # Input file names (and record count)
data[FILENAME] = data[FILENAME] sep value ":" FNR; # Copy of input file match fields with record
# numbers appended
}
# At this point we've read all the input into memory
END {
# Count the input files and put their names in the "names" array
nfiles = asorti(records, name);
# Add a terminating separator to each match field copy
for (i = 1; i <= nfiles; ++i) {
data[name[i]] = data[name[i]] sep;
printf("File \"%s\" contained %d records.\n", name[i], records[name[i]]);
}
# Find the matched sequences in each pair of input files
for (i = 1; i < nfiles; ++i) {
for (j = i + 1; j <= nfiles; ++j) {
if (nm = match_seq(name[i], name[j], matched)) {
printf("\n%d sequence%s matched between \"%s\" and \"%s\":\n", nm, (nm==1)?"":"s", name[i], name[j]);
for (k = 1; k <= nm; ++k) {
printf("\n\tSequence %d:\n", k);
for (l=1; matched[k,l]; ++l) {
printf("\t\t\"%s\"\t%d\t%d\n", matched["value",k,l], matched["row",k,l], matched[k,l]);
}
}
}
}
}
}
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.