Algorithm design: number matching

PB0711 · 03-13-2007, 06:41 PM

Hello and thanks,

I normally would put only programming question on here but I'm kind of stuck on this and wondered if anyone has a helpful hint.

I have a list of numbers that I want to match against another set of numbers. It's not an alignment sequence, but just a "how many can I match with x certainty?".
My problem is that I don't know how to tell the user how certain we are of this match, a match coefficient. I know I can create the window and make the window larger but correlating this to a certainty result is where I'm stuck.
If someone could point me to an algorithm that I can look at for this that would be great.

Thanks.

wjevans_7d1@yahoo.co · 03-14-2007, 11:59 AM

I don't understand the "how certain we are" thing. Either they match or they don't. It's either 0% certainty or 100%. What am I missing?

PB0711 · 03-14-2007, 12:40 PM

So I was thinking something like a match would be
2.25 would match 2.20 with a 0.05 window. The bigger the window got the more uncertain we are of the match. And the window would stop after a certain size. We would match as many of the 5 number that we could. The more matches the more certain we are. I guess something like the Jaro-Winkler distance or Needleman-Wunsch algorithm.

EDIT: Thinking I'm going to go with edit distances.
ok cool question answered. Thanks

makyo · 03-14-2007, 02:48 PM

Hi.

I can see how some of the metrics work with strings -- http://en.wikipedia.org/wiki/String_metrics

However, I don't see how you plan to apply these to lists of numbers. I suppose you could assume that you are dealing with very large alphabets.

A few examples might be useful. What you have, and what you could consider as matching (not using the Needleman-Wunsch -- I haven't had much luck understanding that) ... cheers, makyo

PB0711 · 03-14-2007, 10:45 PM

Umm ok so I guess Needleman was a bad choice since it really is truely for strings. But I was thinking something like this :
To match : 56.7 89.6 98.0 102.7 134.9
55.4 90.4 100.9 300.4
56.3 89.4 102.7 134.8 600.5

Seq : 56.6 89.4 102.7 134.8

So what I'm thinking is having a ppm calculator so that the users can specify how big the window of error should be on the number to number match, but on the sequence the program will decide which is the best match.
I'm thinking of going with a Levenshtein distance ?? Good idea your

thought are welcomed.

Thanks,

Paul

EDIT: I have to do this a lot of time on many sequences. Which will increase in the future as the DB grows.

makyo · 03-24-2007, 02:18 PM

Hi.

The fact that you're working with numbers suggests that something along the lines of least squares might be useful.

Measuring the difference between pairs, then minimizing and using a derived measure appeals to me intuitively, but perhaps that's too simplistic.

I see that the genomic sequences codes are looking at millions of pairs: http://mbe.oxfordjournals.org/cgi/co...full/20/8/1299 , so that might involve techniques that are too far from your problem. What are the typical and maximum lengths of the sets with which you are working?

Perhaps a statistician will step in here and provide some advice ... cheers, makyo

PTrenholme · 03-24-2007, 03:25 PM

Well, you could go with an empirical distribution function of the differences:

Let x(1) ... x(n) and y(1) ... y(m) represent your two sets of numbers.
Construct all n*m pairwise differences, x(i)-y(j)
Then, for any observed difference, d, you could estimate the probability that a difference of d or less would occur "by chance" by counting the observed number of x(i)-y(j) values that are less than d and dividing by n*m.

A variation might be to only look at abs(x(i)-y(j)) or, if both x and y come from the same population, you could tabulate all observed differences in the combined set of numbers.

Other variations could be made to take into account the discrete nature of the numbers you''re using, and the resulting probability of duplicated numbers.

Yet another approach is to consider the numbers to represent random, descretized, samples from ether the same or two different populations, and construct either the empirical distribution function(s) for the population(s), and use those estimates to approximate the distribution of the differences and, finally, the probability of any difference being no larger then your observed values.

Finally, the the distribution of statistic (the numbers observed) in the population(s) is known, the "exact" distribution of the the differences could, at least theoretically, be calculated and then the probability of any observed difference being within any specified range.

PB0711 · 03-26-2007, 11:07 AM

Wow thanks for the answers guys. At the moment I'm going to go with the Levenshtein distance idea because it give me a nice way to see how well the match was. I'll probably go into some other matching/distance tech later but at the moment I want to try and get a beta out. Also I have the programming already for the Levenshtein distance and the empirical distance isn't quite connected yet in my head.

Thanks everyone,

Paul

PTrenholme · 03-26-2007, 02:16 PM

Just to clarify, what I was discussing is not an “empirical distance.” What I discussed was some way of estimating the probability that the “distance” between any pair of numbers (measured in my example by the simple difference, but any measure could be used) would be less then any observed difference.

Once that probability has been estimated, you can talk about the “likelihood” that some difference would occur “by chance,” and then the “significance” of any difference, or the likelihood that any two number are equal, which is what I thought you were trying to accomplish.

Note that (other than the final part of my discussion) I was suggesting a nonparametric method which yields a probability estimate conditional on the observed numeric values which may (or may not) be appropriate fo your problem.