ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
I normally would put only programming question on here but I'm kind of stuck on this and wondered if anyone has a helpful hint.
I have a list of numbers that I want to match against another set of numbers. It's not an alignment sequence, but just a "how many can I match with x certainty?".
My problem is that I don't know how to tell the user how certain we are of this match, a match coefficient. I know I can create the window and make the window larger but correlating this to a certainty result is where I'm stuck.
If someone could point me to an algorithm that I can look at for this that would be great.
So I was thinking something like a match would be
2.25 would match 2.20 with a 0.05 window. The bigger the window got the more uncertain we are of the match. And the window would stop after a certain size. We would match as many of the 5 number that we could. The more matches the more certain we are. I guess something like the Jaro-Winkler distance or Needleman-Wunsch algorithm.
EDIT: Thinking I'm going to go with edit distances.
ok cool question answered. Thanks
However, I don't see how you plan to apply these to lists of numbers. I suppose you could assume that you are dealing with very large alphabets.
A few examples might be useful. What you have, and what you could consider as matching (not using the Needleman-Wunsch -- I haven't had much luck understanding that) ... cheers, makyo
Umm ok so I guess Needleman was a bad choice since it really is truely for strings. But I was thinking something like this :
To match : 56.7 89.6 98.0 102.7 134.9
55.4 90.4 100.9 300.4
56.3 89.4 102.7 134.8 600.5
Seq : 56.6 89.4 102.7 134.8
So what I'm thinking is having a ppm calculator so that the users can specify how big the window of error should be on the number to number match, but on the sequence the program will decide which is the best match.
I'm thinking of going with a Levenshtein distance ?? Good idea your thought are welcomed.
Thanks,
Paul
EDIT: I have to do this a lot of time on many sequences. Which will increase in the future as the DB grows.
The fact that you're working with numbers suggests that something along the lines of least squares might be useful.
Measuring the difference between pairs, then minimizing and using a derived measure appeals to me intuitively, but perhaps that's too simplistic.
I see that the genomic sequences codes are looking at millions of pairs: http://mbe.oxfordjournals.org/cgi/co...full/20/8/1299 , so that might involve techniques that are too far from your problem. What are the typical and maximum lengths of the sets with which you are working?
Perhaps a statistician will step in here and provide some advice ... cheers, makyo
Well, you could go with an empirical distribution function of the differences:
Let x(1) ... x(n) and y(1) ... y(m) represent your two sets of numbers.
Construct all n*m pairwise differences, x(i)-y(j)
Then, for any observed difference, d, you could estimate the probability that a difference of d or less would occur "by chance" by counting the observed number of x(i)-y(j) values that are less than d and dividing by n*m.
A variation might be to only look at abs(x(i)-y(j)) or, if both x and y come from the same population, you could tabulate all observed differences in the combined set of numbers.
Other variations could be made to take into account the discrete nature of the numbers you''re using, and the resulting probability of duplicated numbers.
Yet another approach is to consider the numbers to represent random, descretized, samples from ether the same or two different populations, and construct either the empirical distribution function(s) for the population(s), and use those estimates to approximate the distribution of the differences and, finally, the probability of any difference being no larger then your observed values.
Finally, the the distribution of statistic (the numbers observed) in the population(s) is known, the "exact" distribution of the the differences could, at least theoretically, be calculated and then the probability of any observed difference being within any specified range.
Wow thanks for the answers guys. At the moment I'm going to go with the Levenshtein distance idea because it give me a nice way to see how well the match was. I'll probably go into some other matching/distance tech later but at the moment I want to try and get a beta out. Also I have the programming already for the Levenshtein distance and the empirical distance isn't quite connected yet in my head.
Just to clarify, what I was discussing is not an “empirical distance.” What I discussed was some way of estimating the probability that the “distance” between any pair of numbers (measured in my example by the simple difference, but any measure could be used) would be less then any observed difference.
Once that probability has been estimated, you can talk about the “likelihood” that some difference would occur “by chance,” and then the “significance” of any difference, or the likelihood that any two number are equal, which is what I thought you were trying to accomplish.
Note that (other than the final part of my discussion) I was suggesting a nonparametric method which yields a probability estimate conditional on the observed numeric values which may (or may not) be appropriate fo your problem.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.