LinuxQuestions.org
Latest LQ Deal: Latest LQ Deals
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 03-13-2007, 06:41 PM   #1
PB0711
Member
 
Registered: Aug 2004
Location: London, UK
Distribution: Ubuntu 10.10, ubuntu 11.04, suse 9.2, OSX
Posts: 259

Rep: Reputation: 30
Algorithm design: number matching


Hello and thanks,

I normally would put only programming question on here but I'm kind of stuck on this and wondered if anyone has a helpful hint.

I have a list of numbers that I want to match against another set of numbers. It's not an alignment sequence, but just a "how many can I match with x certainty?".
My problem is that I don't know how to tell the user how certain we are of this match, a match coefficient. I know I can create the window and make the window larger but correlating this to a certainty result is where I'm stuck.
If someone could point me to an algorithm that I can look at for this that would be great.

Thanks.
 
Old 03-14-2007, 11:59 AM   #2
wjevans_7d1@yahoo.co
Member
 
Registered: Jun 2006
Location: Mariposa
Distribution: Slackware 9.1
Posts: 938

Rep: Reputation: 31
I don't understand the "how certain we are" thing. Either they match or they don't. It's either 0% certainty or 100%. What am I missing?
 
Old 03-14-2007, 12:40 PM   #3
PB0711
Member
 
Registered: Aug 2004
Location: London, UK
Distribution: Ubuntu 10.10, ubuntu 11.04, suse 9.2, OSX
Posts: 259

Original Poster
Rep: Reputation: 30
So I was thinking something like a match would be
2.25 would match 2.20 with a 0.05 window. The bigger the window got the more uncertain we are of the match. And the window would stop after a certain size. We would match as many of the 5 number that we could. The more matches the more certain we are. I guess something like the Jaro-Winkler distance or Needleman-Wunsch algorithm.

EDIT: Thinking I'm going to go with edit distances.
ok cool question answered. Thanks

Last edited by PB0711; 03-14-2007 at 01:17 PM.
 
Old 03-14-2007, 02:48 PM   #4
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

I can see how some of the metrics work with strings -- http://en.wikipedia.org/wiki/String_metrics

However, I don't see how you plan to apply these to lists of numbers. I suppose you could assume that you are dealing with very large alphabets.

A few examples might be useful. What you have, and what you could consider as matching (not using the Needleman-Wunsch -- I haven't had much luck understanding that) ... cheers, makyo
 
Old 03-14-2007, 10:45 PM   #5
PB0711
Member
 
Registered: Aug 2004
Location: London, UK
Distribution: Ubuntu 10.10, ubuntu 11.04, suse 9.2, OSX
Posts: 259

Original Poster
Rep: Reputation: 30
Umm ok so I guess Needleman was a bad choice since it really is truely for strings. But I was thinking something like this :
To match : 56.7 89.6 98.0 102.7 134.9
55.4 90.4 100.9 300.4
56.3 89.4 102.7 134.8 600.5


Seq : 56.6 89.4 102.7 134.8

So what I'm thinking is having a ppm calculator so that the users can specify how big the window of error should be on the number to number match, but on the sequence the program will decide which is the best match.
I'm thinking of going with a Levenshtein distance ?? Good idea your thought are welcomed.

Thanks,

Paul

EDIT: I have to do this a lot of time on many sequences. Which will increase in the future as the DB grows.

Last edited by PB0711; 03-14-2007 at 10:47 PM.
 
Old 03-24-2007, 02:18 PM   #6
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

The fact that you're working with numbers suggests that something along the lines of least squares might be useful.

Measuring the difference between pairs, then minimizing and using a derived measure appeals to me intuitively, but perhaps that's too simplistic.

I see that the genomic sequences codes are looking at millions of pairs: http://mbe.oxfordjournals.org/cgi/co...full/20/8/1299 , so that might involve techniques that are too far from your problem. What are the typical and maximum lengths of the sets with which you are working?

Perhaps a statistician will step in here and provide some advice ... cheers, makyo
 
Old 03-24-2007, 03:25 PM   #7
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354Reputation: 354Reputation: 354Reputation: 354
Well, you could go with an empirical distribution function of the differences:

Let x(1) ... x(n) and y(1) ... y(m) represent your two sets of numbers.
Construct all n*m pairwise differences, x(i)-y(j)
Then, for any observed difference, d, you could estimate the probability that a difference of d or less would occur "by chance" by counting the observed number of x(i)-y(j) values that are less than d and dividing by n*m.

A variation might be to only look at abs(x(i)-y(j)) or, if both x and y come from the same population, you could tabulate all observed differences in the combined set of numbers.

Other variations could be made to take into account the discrete nature of the numbers you''re using, and the resulting probability of duplicated numbers.

Yet another approach is to consider the numbers to represent random, descretized, samples from ether the same or two different populations, and construct either the empirical distribution function(s) for the population(s), and use those estimates to approximate the distribution of the differences and, finally, the probability of any difference being no larger then your observed values.

Finally, the the distribution of statistic (the numbers observed) in the population(s) is known, the "exact" distribution of the the differences could, at least theoretically, be calculated and then the probability of any observed difference being within any specified range.
 
Old 03-26-2007, 11:07 AM   #8
PB0711
Member
 
Registered: Aug 2004
Location: London, UK
Distribution: Ubuntu 10.10, ubuntu 11.04, suse 9.2, OSX
Posts: 259

Original Poster
Rep: Reputation: 30
Wow thanks for the answers guys. At the moment I'm going to go with the Levenshtein distance idea because it give me a nice way to see how well the match was. I'll probably go into some other matching/distance tech later but at the moment I want to try and get a beta out. Also I have the programming already for the Levenshtein distance and the empirical distance isn't quite connected yet in my head.

Thanks everyone,

Paul
 
Old 03-26-2007, 02:16 PM   #9
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354Reputation: 354Reputation: 354Reputation: 354
Just to clarify, what I was discussing is not an “empirical distance.” What I discussed was some way of estimating the probability that the “distance” between any pair of numbers (measured in my example by the simple difference, but any measure could be used) would be less then any observed difference.

Once that probability has been estimated, you can talk about the “likelihood” that some difference would occur “by chance,” and then the “significance” of any difference, or the likelihood that any two number are equal, which is what I thought you were trying to accomplish.

Note that (other than the final part of my discussion) I was suggesting a nonparametric method which yields a probability estimate conditional on the observed numeric values which may (or may not) be appropriate fo your problem.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Find/grep command to find matching files, print filename, then print matching content stefanlasiewski Programming 9 06-30-2016 05:30 PM
Linux for Graphic Design, web design, and publishing maelstrom209 Linux - Software 8 07-17-2011 11:35 AM
algorithm for fingerprint matching applee Programming 2 02-05-2007 01:02 AM
Algorithm for fingerprint matching applee General 0 01-30-2007 08:24 AM
a small game design algorithm feetyouwell Programming 4 03-23-2004 03:08 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 07:08 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration