Applied Math Problem: Identify Outlier in Numeric Trend
I'm a DIY programmer working on a GIS topic in PHP.
I've got an array of elevations in meters, in a time series. The elevation values are sampled frequently and definitely trend slowly up/down. The data comes from a Garmin GPS, and sometimes the values written are physically impossible.
Good set: 33.33, 33.35, 33.40, 33.41, 33.41, 33.41, 33.39
Bad set: 33.33, 33.35, 36.8, 38.9, 44.00, 33.41, 33.39
In the bad set, a road cannot climb 3 meters in about 1 second and then suddenly drop back down. GPS collection isn't perfect on these small devices.
I tried a z-score based on the standard deviation, and it sort of works, but the z-score differences between good and bad values are small.
Does someone know of a trend analysis type algorithm that would easily identify those bad values? I was looking at a low-pass filter, but it's not clear if that is overkill, or even a good tool.
If it's not obvious, I am not mathematically talented.
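Roughly the kind of z-score check I tried (a simplified sketch; the function name is just for illustration):
Code:
<?php
// Score each elevation against the mean and standard deviation of the
// whole array. The outliers themselves inflate the deviation, which is
// why the differences between good and bad scores stay small.
function zScores(array $values): array
{
    $n = count($values);
    $mean = array_sum($values) / $n;

    $variance = 0.0;
    foreach ($values as $v) {
        $variance += ($v - $mean) ** 2;
    }
    $std = sqrt($variance / $n);

    $scores = [];
    foreach ($values as $v) {
        $scores[] = ($std > 0) ? ($v - $mean) / $std : 0.0;
    }
    return $scores;
}

print_r(zScores([33.33, 33.35, 36.8, 38.9, 44.00, 33.41, 33.39]));
?>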
For your particular problem, I wouldn't approach it statistically or with a filter: just do it the way you identified it. Put a limit on the physically possible. Flag any values where the change is implausible, say, greater than 0.25 meters. You can change the threshold number with some experimentation.
[edit] Strictly speaking, what I have suggested is a filter, though simple. I am a big believer in less processing of data with fewer assumptions. I did assume your data was a fixed time interval time series from your description.
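A minimal PHP sketch of what I mean, comparing each sample against the last accepted value rather than the immediately preceding one, so a single spike doesn't also flag the return to the trend (the 0.25 m limit and the names are just examples to experiment with):
Code:
<?php
// Flag samples that jump more than $maxDelta metres away from the last
// value we accepted as plausible. Assumes a fixed sample interval and
// that the first sample is trustworthy.
function flagImplausibleJumps(array $elevations, float $maxDelta = 0.25): array
{
    $flags = array_fill(0, count($elevations), false);
    $lastGood = $elevations[0];

    for ($i = 1; $i < count($elevations); $i++) {
        if (abs($elevations[$i] - $lastGood) > $maxDelta) {
            $flags[$i] = true;                 // suspect reading
        } else {
            $lastGood = $elevations[$i];       // accept as the new reference
        }
    }
    return $flags;
}

// The "bad set" from the original post: indices 2, 3 and 4 get flagged.
print_r(flagImplausibleJumps([33.33, 33.35, 36.8, 38.9, 44.00, 33.41, 33.39]));
?>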
Mostlyharmless' suggestion of placing some limit, or error band, on possible values is sound, but your description lacks the detail needed to make a good choice for those limits.
You say that the samples are a time series, "sampled frequently", but you need to define that sample frequency much better, as well as how that relates to displacement in space, or speed.
If the samples are in linear time, once per second for example, then how much change in elevation is too much depends critically on speed.
On the other hand, if samples are ticked off by distance, such as one sample per X rotations of a wheel, then it is not a time series at all, speed does not appear in the data, and it would be somewhat easier to say how much is too much.
There is also the possibility that samples are neither time nor distance related, such as if the driver is told to take a sample at each intersection, in which case you can't easily pick an error limit as it would then be route dependent!
Based upon "GPS data in a time series" I assume it is just linear time, maybe 1 sample every 1-5 seconds, but I could be wrong.
Is this a GPS receiver that just logs position data to a file that you download, or are you recording "raw" NMEA messages? If the latter is confusing, then never mind. Just curious, because if you were recording NMEA messages then you could possibly check the altitude accuracy reported at that time and throw out that value without having to do much extra post-processing.
Quote:
If the samples are in linear time, once per second for example, then how much change in elevation is too much depends critically on speed.
We're talking about conventional roads here and high frequency sample rates like 1 per second. There's no road where you gain/lose 1 meter or more in just a second.
I think I'm going to do a moving average as that will catch the bad readings, then I can replace the bad reading with the moving average.
Quote:
Based upon "GPS data in a time series" I assume it is just linear time, maybe 1 sample every 1-5 seconds, but I could be wrong.
Is this a GPS receiver that just logs position data to a file that you download, or are you recording "raw" NMEA messages? If the latter is confusing, then never mind. Just curious, because if you were recording NMEA messages then you could possibly check the altitude accuracy reported at that time and throw out that value without having to do much extra post-processing.
At the moment I'm using a mobile phone app that writes your track to a .gpx file. Bicyclists and runners use standalone GPS devices from brands like Garmin; it's the same idea, only I'm using my mobile phone.
I am checking altitude accuracy using USGS data. The altitude written into the .gpx file from my mobile phone is very wrong. But checking *every* point, when the file has one line per second, is not really practical and is time-consuming. My idea is to check a few points and adjust the .gpx data to get close enough.
Eventually, I'll do the same adjustments for the occasional bad longitude/latitude values. That's a different problem though.
Quote:
We're talking about conventional roads here and high frequency sample rates like 1 per second. There's no road where you gain/lose 1 meter or more in just a second.
Sure there are road conditions where you can gain/lose 1 meter in less than a second; you're not considering velocity here, are you? You can sample GPS faster; however, it also depends on whether the correlators in the device will provide output that fast. We've used 5 Hz and 10 Hz on devices, but there are faster ones, just not commercial. I recommend something using a binary protocol like OSP or another proprietary binary protocol so you can get much more detailed information. However, this all sort of falls apart because commercial GPS only promises +/- 3 meter accuracy.
Quote:
Originally Posted by michaelk
Just as an FYI, navigation systems use a Kalman filter to get rid of outlying sensor data. Much too complicated for your project.
Agreed, and whether or not you're looking directly at the messages, those messages are already past the correlators in the device. There are no GPS devices where you can see the raw data as received from the satellites; they already have their filter in the mix, because that is their product.
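For reference only, a one-dimensional Kalman filter on just the elevation is only a handful of lines; the hard part is tuning the noise terms, which is why it's overkill here. A sketch assuming a random-walk elevation model (names and noise values are purely illustrative):
Code:
<?php
// One-dimensional Kalman filter treating elevation as a random walk.
// $processNoise: how much the true elevation is expected to drift per sample.
// $measurementNoise: how noisy the GPS altitude readings are believed to be.
function kalman1d(array $measurements, float $processNoise = 0.01, float $measurementNoise = 9.0): array
{
    $estimate = $measurements[0];
    $errorCov = 1.0;                    // initial estimate uncertainty
    $out = [$estimate];

    for ($i = 1; $i < count($measurements); $i++) {
        // Predict: the model says elevation stays put, so only uncertainty grows.
        $errorCov += $processNoise;

        // Update: blend the prediction with the new measurement.
        $gain = $errorCov / ($errorCov + $measurementNoise);
        $estimate += $gain * ($measurements[$i] - $estimate);
        $errorCov *= (1.0 - $gain);

        $out[] = $estimate;
    }
    return $out;
}
?>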
You might want to look into one of the highly accurate GPS devices which claim to use dead reckoning. However, my caveat is that while we've "tried", two or "ten" things interfere. Firstly, the accuracy of the accelerometers on a phone is horrible, and they'll want 16-bit accelerometer accuracy in 10 dimensions: staticXYZ, magXYZ, gyroXYZ, and, for your case, barometric. Secondly, all the vendors who promise this somehow can't give me a demo/devkit board which actually works. They just tell me, "Oh, we've used the blah-blah chip with our GPS module ... that works," and then they never answer the phone or email again. You need to calibrate the position sensors and ensure that there are no magnetic interferences (and on a road there always are: cars/trucks, sign posts, iron content in the soil, underground pipes). And you'll need to control the temperature with enough stability to get the gyros accurate.
Quote:
Originally Posted by mostlyharmless
For your particular problem, I wouldn't approach it statistically or with a filter: just do it the way you identified it. Put a limit on the physically possible. Flag any values where the change is implausible, say, greater than 0.25 meters. You can change the threshold number with some experimentation.
[edit] Strictly speaking, what I have suggested is a filter, though simple. I am a big believer in less processing of data with fewer assumptions. I did assume your data was a fixed time interval time series from your description.
This is what I'm going to do. I'm time-constrained, and I too prefer the simplest answer. I'm using consumer-level GPS devices, and that's all. I'll do a moving average, as that isn't hard to do with a small array (array_push/array_pop); something like the sketch below.
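Roughly what I have in mind (window size, threshold and names are just examples; note that a FIFO window needs array_shift rather than array_pop to drop the oldest value):
Code:
<?php
// Replace suspect readings with the average of a small window of recently
// accepted values. Window size (5) and threshold (0.25 m) are examples.
function smoothElevations(array $elevations, int $window = 5, float $maxDelta = 0.25): array
{
    $buffer = [];     // sliding window of accepted values
    $out = [];

    foreach ($elevations as $e) {
        $avg = count($buffer) > 0 ? array_sum($buffer) / count($buffer) : $e;

        if (abs($e - $avg) > $maxDelta) {
            $out[] = $avg;                 // suspect: substitute the moving average
        } else {
            $out[] = $e;                   // plausible: keep it and extend the window
            array_push($buffer, $e);
            if (count($buffer) > $window) {
                array_shift($buffer);      // drop the oldest accepted value
            }
        }
    }
    return $out;
}

print_r(smoothElevations([33.33, 33.35, 36.8, 38.9, 44.00, 33.41, 33.39]));
?>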
You can also use other statistics like "Standard Deviation."
You might wish to calculate the distance (Pythagorean Theorem ...) from one point to the next, and consider whether it just had you moving several miles in a few seconds' time. If so, consider larger ranges until the distance ... the velocity of the traveler ... becomes "plausible" again. The intermediate points within that range might be outliers.
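A sketch of that check, using a flat-earth approximation in place of a plain Pythagorean distance since the points are latitude/longitude (the 40 m/s speed ceiling is only an example):
Code:
<?php
// Approximate horizontal distance in metres between two lat/lon points.
// A flat-earth (equirectangular) approximation is fine for consecutive
// GPS samples that are at most a few hundred metres apart.
function approxDistanceMeters(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $earthRadius = 6371000.0;   // metres
    $x = deg2rad($lon2 - $lon1) * cos(deg2rad(($lat1 + $lat2) / 2));
    $y = deg2rad($lat2 - $lat1);
    return sqrt($x * $x + $y * $y) * $earthRadius;
}

// Implied speed between two samples; flag the later point if it moves
// faster than anything plausible for road travel.
function isSpeedPlausible(float $meters, float $seconds, float $maxSpeed = 40.0): bool
{
    return $seconds > 0 && ($meters / $seconds) <= $maxSpeed;
}
?>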