Applied Math Problem: Identify Outlier in Numeric Trend
I'm a DIY programmer working on a GIS topic in PHP.
I've got an array of elevations in meters, in a time series. The elevation values are sampled frequently and definitely trend slowly up/down. The data comes from a Garmin GPS, and sometimes the values written are physically impossible.
Good set: 33.33, 33.35, 33.40, 33.41, 33.41, 33.41, 33.39
Bad set: 33.33, 33.35, 36.8, 38.9, 44.00, 33.41, 33.39
In the bad set, a road cannot climb 3 meters in about 1 second and then suddenly drop back down. GPS collection isn't perfect on these small devices.
I tried a z-score based on the standard deviation, and it sort of works, but the z-score differences between good and bad values are small.
Does someone know of a trend analysis type algorithm that would easily identify those bad values? I was looking at a low-pass filter, but it's not clear if that is overkill, or even a good tool.
If it's not obvious, I am not mathematically talented.
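Roughly the kind of z-score check I tried (a simplified sketch; the function name is just for illustration):
Code:
<?php
// Score each elevation against the mean and standard deviation of the
// whole array. The outliers themselves inflate the deviation, which is
// why the differences between good and bad scores stay small.
function zScores(array $values): array
{
    $n = count($values);
    $mean = array_sum($values) / $n;

    $variance = 0.0;
    foreach ($values as $v) {
        $variance += ($v - $mean) ** 2;
    }
    $std = sqrt($variance / $n);

    $scores = [];
    foreach ($values as $v) {
        $scores[] = ($std > 0) ? ($v - $mean) / $std : 0.0;
    }
    return $scores;
}

print_r(zScores([33.33, 33.35, 36.8, 38.9, 44.00, 33.41, 33.39]));
?>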
For your particular problem, I wouldn't approach it statistically or with a filter: just do it the way you identified it. Put a limit on the physically possible. Flag any values where the change is implausible, say, greater than 0.25 meters. You can change the threshold number with some experimentation.
[edit] Strictly speaking, what I have suggested is a filter, though simple. I am a big believer in less processing of data with fewer assumptions. I did assume your data was a fixed time interval time series from your description.
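A minimal PHP sketch of what I mean, comparing each sample against the last accepted value rather than the immediately preceding one, so a single spike doesn't also flag the return to the trend (the 0.25 m limit and the names are just examples to experiment with):
Code:
<?php
// Flag samples that jump more than $maxDelta metres away from the last
// value we accepted as plausible. Assumes a fixed sample interval and
// that the first sample is trustworthy.
function flagImplausibleJumps(array $elevations, float $maxDelta = 0.25): array
{
    $flags = array_fill(0, count($elevations), false);
    $lastGood = $elevations[0];

    for ($i = 1; $i < count($elevations); $i++) {
        if (abs($elevations[$i] - $lastGood) > $maxDelta) {
            $flags[$i] = true;                 // suspect reading
        } else {
            $lastGood = $elevations[$i];       // accept as the new reference
        }
    }
    return $flags;
}

// The "bad set" from the original post: indices 2, 3 and 4 get flagged.
print_r(flagImplausibleJumps([33.33, 33.35, 36.8, 38.9, 44.00, 33.41, 33.39]));
?>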
Mostlyharmless' suggestion of placing some limit, or error band, on possible values is sound, but your description lacks the detail needed to make a good choice for those limits.
You say that the samples are a time series, "sampled frequently", but you need to define that sample frequency much better, as well as how that relates to displacement in space, or speed.
If the samples are in linear time, once per second for example, then how much change in elevation is too much depends critically on speed.
On the other hand, if samples are ticked off by distance, such as one sample per X rotations of a wheel, then it is not a time series at all, speed does not appear in the data, and it would be somewhat easier to say how much is too much.
There is also the possibility that samples are neither time nor distance related, such as if the driver is told to take a sample at each intersection, in which case you can't easily pick an error limit as it would then be route dependent!
Based upon "GPS data in a time series" I assume it is just linear time, maybe 1 sample every 1-5 seconds, but I could be wrong.
Is this a GPS receiver that just logs position data to a file that you download, or are you recording "raw" NMEA messages? If the latter is confusing, then never mind. Just curious, because if you were recording NMEA messages then you could possibly check the altitude accuracy reported at that time and throw out that value without having to do much extra post-processing.
Quote:
If the samples are in linear time, once per second for example, then how much change in elevation is too much depends critically on speed.
We're talking about conventional roads here and high frequency sample rates like 1 per second. There's no road where you gain/lose 1 meter or more in just a second.
I think I'm going to do a moving average as that will catch the bad readings, then I can replace the bad reading with the moving average.
Quote:
Based upon "GPS data in a time series" I assume it is just linear time, maybe 1 sample every 1-5 seconds, but I could be wrong.
Is this a GPS receiver that just logs position data to a file that you download, or are you recording "raw" NMEA messages? If the latter is confusing, then never mind. Just curious, because if you were recording NMEA messages then you could possibly check the altitude accuracy reported at that time and throw out that value without having to do much extra post-processing.
At the moment I'm using a mobile phone app that writes your track to a .gpx file. Bicyclists and runners use standalone GPS devices from brands like Garmin; it's the same idea, only I'm using my mobile phone.
I am checking altitude accuracy using USGS data. The altitude written into the .gpx file from my mobile phone is very wrong. But checking *every* point, when the file has one line per second, is not really practical and is time-consuming. My idea is to check a few points and adjust the .gpx data to get close enough.
Eventually, I'll do the same adjustments for the occasional bad longitude/latitude values. That's a different problem though.
Quote:
We're talking about conventional roads here and high frequency sample rates like 1 per second. There's no road where you gain/lose 1 meter or more in just a second.
Sure there are road conditions where you can gain/lose 1 meter in less than a second; you're not considering velocity here, are you? You can sample GPS faster; however, it also depends on whether the correlators in the device will provide output that fast. We've used 5 Hz and 10 Hz on devices, but there are faster ones, just not commercial. I recommend something using a binary protocol like OSP or another proprietary binary protocol so you can get much more detailed information. However, this all sort of falls apart because commercial GPS only promises +/- 3 meter accuracy.
Quote:
Originally Posted by michaelk
Just as an FYI, navigation systems use a Kalman filter to get rid of outlying sensor data. Much too complicated for your project.
Agreed, and whether or not you're looking directly at the messages, those messages are already past the correlators in the device. There are no GPS devices where you can see the raw data as received from the satellites; they already have their filter in the mix, because that is their product.
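For reference only, a one-dimensional Kalman filter on just the elevation is only a handful of lines; the hard part is tuning the noise terms, which is why it's overkill here. A sketch assuming a random-walk elevation model (names and noise values are purely illustrative):
Code:
<?php
// One-dimensional Kalman filter treating elevation as a random walk.
// $processNoise: how much the true elevation is expected to drift per sample.
// $measurementNoise: how noisy the GPS altitude readings are believed to be.
function kalman1d(array $measurements, float $processNoise = 0.01, float $measurementNoise = 9.0): array
{
    $estimate = $measurements[0];
    $errorCov = 1.0;                    // initial estimate uncertainty
    $out = [$estimate];

    for ($i = 1; $i < count($measurements); $i++) {
        // Predict: the model says elevation stays put, so only uncertainty grows.
        $errorCov += $processNoise;

        // Update: blend the prediction with the new measurement.
        $gain = $errorCov / ($errorCov + $measurementNoise);
        $estimate += $gain * ($measurements[$i] - $estimate);
        $errorCov *= (1.0 - $gain);

        $out[] = $estimate;
    }
    return $out;
}
?>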
You might want to look into one of the highly accurate GPS devices which claim to use dead reckoning. However, my caveat is that while we've "tried", two or "ten" things interfere. Firstly, the accuracy of the accelerometers on a phone is horrible, and they'll want 16-bit accelerometer accuracy in 10 dimensions: staticXYZ, magXYZ, gyroXYZ, and, for your case, barometric. Secondly, all the vendors who promise this somehow can't give me a demo/devkit board which actually works. They just tell me, "Oh, we've used the blah-blah chip with our GPS module ... that works," and then they never answer the phone or email again. You need to calibrate the position sensors and ensure that there are no magnetic interferences (and on a road there always are: cars/trucks, sign posts, iron content in the soil, underground pipes). And you'll need to control the temperature with enough stability to get the gyros accurate.
Quote:
Originally Posted by mostlyharmless
For your particular problem, I wouldn't approach it statistically or with a filter: just do it the way you identified it. Put a limit on the physically possible. Flag any values where the change is implausible, say, greater than 0.25 meters. You can change the threshold number with some experimentation.
[edit] Strictly speaking, what I have suggested is a filter, though simple. I am a big believer in less processing of data with fewer assumptions. I did assume your data was a fixed time interval time series from your description.
This is what I'm going to do. I'm time-constrained, and I too prefer the simplest answer. I'm using consumer-level GPS devices, and that's all. I'll do a moving average, as that isn't hard to do with a small array (array_push/array_pop); something like the sketch below.
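Roughly what I have in mind (window size, threshold and names are just examples; note that a FIFO window needs array_shift rather than array_pop to drop the oldest value):
Code:
<?php
// Replace suspect readings with the average of a small window of recently
// accepted values. Window size (5) and threshold (0.25 m) are examples.
function smoothElevations(array $elevations, int $window = 5, float $maxDelta = 0.25): array
{
    $buffer = [];     // sliding window of accepted values
    $out = [];

    foreach ($elevations as $e) {
        $avg = count($buffer) > 0 ? array_sum($buffer) / count($buffer) : $e;

        if (abs($e - $avg) > $maxDelta) {
            $out[] = $avg;                 // suspect: substitute the moving average
        } else {
            $out[] = $e;                   // plausible: keep it and extend the window
            array_push($buffer, $e);
            if (count($buffer) > $window) {
                array_shift($buffer);      // drop the oldest accepted value
            }
        }
    }
    return $out;
}

print_r(smoothElevations([33.33, 33.35, 36.8, 38.9, 44.00, 33.41, 33.39]));
?>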
You can also use other statistics like "Standard Deviation."
You might wish to calculate the distance (Pythagorean Theorem ...) from one point to the next, and consider whether it just had you moving several miles in a few seconds' time. If so, consider larger ranges until the distance ... the velocity of the traveler ... becomes "plausible" again. The intermediate points within that range might be outliers.
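A sketch of that check, using a flat-earth approximation in place of a plain Pythagorean distance since the points are latitude/longitude (the 40 m/s speed ceiling is only an example):
Code:
<?php
// Approximate horizontal distance in metres between two lat/lon points.
// A flat-earth (equirectangular) approximation is fine for consecutive
// GPS samples that are at most a few hundred metres apart.
function approxDistanceMeters(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $earthRadius = 6371000.0;   // metres
    $x = deg2rad($lon2 - $lon1) * cos(deg2rad(($lat1 + $lat2) / 2));
    $y = deg2rad($lat2 - $lat1);
    return sqrt($x * $x + $y * $y) * $earthRadius;
}

// Implied speed between two samples; flag the later point if it moves
// faster than anything plausible for road travel.
function isSpeedPlausible(float $meters, float $seconds, float $maxSpeed = 40.0): bool
{
    return $seconds > 0 && ($meters / $seconds) <= $maxSpeed;
}
?>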