I have some data that should follow a straight line, but there are spikes in
it. Rather than manually editing the data to remove the spikes, which is
tedious and time-consuming, I wanted a function that would intelligently
decide which datapoints were bad and zap them.
Initially I thought this would be trivial: do the fit, calculate the
residuals, and zap all data lying further than one standard deviation from
the line. However, the spikes are sufficiently large that they can skew the
linear fit, and hence the determination of which points to drop. Doing a
threshold zap on the raw data isn't adequate either, since the values of
the data aren't sufficiently consistent between sets.
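For what it's worth, that naive version looks something like this in rough
Python/numpy terms (just a sketch to pin the idea down, not my actual
implementation):

    import numpy as np

    def naive_zap(x, y):
        # Fit on everything, spikes included, then keep only the points
        # whose residual lies within one standard deviation of the line.
        slope, intercept = np.polyfit(x, y, 1)
        residuals = y - (slope * x + intercept)
        return np.abs(residuals) <= np.std(residuals)

The problem, as above, is that the spike skews both the fit and the spread
of the residuals, so the cut-off ends up in the wrong place.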
I tried differentiating the residuals (and, in another attempt, the
straight-line data itself), reasoning that the differential should be
constant everywhere other than around the spike, so a simple threshold
would then tell me which data to kill. Again no success: the differentiated
data on either side of the spike are far from the constant line, but the
data of the spike itself are not, so the threshold flags the edges rather
than the bad points.
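In the same rough Python/numpy terms, the differencing attempt was
essentially the following (the factor of 3 on the threshold is arbitrary):

    import numpy as np

    def diff_flags(residuals, k=3.0):
        # Threshold the first difference of the residuals. The large values
        # land on the rising and falling edges of the spike rather than on
        # the spike samples themselves, which is why this didn't work.
        d = np.diff(residuals)
        return np.abs(d) > k * np.std(d)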
The approach I've settled on is to do the linear fit on all the data and
get the MSE, then zap the first datapoint and get the MSE, replace the
first and zap the second, and so on until I have a table showing the effect
of removing each individual datapoint on the overall fit quality. I then
find the most significant datapoint, check whether removing it gives a
percentage decrease in the MSE better than a user-defined threshold, and
repeat the process until either there are no more significant gains to be
made or a limit on the percentage of data that can be dropped is reached.
To avoid zapping down to two datapoints, I've chosen 50% as the maximum
fraction of points that can be lost, and a 50% decrease in the MSE as the
improvement required to justify the loss of a datapoint.
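In rough Python/numpy terms the whole procedure is something like the
following (a sketch only; the 50% figures are the defaults mentioned above,
and the function names are just for illustration):

    import numpy as np

    def fit_mse(x, y):
        # Least-squares straight-line fit; return the MSE of the residuals.
        slope, intercept = np.polyfit(x, y, 1)
        residuals = y - (slope * x + intercept)
        return np.mean(residuals ** 2)

    def zap_outliers(x, y, max_drop_frac=0.5, min_improvement=0.5):
        # Repeatedly remove the single point whose removal improves the MSE
        # the most, stopping when the fractional improvement drops below
        # min_improvement or when max_drop_frac of the points are gone.
        x, y = np.asarray(x, float), np.asarray(y, float)
        keep = np.ones(len(x), dtype=bool)
        for _ in range(int(max_drop_frac * len(x))):
            idx = np.flatnonzero(keep)
            base_mse = fit_mse(x[idx], y[idx])
            if base_mse == 0:
                break  # already a perfect fit, nothing left to gain
            # Refit with each remaining point left out in turn.
            trial_mse = np.array([fit_mse(np.delete(x[idx], i),
                                          np.delete(y[idx], i))
                                  for i in range(len(idx))])
            best = np.argmin(trial_mse)
            if (base_mse - trial_mse[best]) / base_mse < min_improvement:
                break  # no more significant gains to be made
            keep[idx[best]] = False  # zap the most significant datapoint
        return keep  # boolean mask of the points to retain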
This works well, but it means that if there are n datapoints, n linear
fits must be carried out for each datapoint dropped, so the number of fits
grows roughly with the square of the data size. Execution time therefore
climbs steeply as the datasets get larger, which always worries me
slightly.
One thing I've already thought of is the possibility of zapping more than
one point per iteration; however, the fit changes so much when a single
outlier is removed that I suspect valid data may be lost that way.
Does anyone have any other suggestions for zapping erroneous data without
zapping good data? This technique is fine for small datasets, and I guess
the threshold-based zapping is fine for large datasets, where the loss of a
few points changes the fit only very little, but if I ever need to do this
on data in the intermediate region I might have to start playing about
again.
--
Craig Graham
Physicist/Labview Programmer
Lancaster University, UK