I have some data that should follow a straight line, but there are spikes in
it. Rather than manually editing the data to remove the spikes, which is
tedious and time-consuming, I wanted a function that would intelligently
decide which datapoints were bad and zap them.
Initially I thought this would be trivial: do the fit, calculate the
residuals, and zap all data lying further than one standard deviation from
the line. However, the spikes are sufficiently large that they can skew the
linear fit, and hence the determination of which points to drop. Doing a
threshold zap on the raw data isn't adequate either, since the values of
the data aren't sufficiently consistent between sets.
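For what it's worth, that naive version looks something like this in rough
Python/numpy terms (just a sketch to pin the idea down, not my actual
implementation):

    import numpy as np

    def naive_zap(x, y):
        # Fit on everything, spikes included, then keep only the points
        # whose residual lies within one standard deviation of the line.
        slope, intercept = np.polyfit(x, y, 1)
        residuals = y - (slope * x + intercept)
        return np.abs(residuals) <= np.std(residuals)

The problem, as above, is that the spike skews both the fit and the spread
of the residuals, so the cut-off ends up in the wrong place.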
I tried differentiating the residuals (and, in another attempt, the
straight-line data itself), reasoning that the differential should be
constant everywhere other than around the spike, so a simple threshold
would then tell me which data to kill. Again no success: the differentiated
data on either side of the spike are far from the constant line, but the
data of the spike itself are not, so the threshold flags the edges rather
than the bad points.
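In the same rough Python/numpy terms, the differencing attempt was
essentially the following (the factor of 3 on the threshold is arbitrary):

    import numpy as np

    def diff_flags(residuals, k=3.0):
        # Threshold the first difference of the residuals. The large values
        # land on the rising and falling edges of the spike rather than on
        # the spike samples themselves, which is why this didn't work.
        d = np.diff(residuals)
        return np.abs(d) > k * np.std(d)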
The approach I've settled on is to do the linear fit on all the data and
get the MSE, then zap the first datapoint and get the MSE, replace the
first and zap the second, and so on until I have a table showing the effect
of removing each individual datapoint on the overall fit quality. I then
find the most significant datapoint, check whether removing it gives a
percentage decrease in the MSE better than a user-defined threshold, and
repeat the process until either there are no more significant gains to be
made or a limit on the percentage of data that can be dropped is reached.
To avoid zapping down to two datapoints, I've chosen 50% as the maximum
fraction of points that can be lost, and a 50% decrease in the MSE as the
improvement required to justify the loss of a datapoint.
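In rough Python/numpy terms the whole procedure is something like the
following (a sketch only; the 50% figures are the defaults mentioned above,
and the function names are just for illustration):

    import numpy as np

    def fit_mse(x, y):
        # Least-squares straight-line fit; return the MSE of the residuals.
        slope, intercept = np.polyfit(x, y, 1)
        residuals = y - (slope * x + intercept)
        return np.mean(residuals ** 2)

    def zap_outliers(x, y, max_drop_frac=0.5, min_improvement=0.5):
        # Repeatedly remove the single point whose removal improves the MSE
        # the most, stopping when the fractional improvement drops below
        # min_improvement or when max_drop_frac of the points are gone.
        x, y = np.asarray(x, float), np.asarray(y, float)
        keep = np.ones(len(x), dtype=bool)
        for _ in range(int(max_drop_frac * len(x))):
            idx = np.flatnonzero(keep)
            base_mse = fit_mse(x[idx], y[idx])
            if base_mse == 0:
                break  # already a perfect fit, nothing left to gain
            # Refit with each remaining point left out in turn.
            trial_mse = np.array([fit_mse(np.delete(x[idx], i),
                                          np.delete(y[idx], i))
                                  for i in range(len(idx))])
            best = np.argmin(trial_mse)
            if (base_mse - trial_mse[best]) / base_mse < min_improvement:
                break  # no more significant gains to be made
            keep[idx[best]] = False  # zap the most significant datapoint
        return keep  # boolean mask of the points to retain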
This works well, but it means that if there are n datapoints, n linear
fits must be carried out for each datapoint dropped, so the number of fits
grows roughly with the square of the data size. Execution time therefore
climbs steeply as the datasets get larger, which always worries me
slightly.
One thing I've already thought of is the possibility of zapping more than
one point per iteration; however, the fit changes so much when a single
outlier is removed that I suspect valid data may be lost that way.
Does anyone have any other suggestions for zapping erroneous data without
zapping good data? This technique is fine for small datasets, and I guess
the threshold-based zapping is fine for large datasets, where the loss of a
few points changes the fit only very little, but if I ever need to do this
on data in the intermediate region I might have to start playing about
again.
--
Craig Graham
Physicist/Labview Programmer
Lancaster University, UK