12-03-2010 11:27 AM
hello,
I understand that rounding errors exist and I need help determining the best way to handle them. The particular situation I am dealing with relates to finding the equation of a line and comparing the calculated values to a data set.
The big picture: I have a data set that looks generally exponential with sharp peaks along the curve. The exponential part is background noise that I want to eliminate. After trying various methods, I've determined that taking a convex hull is the best way to eliminate the background noise. The convex hull is built by drawing a line through point A and point B of the data and determining whether all of the data is above (greater than) the line (similar to a tangent line). Once a point B is found that satisfies the condition, point B becomes the new point A and a new point B is sought.
I compare the data set to the points that create the line. The X values of the data set and the line are the same integer values. The Y values of the data set are DBL. I have attached a VI demonstrating the simple calculations I am doing to find the Y values along the line. It simply computes the slope, m = (y1-y0)/(x1-x0), and uses the slope to find the y-intercept, b = y0 - m*x0. I then use the slope and y-intercept to calculate the Y values along the line. I subtract the line Y values from the data set Y values and check whether they are all positive. If they are, I can conclude that all the points are above the line, and point B (x1,y1) is kept as part of the convex hull.
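(If it helps, here is what the VI's calculation amounts to in text form - a rough Python sketch of the same line test, not the actual LabVIEW code; the names are just for illustration.)

def line_test(x, y, i0, i1):
    # True if every point lies on or above the line through points i0 and i1.
    m = (y[i1] - y[i0]) / (x[i1] - x[i0])   # slope
    b = y[i0] - m * x[i0]                   # y-intercept
    return all(yi - (m * xi + b) >= 0.0 for xi, yi in zip(x, y))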
The problem is that rounding error is causing "false negatives." In other words, rounding error makes the actual y1 and the y1 calculated from the equation of the line differ. Since (x1, y1) is a data set point that defined the line, the calculated value should equal it exactly, but the rounding error does not allow that.
Currently, I am adding a very small constant (1.5E-12) to the data set point to account for the error and prevent false negatives (i.e., a negative difference where the result should be 0). However, this small number is not always sufficient.
How can I find the maximum error, so that I can add the error to the data set point and therefore always prevent false negatives?
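To make the question concrete, what I am doing now amounts to something like this (again a Python sketch of the idea, not the VI; the fixed constant is the part I would like to replace with a proper error bound):

import numpy as np

TOL = 1.5e-12   # the fixed constant mentioned above - not always sufficient

def all_above(data_y, line_y):
    # Treat a tiny negative difference as "on the line" rather than below it.
    diff = np.asarray(data_y) - np.asarray(line_y)
    return np.all(diff >= -TOL)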
Thanks!
Kristen
12-03-2010 11:42 AM
It seems like there may be a better way. Can you show us an example of a typical data set? How many peaks? What is the range of the data (how many orders of magnitude)? Do you know the characteristics of the exponential? That is, do you have a theoretical model of the underlying process?
Lynn
12-03-2010 02:38 PM - edited 12-03-2010 02:39 PM
Hi Lynn,
Thanks for your help.
I have attached a text file with an example data set. The range is 0-6000 (look around x=187 and x=300 for the maxima). There is a middle region (220-260ish) that is irrelevant; otherwise, there are eight peaks in total, four on each side, symmetric about the center. I am doing the convex hull on each half of the data. The data set actually begins as an x-ray diffraction image (the irrelevant region is the backstop that prevents the x-ray from saturating the detector, so only the diffraction is captured), from which I take an ROI and sum down the columns. The result is this data: pixel location and summed pixel values. (Yes, I know the x values in the text file are not integers; this is because some of the end points and the irrelevant area are removed, and a ramp function is used to create new x values evenly spaced between the new beginning and end of the data set. This part of the code is inherited and I need to check with the programmer why this method was used.)
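In text form, the reduction from image to the attached 1-D data is roughly this (a Python sketch with made-up names, just to show the idea):

import numpy as np

def profile_from_image(image, row0, row1, col0, col1):
    # Illustrative only: sum an ROI down its columns, then generate a ramp of
    # evenly spaced x values for the resulting profile.
    roi = image[row0:row1, col0:col1]            # region of interest
    y = roi.sum(axis=0).astype(float)            # sum down the columns
    x = np.linspace(col0, col1 - 1, num=y.size)  # evenly spaced x values
    return x, y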
Previously, I tried doing a curve fit of the whole data set (background not subtracted), with the background described as h0 + h1*exp(-h2*(x-r)) + h3*exp(-h4*(x-r)) and the peaks described as a convolution of a Gaussian and a Lorentzian peak: abs(a0)*((1/(1+f^2))*exp(-((x-(r+d0))/w)^2) + (1-(1/(1+f^2)))/(1+((x-(r+d0))/w)^2))
where
r is the theoretical median of the data (point about which the image is horizontally symmetric)
a0 is the amplitude of the peak
d0 is the distance from symmetric center to the center of the peak
w is the full width at half maximum of the peak
f is a factor to determine if the peak is more Gaussian or Lorentzian
I added the background model to the peak models (subtracting d0 from r for the symmetrically equivalent peak) and used the Non-linear Curve Fit VI to fit the equation. The results were decent, but the convex hull -> subtract background -> peak fit method seemed to fit the peaks better (correlation .999 vs .99), which is the important part.
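For reference, the model written out in text form looks like this (a Python sketch of the formulas above, not the LabVIEW implementation):

import numpy as np

def background(x, h0, h1, h2, h3, h4, r):
    return h0 + h1 * np.exp(-h2 * (x - r)) + h3 * np.exp(-h4 * (x - r))

def peak(x, a0, r, d0, w, f):
    # Weighted combination of a Gaussian and a Lorentzian centered at r + d0.
    u = (x - (r + d0)) / w
    wt = 1.0 / (1.0 + f ** 2)
    return abs(a0) * (wt * np.exp(-u ** 2) + (1.0 - wt) / (1.0 + u ** 2))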
Thoughts?
Kristen
12-06-2010 02:47 PM
Hello Kristen,
You could try multiplying the X and Y values by a large number and then proceeding with your calculation. This may reduce the rounding errors to start.
With regards to removing the peaks from your data, have you looked at the Peak Detector VI? This may be able to highlight those peaks by using a threshold. From there you can determine the locations and proceed as you wish.
Best of luck!
12-06-2010 04:01 PM
Kristen,
Because of the symmetry, it appears that splitting the data at the midpoint, reversing one part, and adding the two together improves the signal-to-noise ratio. The peaks add and the random part of the background tends to cancel. Then you have only a single exponential in the background.
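In rough terms (a Python sketch, assuming the profile is a NumPy array y and you know the midpoint index):

import numpy as np

def fold(y, mid):
    # Reverse the left half and add it to the right half so the symmetric
    # peaks reinforce while the random part of the background tends to cancel.
    n = min(mid, len(y) - mid)
    return y[mid - n:mid][::-1] + y[mid:mid + n]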
Lynn
12-07-2010 01:12 PM - edited 12-07-2010 01:13 PM
Hi Lynn,
I agree this certainly does improve the SNR, and this method would work great for finding the intensity of the peaks. However, I spoke with the scientist running this investigation and he brought up a good (and unfortunate) point. We are interested in measuring the position of the peaks, and, in theory, the peaks should be symmetric about the "True Center," but the Center in the image may shift due to pixelation. The folding/averaging would reduce the sensitivity of that measurement.
Also, I stumbled across this document on NI's website: An Introduction to Floating-Point Behavior in LabVIEW. I thought I would post the link so others looking for more information may find it too.
~Kristen
12-08-2010 07:29 AM
Kristen,
If you seek subpixel resolution on the peak locations, that would not work. With most peaks having only two or three data points above baseline, I think you are going to have problems due to the amount of noise in the system.
How much data do you need to analyze? Is it feasible to manually adjust the baseline correction rather than trying to completely automate it?
Lynn
12-08-2010 07:54 AM
Hi Lynn,
The raw data is 488x196 images. There are 15 folders, each representing an experiment, and each folder contains anywhere from 400-600 images. The images are averaged down to 32 frames (the experiment has a 32-frame cycle: frame t, frame t+32, frame t+64, etc. are averaged together to give an averaged cycle of only 32 frames). So, at minimum, there are 15x32 = 480 images to be analyzed. (The data set I provided is x = horizontal pixel location, y = sum of the pixel values at x within an ROI.)
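In other words, the averaging step is roughly this (a Python sketch, assuming frames is an array of images in acquisition order):

import numpy as np

def cycle_average(frames, period=32):
    # Average frame t with frames t+period, t+2*period, ... to collapse the
    # experiment's 32-frame cycle into a single averaged cycle.
    frames = np.asarray(frames)            # shape: (n_frames, height, width)
    n = (len(frames) // period) * period   # drop any incomplete cycle
    return frames[:n].reshape(-1, period, *frames.shape[1:]).mean(axis=0)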
Yeah, I realize the noise and the fact that only a couple of points define each peak are problematic. Unfortunately, the hardware is the limiting factor there, and we're trying to get as much information as we can out of the data we have - at least some reliable trends, not necessarily exact measurements.
Thanks for your help!
Kristen
12-08-2010 03:34 PM
I figured I would have had a chance to do some testing by now, but that hasn't worked out, so prepare for some hypotheses that may or may not be accurate.
1) If you are making pretty pictures, a heuristic method like the convex hull is perfectly acceptable. For data analysis you would like to have a solid basis for understanding the uncertainties, and that is where an established curve fitting method comes in handy. Without going through the calculations I can estimate that your accuracy with the convex hull will be approximately that of the standard finite difference method for finding a derivative, namely the square root of machine epsilon, or about 10^-8 (you can test it). That should bound your maximum error. If it sounds like a relatively large number, that is one reason that straight finite differencing should be avoided when possible.
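If you want to check that bound, it is just the square root of machine epsilon for a double (a quick Python check; the same value applies to LabVIEW's DBL):

import numpy as np

eps = np.finfo(np.float64).eps   # machine epsilon for a 64-bit double, about 2.2e-16
print(np.sqrt(eps))              # about 1.5e-8, the error estimate quoted above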
2) Looking at the data in Lynn's post, and having seen many, many different curves in my day, my eyes are telling me that you have a Lorentzian background. It may seem like an exponential, but it has a slower fall-off (more power in the wings, as we say). I bet if you try the combined peaks+background fit with a Lorentzian background instead of an exponential, you will get better results. Or just fit the Lorentzian background to remove it. You might just pick up that extra 9 in your correlation. In a perfect world, the experimental configuration would give you a reason to expect a Lorentzian background.
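Something along these lines, for example (a Python sketch using SciPy's curve_fit as a stand-in for the Non-linear Curve Fit VI; the parameter names are placeholders):

import numpy as np
from scipy.optimize import curve_fit

def lorentzian_bg(x, amp, x0, gamma, offset):
    # Lorentzian background: peaks at x0 and falls off more slowly than an exponential.
    return offset + amp * gamma ** 2 / ((x - x0) ** 2 + gamma ** 2)

# popt, pcov = curve_fit(lorentzian_bg, x_data, y_data)
# background_removed = y_data - lorentzian_bg(x_data, *popt)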
3) Don't worry too much about the number of points in your peak; the eyes are typically a bad judge of statistical significance. The numbers will tell the story.