
Is LabVIEW's NI_Matrix.lvlib already optimized for multi-core processors?

Part of my LabVIEW application involves several (roughly 4 to 16) nonlinear least-squares fits of raw data using LabVIEW's NI_Gmath.lvlib:Nonlinear Curve Fit.vi. Originally, the code executed these fits inside the inner loop of two nested for loops: the outer loop index selects the x-data (or window) for the fit, and the inner index selects the peak number within that window. The nested-loop structure was meant to save memory by avoiding re-processing of the shared x-data for each peak. The call to NI_Gmath.lvlib:Nonlinear Curve Fit.vi inside the inner loop is the main purpose of this VI.
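Since LabVIEW is graphical, here is a textual Python analogue of that nested-loop structure, purely as a sketch. The window data and the fit routine are hypothetical; `numpy.polyfit` stands in for the real nonlinear curve fit so the shape of the loops is clear:

```python
import numpy as np

def fit_all_peaks_nested(windows):
    """Nested-loop analogue: outer index = window, inner index = peak.

    `windows` is a list of (x, peaks) pairs, where `peaks` is a list of
    y-arrays that all share that window's x-data. A simple polynomial
    fit stands in here for the real nonlinear curve fit.
    """
    results = []
    for x, peaks in windows:        # outer loop: one x-array per window
        for y in peaks:             # inner loop: one fit per peak
            coeffs = np.polyfit(x, y, deg=2)   # stand-in for the fit VI
            results.append(coeffs)
    return results

# Tiny synthetic example: 2 windows with 2 peaks each
x = np.linspace(0.0, 1.0, 50)
windows = [(x, [x**2, 2 * x**2]), (x, [x**2 + 1.0, 3 * x**2])]
results = fit_all_peaks_nested(windows)
print(len(results))  # one coefficient set per peak
```

The point of the structure is that the x-data is produced once per outer iteration and reused by every inner iteration, which is the memory saving described above.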

 

Now that I have a 6-core processor to run the code on, I tried parallelizing the inner loop so it could run on multiple cores. I also tried rewriting the code a bit: collapsing the two loops into one, doing some index arithmetic to keep the bookkeeping right, and parallelizing that single loop. Parallelizing the inner loop of the original nested-loop code did give a performance increase, but assigning more cores to the single-loop version actually decreased its performance. Even so, the single-loop code with no loop parallelization at all executes much faster than the nested-loop code.
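The index arithmetic for collapsing the two loops can be sketched in Python as follows. This assumes, hypothetically, an equal number of peaks per window; ragged windows would need a prefix-sum lookup instead of `divmod`:

```python
# Flatten two nested loops into one parallelizable loop via index
# arithmetic. The flat index i covers every (window, peak) combination.
n_windows = 4
peaks_per_window = 4

def flat_index_to_pair(i):
    # i runs over 0 .. n_windows * peaks_per_window - 1
    return divmod(i, peaks_per_window)   # -> (window index, peak index)

pairs = [flat_index_to_pair(i) for i in range(n_windows * peaks_per_window)]

# Each (window, peak) pair appears exactly once, so the single loop does
# the same work as the nested pair, and its iterations are independent,
# which is what makes the loop safe to parallelize.
print(pairs[:3])  # → [(0, 0), (0, 1), (0, 2)]
```

The trade-off, as noted above, is that the flat loop re-derives the window from the index on every iteration instead of holding it fixed across the inner loop.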

 

I ran the Profile Performance and Memory tool to find out what was taking longer in the nested-loop code. The extra matrix manipulations (resizing, indexing, etc.) in the single-loop code added some execution time to the main VI that calls the fitting (36 ms versus 24 ms), but calls to NI_Matrix.lvlib:Add - RM,RM.vi in a subVI of the fitting VI are what add the extra time in the nested-loop code. This matrix addition VI is the most time-consuming of every VI listed in the profiler. When the fitting is called from the nested loop, the average time spent in this matrix addition is 16 µs, versus 6 µs in the single-loop case.

 

My suspicion is that changing the for-loop structure and adding parallelism could be conflicting with optimizations already built into the NI_Matrix library: forcing the fitting to run on a single processor could be preventing some of the low-level matrix manipulations from being spread across multiple processors. Could this be the case? I've used parallelized for loops to speed up simple arithmetic, but are there any tricks for speeding up nonlinear curve fitting in a loop? And when no parallelism is enforced at all, why should the nested-loop code be so much slower than the single-loop code?
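To illustrate the oversubscription idea behind this suspicion (a hedged Python sketch, not LabVIEW specifics): if the fit routine already uses several threads internally, running P parallel loop instances that each spawn T threads puts P×T threads on 6 cores, and capping the outer parallelism is one standard mitigation. The `internal_threads` value below is an assumption, and `fit_one` is a placeholder workload:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def fit_one(task):
    # Placeholder for one curve fit; in LabVIEW this would be one
    # iteration of the parallelized for loop.
    window, peak = task
    return (window, peak, sum(k * k for k in range(1000)))

tasks = [(w, p) for w in range(4) for p in range(4)]

# Cap outer parallelism so (outer workers) x (threads assumed to be
# used inside the fit routine) does not exceed the core count.
cores = os.cpu_count() or 1
internal_threads = 2            # assumption about the fit library
max_workers = max(1, cores // internal_threads)

with ThreadPoolExecutor(max_workers=max_workers) as pool:
    results = list(pool.map(fit_one, tasks))

print(len(results))  # one result per (window, peak) task
```

In LabVIEW terms, the analogous knob is the number of generated parallel loop instances in the For Loop's iteration parallelism dialog, which can be set below the core count for the same reason.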

 

Thanks for your help!

Message 1 of 3

Hi,

 

I think I'm on the same page as you. But just to be sure, could you post the two versions of the code you're referring to? I'd like to take a look and also try reproducing the behavior on my machine.

 

Thanks

 

Justin Parker
National Instruments
Product Support Engineer
Message 2 of 3

Here's the code. The fastest VI is the non-nested --Parallel.vi. I contacted NI technical support a little while ago; my reference number for this issue is #7290827. Unless I've made a silly mistake while modifying the code, I'm pretty sure the performance difference is real. Thanks for your help!

Message 3 of 3