07-27-2010 10:46 PM
I did a performance comparison of "Add Array Elements" against two other methods (adding the elements in a loop, and the BLAS ddot function) to see which was fastest, and also to compare their results, since I was observing some slight differences.
Conclusions
- BLAS ddot is the fastest (by a fair bit), then Add Array Elements, then the add in a loop.
- BLAS ddot and Add Array Elements give identical results almost all of the time.
- I don't know why adding the elements manually in a loop gives a slightly different result. I understand there are rounding errors due to the finite precision, but why is it any different from the other two methods?
07-28-2010 01:52 AM
Don't forget the plain Dot Product from the Linear Algebra palette. It's about the same speed as the BLAS version (maybe even a few percent faster).
(I would also disable debugging. The loop version has a slightly larger debugging overhead, while Add Array Elements needs to allocate one extra large array. In my testing, the loop version is possibly slightly faster.)
07-28-2010 01:54 AM
This is a classic issue in numerical computation: the quality of the result depends on the order of operations.
A simple example is the calculation of a*b/c with a, b, c equal to 25, 200 and 100 respectively. The result should be 50, but if a, b and c are U8, the result depends on the operation order, as illustrated in the example below. With DBL and non-integer data, the difference is of course much smaller, but it still exists!
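The original attachment was a LabVIEW snippet; here is a rough C analogue of the same 25 * 200 / 100 case with unsigned 8-bit values. C's uint8_t arithmetic wraps modulo 256, whereas the exact behaviour of LabVIEW's U8 coercion and integer division differs slightly, so treat this as an illustration of the order dependence rather than an exact reproduction:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t a = 25, b = 200, c = 100;

    /* Multiply first: 25 * 200 = 5000, which wraps to 5000 mod 256 = 136
       when stored in a U8, then 136 / 100 = 1 with integer division. */
    uint8_t mul_first = (uint8_t)(a * b) / c;

    /* Divide first: 200 / 100 = 2, then 25 * 2 = 50, the expected answer. */
    uint8_t div_first = a * (uint8_t)(b / c);

    /* Divide the other operand first: 25 / 100 = 0, then 0 * 200 = 0. */
    uint8_t other_div = (uint8_t)(a / c) * b;

    printf("(a*b)/c = %d   a*(b/c) = %d   (a/c)*b = %d\n",
           mul_first, div_first, other_div);
    return 0;
}
```

Three orderings of the same expression give 1, 50 and 0 respectively, which is the point being made above.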
In your case, the difference also arises from the order of operations:
- in the Add Array Elements version, the element-by-element products are all calculated first, and then the resulting array elements are added.
- in the loop, the add is performed after each multiply (see the sketch below).
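To show how much reordering alone can matter in DBL, here is a minimal C sketch that computes the same dot product with two different accumulation orders: a plain sequential loop, and a blocked accumulation with four partial sums. The exact order used internally by Add Array Elements or BLAS ddot is an implementation detail, so the blocked version and the function names dot_sequential / dot_blocked are assumptions purely for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Plain loop: the sum is accumulated strictly left to right,
   ((x0*y0 + x1*y1) + x2*y2) + ... */
static double dot_sequential(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Blocked accumulation: four interleaved partial sums, combined at the end.
   This is only an assumed stand-in for whatever order a vectorized
   primitive or BLAS routine uses internally. */
static double dot_blocked(const double *x, const double *y, int n)
{
    double p[4] = {0.0, 0.0, 0.0, 0.0};
    for (int i = 0; i < n; i++)
        p[i % 4] += x[i] * y[i];
    return (p[0] + p[1]) + (p[2] + p[3]);
}

int main(void)
{
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; i++) {
        x[i] = (double)rand() / RAND_MAX;
        y[i] = (double)rand() / RAND_MAX;
    }

    /* The two orderings are mathematically identical but usually differ
       in the last bits of the 64-bit result. */
    printf("sequential: %.17g\n", dot_sequential(x, y, N));
    printf("blocked:    %.17g\n", dot_blocked(x, y, N));

    free(x);
    free(y);
    return 0;
}
```

With a million random elements the two printed values typically agree except for the final digits, which matches the "slightly different result" reported in the original post.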
07-28-2010 02:04 AM
Thank you altenbach for mentioning the other dot product function. I tried it out as well, and yes, it is just slightly faster than the BLAS ddot.
Thank you for the explanation, Chilly Charly. That does make sense now.
I wonder then, which order is best? Perhaps there is no consistent answer - each is slightly wrong.
07-28-2010 02:29 AM - edited 07-28-2010 02:33 AM
pauldavey wrote:
I wonder then, which order is best? Perhaps there is no consistent answer - each is slightly wrong.
The optimal order depends on the data 😉
I have a timing question (Altenbach?):
With the attached VI, where I have removed the wires crossing the frames, the timing difference between Add Array Elements and the add-in-loop is much larger (2x). Does this mean that there is some overhead in transferring data across frames?
And there is no significant difference between BLAS and the dot product.
07-28-2010 03:25 AM
@chilly charly wrote:
With the attached VI, where I have removed the wires crossing the frames, the timing difference between Add Array Elements and the add-in-loop is much larger (2x). Does this mean that there is some overhead in transferring data across frames?
Interesting observation.
I take a wild guess: If you look at the buffer allocations, you see that now the buffer allocation for the second array occurs right at the sequence boundary. Could it be the allocation is now made before timing starts, falsifying the results?
(When the wires cross the frames as in the original example, the allocation occurs at the multiply node instead.)
07-28-2010 04:20 AM - edited 07-28-2010 04:25 AM
altenbach wrote:
If you look at the buffer allocations, you see that now the buffer allocation for the second array occurs right at the sequence boundary. Could it be the allocation is now made before timing starts, falsifying the results?
(When the wires cross the frames as in the original example, the allocation occurs at the multiply node instead.)
Good catch. Buffer allocation is indeed different. But which is the falsified result?
Edit: doing additional timing tests.
07-28-2010 10:22 AM
I think the falsified result is the fast one, because it hides the cost of the buffer allocation. The size of the array is known at the time the FOR loop starts running, way before the first tick is taken.
There are additional potential flaws (but I don't think they make a real difference). In your version, some of the output subtractions can be calculated after, e.g., the third and fourth frames of the sequence, and the corresponding indicators can be scheduled for updates. Since this already happens while the later frames are still executing, those updates could compete for CPU with the later frames, making them look slower than expected.
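As an illustration of how allocation placement can skew a benchmark, here is a hypothetical C sketch (POSIX timing; nothing to do with the actual VI) that measures the same multiply-and-sum twice, once with the output buffer allocated inside the timed region and once with it allocated beforehand. The second variant hides the allocation cost in the same way a buffer allocated at the sequence boundary would:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    double *x = malloc(N * sizeof *x);
    for (int i = 0; i < N; i++)
        x[i] = (double)i;

    /* Variant A: the output buffer is allocated inside the timed region,
       like a buffer allocated at the multiply node. */
    double t0 = now_ms();
    double *prodA = malloc(N * sizeof *prodA);
    double sumA = 0.0;
    for (int i = 0; i < N; i++) {
        prodA[i] = x[i] * x[i];
        sumA += prodA[i];
    }
    double tA = now_ms() - t0;

    /* Variant B: the same buffer is allocated before timing starts,
       like a buffer allocated at the sequence boundary. */
    double *prodB = malloc(N * sizeof *prodB);
    double t1 = now_ms();
    double sumB = 0.0;
    for (int i = 0; i < N; i++) {
        prodB[i] = x[i] * x[i];
        sumB += prodB[i];
    }
    double tB = now_ms() - t1;

    printf("alloc inside timing: %.2f ms (sum = %g)\n", tA, sumA);
    printf("alloc before timing: %.2f ms (sum = %g)\n", tB, sumB);

    free(x);
    free(prodA);
    free(prodB);
    return 0;
}
```

The exact numbers depend on the allocator and OS, but the point is only that moving the allocation out of the timed region makes variant B look faster without changing the work being measured.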
07-28-2010 12:15 PM
Some results, just to feed the big beast here:
The tests were run on my PowerBook, using VMware Fusion and Windows XP.
I have tested four different wirings:
- outside then inside frame
- both outside
- inside then outside
- both inside
Depending on the wiring, there are three observed timing results: 60, 112 and 120 s.
The longest run corresponds to a buffer allocation at the multiply node, the intermediate timing is obtained when the buffer is allocated at the inner frame boundary, and the shortest when the allocation takes place at the outer frame boundary.
At first sight, the way the buffers are allocated is not very consistent, but I hope some advanced users can shed light on this.
An additional test with a single calculation frame gave equivalent results.