LabVIEW


Comparison of performance and sum result for Summing Array Elements


I did a performance comparison of "Add Array Elements" with 2 other methods (adding the elements in a loop and the BLAS ddot function) to see which was fastest, and also to compare their results, since I was observing some slight differences.

 

Conclusions
BLAS ddot is fastest (by a fair bit), then Add Array Elements, then the Add in a loop.

BLAS ddot and Add Array Elements give identical results almost all of the time.

 

I don't know why adding the elements in a loop manually gives a slightly different result. I understand there are rounding errors due to finite precision, but why is it any different from the other two methods?

0 Kudos
Message 1 of 9
(4,278 Views)

Don't forget the plain dot product from the linear algebra palette. It is about the same speed as the BLAS version (maybe even a few % faster).

 

(I would also disable debugging. The loop version has a slightly larger debugging overhead, while Add Array Elements needs to allocate one extra large array. In my testing, the loop version is possibly slightly faster.)

 

 

0 Kudos
Message 2 of 9
(4,257 Views)
Solution
Accepted by pauldavey

This is a classic of numerical computation: the quality of the result depends on the order of operations.

 

A simple example is the calculation of a * b / c with a, b, and c equal to 25, 200, and 100 respectively. The result should be 50, but if a, b, and c are U8, the result depends on the order of operations, as illustrated in the example below. With DBL and non-integer data, the difference is of course much smaller, but it still exists!

 

[Two attached images illustrating the U8 example]
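For readers without the attachments, here is a rough text-based analogy of the U8 example in Python. It is only a sketch: it assumes that U8 arithmetic wraps around modulo 256 and that the division truncates to an integer, and the `u8` helper and function names are hypothetical. The exact behavior of the LabVIEW diagram in the attachments may differ.

```python
def u8(x):
    """Wrap an integer into the unsigned 8-bit range 0..255 (assumed wrap-around)."""
    return x & 0xFF


def multiply_then_divide(a, b, c):
    # The intermediate product a * b overflows U8 before the division is applied.
    return u8(u8(a * b) // c)


def divide_then_multiply(a, b, c):
    # The quotient b // c is small, so the following multiply does not overflow.
    return u8(a * u8(b // c))


a, b, c = 25, 200, 100
print(multiply_then_divide(a, b, c))  # 5000 wraps to 136, then 136 // 100 = 1
print(divide_then_multiply(a, b, c))  # 200 // 100 = 2, then 25 * 2 = 50
```

Under these assumptions only the second ordering returns the expected 50, because the first one loses information in the overflowing intermediate product.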

 

In your case, the difference also arises from the order of operations:

- with Add Array Elements, all the element-wise products are calculated first, and then the elements of the resulting array are added.

- in the loop, the add is performed after each multiply.
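To see how much the order can matter with DBL values, here is a minimal Python sketch (an analogy, not LabVIEW code). Floating-point addition is not associative, so the same four values summed in two different orders give different totals. The values are chosen purely for illustration, and `sum_in_order` is a hypothetical helper; `math.fsum` is used as a correctly rounded reference.

```python
import math


def sum_in_order(values):
    """Accumulate the values strictly left to right, like a simple loop."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [1e16, 1.0, -1e16, 1.0]
reordered = [1e16, -1e16, 1.0, 1.0]  # same values, different order

print(sum_in_order(values))     # 1.0 (the first 1.0 is absorbed by 1e16)
print(sum_in_order(reordered))  # 2.0
print(math.fsum(values))        # 2.0 (correctly rounded reference)
```

Real data rarely shows a difference this large; with typical DBL arrays the discrepancy usually sits in the last few bits, which matches the "slight differences" reported in the first post.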

 

Chilly Charly    (aka CC)
Message 3 of 9
(4,255 Views)

Thank you altenbach for mentioning the other dot product function. I tried it out also and yes it is just slightly faster than the BLAS ddot.

 

Thank you for the explanation Chilly Charly. That does make sense now.

 

I wonder then, which order is best?  Perhaps there is no consistent answer - each is slightly wrong.

0 Kudos
Message 4 of 9
(4,245 Views)

 


pauldavey wrote:

I wonder then, which order is best?  Perhaps there is no consistent answer - each is slightly wrong.


The optimal order depends on the data 😉
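As a small illustration of "depends on the data", the Python sketch below (an analogy, not LabVIEW code) sums one large value and several tiny ones. For this particular data set, adding the small values first is more accurate than adding the large value first. The data and the `sum_in_order` helper are made up for the example, and a different data set could favor a different order.

```python
import math


def sum_in_order(values):
    """Accumulate the values strictly left to right, like a simple loop."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [1.0] + [1e-16] * 10

large_first = sum_in_order(sorted(values, key=abs, reverse=True))
small_first = sum_in_order(sorted(values, key=abs))
reference = math.fsum(values)  # correctly rounded sum

print(large_first)   # each tiny term is lost when added to 1.0
print(small_first)   # tiny terms accumulate first, then survive the final add
print(reference)
```

When the order cannot be chosen, a correctly rounded or compensated summation routine is another option, at the cost of some speed.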

 

 

I have a timing question (altenbach?):

With the attached VI, where I have removed the wires crossing the frames, the timing difference between add array and add-in-loop is much higher (about 2x). Does this mean that there is some overhead in transferring data across frames?

And there is no significant difference between BLAS and dot product.

 

Chilly Charly    (aka CC)
0 Kudos
Message 5 of 9
(4,240 Views)

 


@chilly charly wrote:

 

With the attached VI, where I have removed the wires crossing the frames, the timing difference between add array and add-in-loop is much higher (about 2x). Does this mean that there is some overhead in transferring data across frames?


 

Interesting observation.

 

I'll take a wild guess: if you look at the buffer allocations, you can see that the buffer allocation for the second array now occurs right at the sequence boundary. Could it be that the allocation is now made before timing starts, falsifying the results?

 

(When the wires cross the frames as in the original example, the allocation occurs at the multiply node instead.)
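The general pitfall can be sketched in Python (an analogy only; LabVIEW's buffer allocation works differently, and the timings are not comparable). The hypothetical `dot_timed` function computes the same sum of products twice and only changes whether the intermediate buffer is allocated before or after the clock starts.

```python
import time


def dot_timed(a, b, allocate_inside_timing):
    """Sum of element-wise products, returning (result, elapsed seconds)."""
    n = len(a)
    if not allocate_inside_timing:
        products = [0.0] * n  # buffer allocated before the clock starts
    start = time.perf_counter()
    if allocate_inside_timing:
        products = [0.0] * n  # buffer allocation is part of the measurement
    for i in range(n):
        products[i] = a[i] * b[i]
    total = sum(products)
    elapsed = time.perf_counter() - start
    return total, elapsed


a = [0.5] * 200_000
b = [2.0] * 200_000
total_in, t_in = dot_timed(a, b, allocate_inside_timing=True)
total_out, t_out = dot_timed(a, b, allocate_inside_timing=False)

print(total_in == total_out)  # same numerical result either way
print(f"allocation timed: {t_in:.4f} s, allocation not timed: {t_out:.4f} s")
```

Both calls do the same work overall; only the second one hides the allocation from the measurement, so the two timings answer slightly different questions.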

Message 6 of 9
(4,230 Views)

 


altenbach wrote:

If you look at the buffer allocations, you see that now the buffer allocation for the second array occurs right at the sequence boundary. Could it be the allocation is now made before timing starts, falsifying the results?

 

(When the wires cross the frames as in the original example, the allocation occurs at the multiply node instead.)


Good catch. Buffer allocation is indeed different. But which is the falsified result?

 

Edit: doing additional timing tests

Chilly Charly    (aka CC)
0 Kudos
Message 7 of 9
(4,220 Views)

I think the falsified result is the fast one, because it hides the cost of the buffer allocation. The size of the array is known at the time the FOR loop starts running, way before the first tick is taken.

 

There are additional potential flaws (but I don't think they make a real difference). In your version, some of the output subtractions can be calculated after e.g. the third and fourth frame of the sequence, and the corresponding indicators can be scheduled for updates. Since this is already happening while the later frames are still executing, they could potentially compete for CPU with the later frames, making them look slower than expected.

0 Kudos
Message 8 of 9
(4,196 Views)

Some results just to feed the big beast here:

The tests were run on my PowerBook, using VMware Fusion and Windows XP.

I have tested 4 different wirings:

- outside then inside frame

- both outside

- inside then outside

- both inside

 

Depending on the wiring, there are three observed timing results: 60, 112 and 120 s.

 

The longest run corresponds to a buffer allocation at the multiply node, the intermediate timing is obtained when the buffer is allocated at the inside frame boundary, and the shortest when the allocation takes place at the external frame boundary.

 

At first sight, the way the buffers are allocated is not very consistent, but I hope advanced users can shed some light on it.

An additional test with a single calculation frame gave equivalent results.

 

[Four attached images]

 

 

Chilly Charly    (aka CC)
0 Kudos
Message 9 of 9
(4,185 Views)