LabVIEW


Annoyingly inconsistent execution times for averaging operation

Solved!

@mcduff wrote:

This one reduced the adding loop a bit more. On my computer the parallel adding loop averaged ~1.4 ms; this one is typically just under 1 ms, ~0.9 ms. No shift registers needed.

 

snip.png


Isn't it faster to Sum array on the integers and convert the result to DBL? Right, U16 ... then you'd need 2 conversions, might be slower in total.

G# - Award winning reference based OOP for LV, for free! - Qestit VIPM GitHub

Qestit Systems
Certified-LabVIEW-Developer
Message 11 of 23

@Yamaeda wrote:

@mcduff wrote:

This one reduced the adding loop a bit more. On my computer the parallel adding loop averaged ~1.4 ms; this one is typically just under 1 ms, ~0.9 ms. No shift registers needed.


Isn't it faster to Sum array on the integers and convert the result to DBL? Right, U16 ... then you'd need 2 conversions, might be slower in total.


You need to convert the array first to avoid a potential overflow of the U16 datatype. You could instead convert to a wider integer type, U32 or U64. There are many ways to optimize this. This one was faster than the parallel add on my computer, but it used more memory. The parallel add is nice since it's fast and uses less memory.
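
To make the overflow point concrete, here is a rough text-language sketch in Python (LabVIEW diagrams can't be pasted as text). It assumes a U16 accumulator wraps around on overflow and emulates that with a 16-bit mask; the function names are made up for illustration.

```python
U16_MAX = 0xFFFF  # largest value a U16 can hold (65535)


def sum_wrapping_u16(samples):
    """Sum with an accumulator that is emulated as U16 (wraps on overflow)."""
    total = 0
    for s in samples:
        total = (total + s) & U16_MAX  # keep only the low 16 bits
    return total


def sum_widened(samples):
    """Sum with a wide accumulator (stands in for U32, U64 or DBL)."""
    total = 0  # Python ints are unbounded, so this never wraps
    for s in samples:
        total += s
    return total


samples = [60000, 60000, 60000]
print(sum_wrapping_u16(samples))  # 48928 (wrapped twice, wrong)
print(sum_widened(samples))       # 180000 (correct)
```

Three large samples are already enough to wrap a 16-bit accumulator, which is why the widening has to happen before (or during) the summation, not after it.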

Message 12 of 23

@Kyle97330 wrote:

@altenbach wrote:

Don't forget that you can parallelize your "adding loop".

 

Even though there is a shift register, the compiler recognizes the pattern and parallelizes it just fine. Faster than any of your alternatives on my laptop. (similar for U32, I32, or DBL)

 

altenbach_0-1747942740721.png

 

This is efficient because there is no new allocation of the entire input array in another datatype like in some of your other attempts.

 


I learned about parallelization of FOR loops a long time ago and I could have sworn that a shift register completely disabled parallelization, unconditionally.  Always good to be learning that there are exceptions.

 

This actually does work really well (reliably under 1 ms on my system).  I can't remember if I tried this at any point but it has the key differences that it's both using a DBL instead of a U64 and also adding the parallelism.  Using just one or just the other does not help.

 

I still don't know why I got the occasional lower values on the other attempted methods, but with this method cutting the time down much farther than those other ones ever did, I'm no longer invested in caring.

 

Andrey_Dmitriev, that is an impressive block of code to drop, but I think Altenbach's solution is what I'll go with to avoid unnecessarily bringing Visual Studio and DLLs into this.


Usually, a shift register forces the iterations to be processed sequentially and therefore can't be combined with For Loop parallelization (the P terminal).

But when calculating a sum, the order to process iterations is not of concern e.g. 1+2+3 = 1+3+2

 

Instead of strictly processing each iteration sequentially, LabVIEW can distribute parts of the summation across multiple threads and then merge the partial sums at the end.
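
As a rough text-language illustration of that idea (a Python sketch, not what the LabVIEW compiler actually generates): split the data into chunks, sum each chunk independently, then merge the partial sums. The helper names and the worker count of 4 are assumptions for the example, and Python threads will not give a real speedup here; the point is only the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, parts):
    """Split data into at most `parts` contiguous chunks of similar size."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, workers=4):
    """Sum integers by computing partial sums in workers and merging them."""
    chunks = chunked(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))  # one partial sum per chunk
    return sum(partials)  # merge step


print(parallel_sum(list(range(1, 1001))))  # 500500
```

Each worker keeps its own running total, so there is no shared state until the final merge. For integers the merged result is identical to the sequential sum.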

 

nice to know..!

Message 13 of 23

@alexderjuengere wrote:

But when calculating a sum, the order to process iterations is not of concern e.g. 1+2+3 = 1+3+2

 


In this particular case of integers, yes, of course. But in general, associativity fails in floating-point arithmetic, which is quite simple to demonstrate:

 

ass.png
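
For readers who can't see the screenshot, the same effect can be reproduced in a few lines of Python with IEEE 754 doubles (the equivalent of DBL). This is a separate minimal example, not a transcription of the diagram in the image.

```python
# Same three values, different grouping of the additions.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```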

Message 14 of 23

@Andrey_Dmitriev wrote:

@alexderjuengere wrote:

But when calculating a sum, the order to process iterations is not of concern e.g. 1+2+3 = 1+3+2

 


In this particular case of integers, yes, of course. But in general, associativity fails in floating-point arithmetic, which is quite simple to demonstrate:

 

ass.png


True, but you can't tell which is the "correct" result. Either they are practically the same, or you need to set up a different technique to get the sum.
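
One possible "different technique" is compensated (Kahan) summation, sketched below in Python as an illustration; nobody in the thread proposed this specific method. `math.fsum` is Python's own correctly rounded sum and is used here only as a reference, and the naive loop is written out explicitly so the comparison does not depend on how the built-in `sum` is implemented.

```python
import math


def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan summation: carry the rounding error along in `comp`."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp            # apply the correction to the next input
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was lost
        total = t
    return total


data = [0.1] * 10
print(naive_sum(data))  # 0.9999999999999999
print(math.fsum(data))  # 1.0
print(kahan_sum(data))
```

The extra arithmetic per element costs time, so this only pays off when the accuracy of the sum actually matters more than the speed.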

Paolo
-------------------
LV 7.1, 2011, 2017, 2019, 2021
Message 15 of 23

True, for DBL the order of operations can slightly alter the result, but both results are within the limits of the datatype for typical data. For pathological data, the results can differ dramatically. For example, if the first billion elements are 1 and the last is gigantic, then after reversing the order the result will be the gigantic number, unchanged, because x+1=x for all additions.
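
A scaled-down version of that pathological case, sketched in Python with doubles. The "gigantic" value is assumed to be 2^53 (the point where adding 1.0 no longer changes a double) and a thousand ones stand in for the billion.

```python
def sum_in_order(values):
    """Plain left-to-right accumulation, like a shift register sum."""
    total = 0.0
    for v in values:
        total += v
    return total


big = 2.0 ** 53       # at this magnitude, big + 1.0 == big
ones = [1.0] * 1000

small_first = sum_in_order(ones + [big])  # ones accumulate, then big is added
big_first = sum_in_order([big] + ones)    # every +1.0 is rounded away

print(big + 1.0 == big)         # True
print(big_first == big)         # True
print(small_first - big_first)  # 1000.0
```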

 

Note that even for binary identical inputs, the parallel FOR loop might give slightly different results between runs.

 

This is a difficult problem to optimize. For example, one might think that the inner FOR loop could be tuned to operate on the 16-bit numbers without overflow, but that's not faster.

 

While summing in a shift register is "in place", autoindexing followed by the array sum can take better advantage of SIMD instructions at the cost of the array allocation. Further tuning might even consider the CPU cache sizes and CPU architecture. Yes, we can gain a few percent in speed by throwing much more code at it, but that also offers more places for bugs to hide.

 

Going down these rabbit holes might give a fantastic result, only to be worse on a different machine.

 

Unless we gain a factor of two, I always stay with the simplest solution, which in this case is my single parallel FOR loop. 😄

 

 

Message 16 of 23

@altenbach wrote:

 

Unless we gain a factor of two, I always stay with the simplest solution, which in this case is my single parallel FOR loop. 😄

 

 


Fully agree. Unfortunately, it's not always easily possible. For example, if I have an array in a Shift Register, LabVIEW will understandably warn me: 'A For Loop is configured for parallel execution, and more than one iteration of the For Loop might access the same array element, with at least one potentially writing to it. Remove the dependence between loop iterations or disable parallelism on the For Loop.'

Screenshot 2025-05-28 17.24.06.png

The problem is that, as the developer, I know there are no overlapping writes. But how can I explain that to LabVIEW and parallelize this part of the code in an elegant way to improve performance?

Message 17 of 23

You can parallelize the heavy lifting and stuff the turkey later.

 

On my rig: Your code 133ms, Code below: 25ms (same result!)

 

altenbach_2-1748451502456.png

 

 

 

altenbach_0-1748451355349.png

 

(Sorry, I reuploaded the attachment. The original had some garbage in it...)
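
In text form, the general pattern looks roughly like the Python sketch below. It is not a transcription of the diagrams in the screenshots: `heavy` is a hypothetical stand-in for the expensive per-iteration work, and the worker count is an assumption. The parallel part only computes (index, value) pairs and never touches the shared array; the cheap write-back happens afterwards in a plain sequential loop.

```python
from concurrent.futures import ThreadPoolExecutor


def heavy(i):
    """Hypothetical expensive computation; returns (target index, value)."""
    return i, i * i


def fill_sequential(n):
    """Original shape: compute and write into the array in one loop."""
    out = [0] * n
    for i in range(n):
        idx, val = heavy(i)
        out[idx] = val
    return out


def fill_parallel_then_write(n, workers=4):
    """Compute in parallel, then write the results back sequentially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(heavy, range(n)))  # no shared writes here
    out = [0] * n
    for idx, val in results:  # cheap sequential write-back
        out[idx] = val
    return out


print(fill_parallel_then_write(5))  # [0, 1, 4, 9, 16]
```

Because the iterations no longer share the array, there is nothing for the dependency check to complain about.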

 

Message 18 of 23

On a side note, it might help to write a median that operates directly in U16. Very long ago, we had the median coding challenge (~LabVIEW 7.0) and "quickselect" is not that hard to implement. Maybe I still have some old code somewhere... 😄
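
For anyone who hasn't seen it, here is a minimal quickselect sketch in Python. It is not the old challenge code, just one common variant (random pivot, three-way partition); the function names and the choice to average the two middle values for even lengths are assumptions for the example. It works on plain integers, so U16 data never needs converting to DBL.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) without a full sort."""
    items = list(values)  # work on a copy; the input is left untouched
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    lo, hi = 0, len(items) - 1
    while True:
        if lo == hi:
            return items[lo]
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi]: < pivot, == pivot, > pivot
        i, lt, gt = lo, lo, hi
        while i <= gt:
            if items[i] < pivot:
                items[i], items[lt] = items[lt], items[i]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        if k < lt:
            hi = lt - 1   # answer is among the smaller elements
        elif k > gt:
            lo = gt + 1   # answer is among the larger elements
        else:
            return pivot  # k falls inside the block equal to the pivot


def median_u16(values):
    """Median of integer samples; averages the two middle values if even."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty input")
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2


print(median_u16([5, 1, 4, 2, 3]))  # 3
print(median_u16([4, 1, 3, 2]))     # 2.5
```

Only the part of the array that can still contain the k-th element is processed in each pass, which is why this is O(n) on average instead of the O(n log n) of a full sort.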

Message 19 of 23

I also noticed that my code is consistently about 10% faster if I insert an "always copy" as shown. Hard to tell what's going on... or if it is even real. 😄

 

altenbach_0-1748454184187.png

 

Message 20 of 23