05-26-2025 09:14 AM - edited 05-26-2025 09:16 AM
@mcduff wrote:
This one reduced the adding loop a bit more. On my computer the parallel adding loop had an average of ~1.4 ms; this one is typically just under 1 ms, ~0.9 ms. No shift registers needed.
Isn't it faster to Sum array on the integers and convert the result to DBL? Right, U16 ... then you'd need 2 conversions, might be slower in total.
05-26-2025 09:32 AM
@Yamaeda wrote:
@mcduff wrote:
This one reduced the adding loop a bit more. On my computer the parallel adding loop had an average of ~1.4 ms; this one is typically just under 1 ms, ~0.9 ms. No shift registers needed.
Isn't it faster to Sum array on the integers and convert the result to DBL? Right, U16 ... then you'd need 2 conversions, might be slower in total.
You need to convert the array first to avoid a potential overflow of the U16 datatype. You could convert to a different integer type instead, U32 or U64. Note that there are many ways to optimize this. This one was faster than the parallel add on my computer, but it used more memory. The parallel add is nice since it's fast and uses less memory.
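To make the overflow concern concrete, here is a minimal Python sketch (not LabVIEW code). It assumes that a U16 accumulator wraps around modulo 65536; the helper names `sum_wrapping_u16` and `sum_widened` are made up for this illustration.

```python
U16_MAX = 0xFFFF  # largest value a 16-bit unsigned integer can hold


def sum_wrapping_u16(values):
    """Sum while keeping the accumulator in 16 bits (assumed wrap-around behavior)."""
    total = 0
    for v in values:
        total = (total + v) & U16_MAX  # discard everything above bit 15
    return total


def sum_widened(values):
    """Sum in a wider type; Python integers never overflow."""
    return sum(values)


samples = [60000, 60000, 60000]  # each value fits in U16, the sum does not
wrapped = sum_wrapping_u16(samples)
widened = sum_widened(samples)
print(wrapped, widened)
```

Three values are already enough to exceed the U16 range, which is why the accumulator (or the array) has to be widened before summing.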
05-28-2025 05:57 AM
@Kyle97330 wrote:
@altenbach wrote:
Don't forget that you can parallelize your "adding loop".
Even though there is a shift register, the compiler recognizes the pattern and parallelizes it just fine. Faster than any of your alternatives on my laptop. (similar for U32, I32, or DBL)
This is efficient because there is no new allocation of the entire input array in another datatype like in some of your other attempts.
I learned about parallelization of FOR loops a long time ago and I could have sworn that a shift register completely disabled parallelization, unconditionally. Always good to be learning that there are exceptions.
This actually does work really well (reliably under 1 ms on my system). I can't remember if I tried this at any point but it has the key differences that it's both using a DBL instead of a U64 and also adding the parallelism. Using just one or just the other does not help.
I still don't know why I got the occasional lower values on the other attempted methods, but with this method cutting the time down much farther than those other ones ever did, I'm no longer invested in caring.
Andrey_Dmitriev, that is an impressive block of code to drop, but I think Altenbach's solution is what I'll go with to avoid unnecessarily bringing Visual Studio and DLLs into this.
Usually, a shift register forces the iterations to be processed sequentially and therefore can't be combined with parallelization (P) in a For Loop.
But when calculating a sum, the order to process iterations is not of concern e.g. 1+2+3 = 1+3+2
Instead of strictly processing each iteration sequentially, LabVIEW can distribute parts of the summation across multiple threads and then merge the partial sums at the end.
nice to know..!
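The "partial sums, then merge" idea described above can be sketched in Python. This only illustrates the structure, not how the LabVIEW compiler actually implements it; `parallel_sum` and `chunked` are hypothetical names, and Python threads will not show the speedup a compiled parallel loop gets.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split values into at most n_chunks contiguous slices."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, n_workers=4):
    """Sum each chunk independently, then merge the partial sums."""
    if not values:
        return 0
    chunks = chunked(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, chunks))  # one partial sum per chunk
    return sum(partials)  # final merge step


print(parallel_sum(list(range(1, 101))))
```

Each worker only touches its own chunk, so the only shared step is the final merge of a handful of partial sums.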
05-28-2025 08:29 AM
@alexderjuengere wrote:
But when calculating a sum, the order to process iterations is not of concern e.g. 1+2+3 = 1+3+2
In this particular case of integers yes, of course, but in general associativity fails in floating-point arithmetic, it is quite simple to demonstrate:
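A minimal Python sketch of the same point (not the poster's LabVIEW diagram): regrouping the same three additions changes the double-precision result, while the integer version is unaffected.

```python
# Same three operands, two different groupings (IEEE 754 double precision).
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Integer addition is associative, so the grouping does not matter.
int_left = (1 + 2) + 3
int_right = 1 + (2 + 3)
print(int_left == int_right)  # True
```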
05-28-2025 08:46 AM
@Andrey_Dmitriev wrote:
@alexderjuengere wrote:
But when calculating a sum, the order to process iterations is not of concern e.g. 1+2+3 = 1+3+2
In this particular case of integers yes, of course, but in general associativity fails in floating-point arithmetic, it is quite simple to demonstrate:
True, but you can't tell which is the "correct" result. Either they are practically the same, or you need to set up a different technique to get the sum.
05-28-2025 09:17 AM - edited 05-28-2025 09:18 AM
True, for DBL the order of operations can slightly alter the result, but both results are within the limits of the datatype for typical data. For pathological data, the results can differ dramatically. For example, if the first billion elements are 1 and the last is gigantic, then after reversing the order the gigantic number will come out as the result unchanged, because x+1=x for all additions.
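A scaled-down Python sketch of that pathological case, using 100,000 ones instead of a billion and 1e20 as the "gigantic" value (both chosen only for this illustration). `math.fsum` is shown as one example of a different summation technique that does not depend on the order.

```python
import math


def sequential_sum(values):
    """Plain left-to-right accumulation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


ones = [1.0] * 100_000
big = 1e20  # so large that big + 1.0 rounds back to big

big_first = sequential_sum([big] + ones)  # every +1.0 is absorbed
big_last = sequential_sum(ones + [big])   # the ones accumulate before big is added

print(big_first == big)      # True: the ones vanished
print(big_last > big_first)  # True: same data, different order, different sum

# math.fsum tracks the lost low-order bits, so its result is order-independent.
print(math.fsum([big] + ones) == math.fsum(ones + [big]))  # True
```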
Note that even for binary identical inputs, the parallel FOR loop might give slightly different results between runs.
This is a difficult problem to optimize. For example, one might think that the inner FOR loop could be tuned to operate on the 16-bit numbers without overflow, but that's not faster.
While summing in a shift register is "in place", autoindexing followed by the array sum can take better advantage of SIMD instructions at the cost of the array allocation. Further tuning might even consider the CPU cache sizes and CPU architecture. Yes, we can gain a few percent in speed by throwing much more code at it, but that also offers more places for bugs to hide.
Going down these rabbit holes might give a fantastic result, only to be worse on a different machine.
Unless we gain a factor of two, I always stay with the simplest solution, which in this case is my single parallel FOR loop. 😄
05-28-2025 10:31 AM
@altenbach wrote:
Unless we gain a factor of two, I always stay with the simplest solution, which in this case is my single parallel FOR loop. 😄
Fully agree. Unfortunately, it's not always easily possible. For example, if I have an array in a Shift Register, LabVIEW will understandably warn me: 'A For Loop is configured for parallel execution, and more than one iteration of the For Loop might access the same array element, with at least one potentially writing to it. Remove the dependence between loop iterations or disable parallelism on the For Loop.'
The problem is that, as the developer, I know there are no overlapping writes. But how can I explain that to LabVIEW and parallelize this part of the code in an elegant way to improve performance?
05-28-2025 11:59 AM - edited 05-28-2025 12:35 PM
You can parallelize the heavy lifting and stuff the turkey later.
On my rig: Your code 133ms, Code below: 25ms (same result!)
(Sorry, I reuploaded the attachment. The original had some garbage in it...)
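The attachment is not reproduced here. As a rough Python sketch of the general pattern (an assumption about the idea, not the attached VI): do the expensive per-iteration work in parallel with no shared array, then write the results into the array in a cheap sequential pass. `heavy_lifting` and `fill_results` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def heavy_lifting(row):
    """Hypothetical stand-in for the expensive per-iteration computation."""
    return sum(x * x for x in row)


def fill_results(rows, results, offset=0):
    """Compute in parallel, then store sequentially into an existing list."""
    with ThreadPoolExecutor() as pool:
        # Parallel part: each task reads only its own row, nothing is shared.
        computed = list(pool.map(heavy_lifting, rows))
    # Sequential part: the only place the shared array is written.
    for i, value in enumerate(computed):
        results[offset + i] = value
    return results


print(fill_results([[1, 2], [3]], [0, 0, 0], offset=1))
```

Because the array is only touched in the sequential pass, there is no cross-iteration dependency for the parallel part to trip over.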
05-28-2025 12:14 PM - edited 05-28-2025 02:03 PM
On a side note, it might help to write a median that operates directly in U16. Very long ago, we had the median coding challenge (~LabVIEW 7.0) and "quickselect" is not that hard to implement. Maybe I still have some old code somewhere... 😄
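For readers unfamiliar with quickselect, here is a minimal Python sketch of a median built on it. It is not the old challenge code mentioned above; the function names are made up, and this simple version copies the data rather than partitioning in place.

```python
def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    items = list(values)
    while True:
        pivot = items[len(items) // 2]
        lows = [x for x in items if x < pivot]
        n_equal = sum(1 for x in items if x == pivot)
        if k < len(lows):
            items = lows                      # answer lies below the pivot
        elif k < len(lows) + n_equal:
            return pivot                      # answer is the pivot itself
        else:
            k -= len(lows) + n_equal          # skip everything <= pivot
            items = [x for x in items if x > pivot]


def median_int(values):
    """Median of a non-empty sequence of integers (e.g. U16 samples)."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty sequence")
    if n % 2 == 1:
        return quickselect(values, n // 2)
    # Even length: average the two middle elements.
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2


print(median_int([5, 1, 4, 2, 3]))
print(median_int([4, 1, 3, 2]))
```

All comparisons stay on the integer values; a conversion to floating point only happens for the final average in the even-length case.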
05-28-2025 12:44 PM
I also noticed that my code is consistently about 10% faster if I insert an "always copy" as shown. Hard to tell what's going on.... or if it is even real. 😄