11-24-2023 05:36 PM
I learned long ago that floating-point division can be (much?) slower than multiplication, so I try to avoid it, especially in loops, whenever it can be expressed as a multiplication (e.g. using x*0.5 instead of x/2). Since LabVIEW can do simple arithmetic between arrays and scalars, such as multiplying or dividing each array element by the same number, I assumed these primitive functions were heavily optimized in the compiled code. For example, when an array is divided by a scalar, I expected each element to be multiplied by the reciprocal instead of being explicitly divided - this is also implied in the way the Compound Arithmetic function is represented (division is displayed graphically as multiplication by an "inverted", i.e. reciprocal, input).
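In text form, the transformation I had in mind is roughly the following (a hypothetical C sketch of what I expected the compiler to do, not LabVIEW's actual generated code):

```c
/* Hypothetical sketch of the expected strength reduction, not actual LabVIEW output. */
void divide_array(double *x, int n, double d)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] / d;        /* naive: one division per element */
}

void divide_array_recip(double *x, int n, double d)
{
    const double r = 1.0 / d;   /* one division up front... */
    for (int i = 0; i < n; i++)
        x[i] = x[i] * r;        /* ...then only multiplications in the loop */
}
```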
However, a simple performance test shows that's apparently not the case:
The division operator is ~15% slower than taking the reciprocal of the scalar and multiplying the array by it. Is this a missed opportunity for compiler optimization, or is it "by design"? If it's the latter, what's the reasoning behind it? Could the two ways ever differ by more than the machine epsilon? Or is it a CPU-specific thing?
11-24-2023 10:08 PM
Benchmarking is hard to do right, and since you did not attach your code we cannot tell.
For example, that 1/42 will get constant-folded by the compiler. Also, since there is no used output, dead code elimination will turn the entire upper code into nothing, at least once debugging is disabled.
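For readers following along, here is a rough C analogue of those two pitfalls (a sketch, not the OP's actual VI):

```c
/* Sketch of the two benchmarking pitfalls: constant folding and dead code elimination. */
double benchmark_loop(const double *x, int n)
{
    double sum = 0.0;
    const double r = 1.0 / 42.0;    /* folded to a constant at compile time */
    for (int i = 0; i < n; i++)
        sum += x[i] * r;
    return 0.0;                     /* 'sum' is never used, so an optimizer is free
                                       to delete the entire loop as dead code */
}
```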
So far, all we can say is inconclusive!
11-25-2023 07:23 AM - edited 11-25-2023 07:26 AM
Sure, here's the code (LV2023); I also made it a bit more "convenient" to switch between division and multiplication. It's a simple "new VI" with default settings, i.e. debugging enabled, so dead code elimination doesn't kick in and it makes no difference that the output is unused. The performance difference stays the same with debugging disabled and something connected to the output (e.g. an indicator).
My question still stands: why doesn't the compiler convert a division of an array by a scalar into a multiplication of the array by the reciprocal? There's clearly a performance penalty for explicitly dividing each array element by the same number, so I'd like to know the reasoning behind it - surely it's a deliberate decision, right?
11-25-2023 11:41 AM
The reason is that the result is not strictly the same, since you are dealing with limited precision: x/y is rounded once, while x*(1/y) rounds twice (first the reciprocal, then the product), so the last bit can differ. IEEE 754 requires division to be correctly rounded, which is why compilers don't make this substitution by default.
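A quick way to convince yourself (a hedged sketch in C, since LabVIEW diagrams don't paste into a post): compare x/y with x*(1/y) for random doubles and count how often the two results differ - on typical hardware a noticeable fraction of pairs disagree in the last bit.

```c
/* Sketch: count how often x/y and x*(1.0/y) give different doubles.
   The input range and iteration count are arbitrary, just for illustration. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int N = 1000000;
    int mismatches = 0;
    for (int i = 0; i < N; i++) {
        double x = (double)rand() / RAND_MAX + 1.0;  /* keep values away from zero */
        double y = (double)rand() / RAND_MAX + 1.0;
        if (x / y != x * (1.0 / y))
            mismatches++;
    }
    printf("%d of %d results differ\n", mismatches, N);
    return 0;
}
```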
GCC has the option -ffast-math that allows this transformation and also exploits things like associativity. LabVIEW does not give you access to compiler flags, so there is no way to change this. Up to 2016 or so, the most NI offered was an option to disable SSE2, which is now enabled by default. It would be nice if they added a checkbox for AVX.
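For example (assuming GCC on x86-64; the file name is made up), the rewrite shows up in the generated assembly:

```c
/* scale.c - with -ffast-math (which implies -freciprocal-math), GCC is allowed
   to hoist 1.0/s out of the loop and replace the per-element divide with a multiply. */
void scale(double *x, int n, double s)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] / s;
}
```

Compiling with gcc -O2 -S scale.c and again with gcc -O2 -ffast-math -S scale.c and comparing the two .s files should show the division inside the loop turning into a multiplication.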
11-27-2023 12:25 PM
Finally had a chance to use a version that can open your VI.
Using LabVIEW 2023, the difference is insignificant (debugging enabled or disabled). This is on a VM.
11-27-2023 12:56 PM - edited 11-27-2023 12:59 PM
@altenbach wrote:
Finally had a chance to use a version that can open your VI.
Using LabVIEW 2023, the difference is insignificant (debugging enabled or disabled). This is on a VM.
Far be it from me to actually challenge one of Altenbach's Benchmarks 😵 but,
I'm not sure dice rolls are valid for the divisor or dividend when testing the speed difference between X*(1/Y) and X/Y. The compiler knows X and Y are in the range 0 < value <= 1, which is only a tiny part of the range of an IEEE 754 32-bit float. Wired array controls would eliminate that possible compiler optimization.
Of course, I can be wrong.
11-27-2023 12:58 PM - edited 11-27-2023 01:04 PM
Interesting... Can it be CPU-related? I tested it on Win11 with a 13900K using only P-cores for the LabVIEW process. When enabling all cores for LabVIEW, there is a huge jitter because the "thread director" doesn't know how to handle LabVIEW properly, which leads to P-cores and E-cores being randomly assigned on each iteration, but the difference in performance is still visible despite the jitter.
Changing the constant to a random number doesn't make a difference, and alternating between the two modes on each iteration doesn't change the behavior either (the measured time just alternates between the two values in the first post). I'll try to test it with other CPUs...
11-27-2023 01:06 PM
@Novgorod wrote:
Interesting... Can it be CPU-related? I tested it on Win11 with a 13900K using only P-cores for the LabVIEW process. When enabling all cores for LabVIEW, there is a huge jitter because the "thread director" doesn't know how to handle LabVIEW properly, which leads to P-cores and E-cores being randomly assigned on each iteration, but the difference in performance is still visible despite the jitter.
Changing the constant to a random number doesn't make a difference, and alternating between the two modes on each iteration doesn't change the behavior either (the measured time just alternates between the two values in the first post). I'll try to test it with other CPUs...
It might be interesting to check the VI property "Last compiled with", which is influenced by the ini option for compiler optimizations (I always set that slider to 10 "Rarely Limit" from the default value of 5).
11-27-2023 01:17 PM - edited 11-27-2023 01:18 PM
@JÞB wrote:
It might be interesting to check the VI property "Last compiled with", which is influenced by the ini option for compiler optimizations (I always set that slider to 10 "Rarely Limit" from the default value of 5).
The VI property says "Full compiler optimizations" and everything should be at its defaults. I've also now changed the optimization slider from 5 to 10 (and re-saved the VI), which didn't change the behavior shown in the first post.
11-27-2023 01:39 PM - edited 11-27-2023 01:41 PM
I've tested it a bit more on some other computers and LabVIEW versions (I attached the 2018 version) - the benchmark program is identical to the one in the first post, only the constant is replaced with a random number (altenbach's example). I got a performance difference on all computers I tested, regardless of LabVIEW version, Windows version and bitness:
- A VM on the same PC (13900K) running Windows 7 and LV 2018: pretty much identical results.
- A weak 8th-gen laptop CPU (i5-8520U) on Win10 and LV 2018: same difference between modes, but more jitter due to the random number generation (the difference is clearer with a constant).
- A (very) old Xeon E5-2680 v3 (dual socket) on Windows Server and LV 2023: by far the biggest difference (~300%!) between modes: