
Annoyingly inconsistent execution times for averaging operation

Solved!
Go to solution

Part of an application I am developing requires taking a large-ish data sample (4.2 million 12-bit samples) and getting its average value.  It's the first step of a process whose goal is to increase the total energy found in a sample by manipulating some inputs.  I'm abstracting this partly because the details aren't important and partly because they're company-confidential.  So, just imagine a process like this:

 

Set up environment

Start loop {

Take sample

Determine energy contained in sample

Compare to previous energy values.  If a maximum appears to be reached, exit loop

Attempt to intelligently alter environment slightly in a way that increases energy

Repeat loop with new environment }

 

We may need to iterate through this loop thousands of times, so processing time is important: every 1 ms we cut from the calculation could save us 1-10 s each time we run the sequence, and we run the sequence enough times per day that the savings could be meaningful when this goes into full production.

 

The "Take sample" portion is handled by a 3rd party DLL that returns a 1D array of U16 to LabVIEW.  That part isn't something that can change.  The bold part is what I'm trying to make go as fast as possible.  I need to determine the total energy contained in the samples.  I am aware that I could do some decimation of the samples but at the moment we are trying to avoid that in case there are cyclic patterns that we could miss if some level of decimation was applied.

 

Because the data comes in as U16, it's not possible to just do a simple "Add Array Elements" as it quickly overflows.  The data needs to be added together as DBL, I64, or U64 to accommodate the largest total that could occur (~2^34, since there are 2^22 samples that could each contain a value as high as 2^12-1).
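
To put a number on "largest total that could occur", here's the worst-case arithmetic as a quick C check (nothing LabVIEW-specific, just the math):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Worst case: 2^22 samples, each at the 12-bit maximum of 2^12 - 1 = 4095. */
    uint64_t worst_case = (uint64_t)(1u << 22) * ((1u << 12) - 1u);

    printf("worst-case sum: %llu (~2^34)\n", (unsigned long long)worst_case);
    printf("U16 max:        %llu\n", (unsigned long long)UINT16_MAX); /* overflows almost immediately */
    printf("U32 max:        %llu\n", (unsigned long long)UINT32_MAX); /* still too small for the full sum */
    return 0;
}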

 

I've attached a VI I created (back-saved to 2018, but I am on 2021 if that matters) to do time trials with 7 different methods.  The case that is consistently the fastest is just a simple FOR loop, shown here:

Kyle97330_0-1747938382796.png

On my dev PC this completes in around 7 ms.  (The PC that will run this software live is much newer/faster, but I'm just trying to improve it relative to execution times on the same PC).
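
For anyone who can't open the attachment, the fast case is basically the textual equivalent of this (a rough C sketch of what the diagram does, not the VI itself):

#include <stddef.h>
#include <stdint.h>

/* Scalar equivalent of the "simple FOR loop" case: one running 64-bit sum. */
double mean_u16(const uint16_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];                 /* each element is widened before adding, so no overflow */
    return n ? (double)sum / (double)n : 0.0;
}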

 

One of the alternatives I tried was splitting the array so it would perhaps run on 2 cores to bring in some parallelization:

Kyle97330_1-1747938732697.png

This, along with another option that uses Decimation, usually ends up taking a lot longer (18-20 ms).
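
For reference, the split case is conceptually just this (a rough C sketch, with explicit two-way parallelism standing in for the two LabVIEW adding loops):

#include <stddef.h>
#include <stdint.h>
#include <omp.h>    /* compile with OpenMP enabled, e.g. -fopenmp */

/* Two halves summed independently (ideally on two cores), then combined. */
double mean_u16_split(const uint16_t *data, size_t n)
{
    uint64_t sum_a = 0, sum_b = 0;
    size_t half = n / 2;

    #pragma omp parallel sections
    {
        #pragma omp section
        for (size_t i = 0; i < half; i++)
            sum_a += data[i];
        #pragma omp section
        for (size_t i = half; i < n; i++)
            sum_b += data[i];
    }
    return n ? (double)(sum_a + sum_b) / (double)n : 0.0;
}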

 

But the main reason I am posting here is that on a few occasions, which I have been unable to repeat consistently, the "split" cases where I divide the array would instead run in about 4 ms, nearly a 50% reduction from the best time I can otherwise achieve.  This would happen a few times in a row, usually after not changing anything or after changing something in one of the other cases, but as soon as I tried to copy the code, or neaten it up before copying it, it would go back to being (relatively) slow at the previous 18-20 ms.

 

So the goals here are either:

  • Determine what makes the split-array + parallelization work quicker sometimes, and make it repeatable, OR
  • Find another calculation method that's faster (but doesn't rely on any form of decimation or sample size reduction)

Thanks!

Message 1 of 10
Solution
Accepted by topic author Kyle97330

Don't forget that you can parallelize your "adding loop".

 

Even though there is a shift register, the compiler recognizes the pattern and parallelizes it just fine. Faster than any of your alternatives on my laptop (similar for U32, I32, or DBL).

 

altenbach_0-1747942740721.png

 

This is efficient because there is no new allocation of the entire input array in another datatype, like in some of your other attempts.
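
In text form, the pattern it recognizes is just a sum reduction: each parallel instance keeps its own partial sum in its copy of the shift register, and the partial sums are combined at the end. A rough C/OpenMP analogue (only to illustrate the pattern, not how LabVIEW implements it):

#include <stddef.h>
#include <stdint.h>
#include <omp.h>    /* compile with OpenMP enabled, e.g. -fopenmp */

/* Parallel sum reduction: private partial sums, combined at the end. */
double mean_u16_reduction(const uint16_t *data, size_t n)
{
    double sum = 0.0;                   /* DBL accumulator, as in the diagram */

    #pragma omp parallel for reduction(+:sum)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
        sum += data[i];

    return n ? sum / (double)n : 0.0;
}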

 

Message 2 of 10

@Kyle97330 wrote:
  • Find another calculation method that's faster (but doesn't rely on any form of decimation or sample size reduction)

 


If I were to attack a similar problem, I would probably do something like this:

 

 

#include "MeanAVX.h"
#include <immintrin.h>

MEANAVX_API double Mean(TD1Hdl arg1) {
    TD1* array = *arg1;
    int32_t dimSize = array->dimSize;
    uint16_t* data = array->elt;

    if (dimSize == 0) return 0.0;

    uint64_t total_sum = 0;
    int i = 0;

    // Process in blocks to prevent 32-bit accumulator overflow
    const int block_size = 65536; // Max elements per block (sum <= 2^32)
    while (i < dimSize) {
        int block_end = i + block_size;
        if (block_end > dimSize) block_end = dimSize;
        __m256i sum_lo = _mm256_setzero_si256();
        __m256i sum_hi = _mm256_setzero_si256();
        // Vectorized summation within the block
        int j;
        for (j = i; j + 15 < block_end; j += 16) {
            __m256i chunk = _mm256_loadu_si256((__m256i*) & data[j]);
            __m128i low = _mm256_castsi256_si128(chunk);
            __m128i high = _mm256_extractf128_si256(chunk, 1);

            // Convert 16-bit to 32-bit and accumulate
            sum_lo = _mm256_add_epi32(sum_lo, _mm256_cvtepu16_epi32(low));
            sum_hi = _mm256_add_epi32(sum_hi, _mm256_cvtepu16_epi32(high));
        }
        // Horizontal sum of 32-bit accumulators
        uint32_t lo_parts[8], hi_parts[8];
        _mm256_storeu_si256((__m256i*)lo_parts, sum_lo);
        _mm256_storeu_si256((__m256i*)hi_parts, sum_hi);
        uint32_t block_sum = 0;
        for (int k = 0; k < 8; k++) {
            block_sum += lo_parts[k] + hi_parts[k];
        }
        // Add remaining elements in scalar loop
        for (; j < block_end; j++) {
            block_sum += data[j];
        }
        total_sum += block_sum;
        i = block_end;
    }
    return (double)total_sum / dimSize;
}

 

Less than a millisecond on my ancient i7-4940MX and faster than any of the others:

snippet.PNG

Downgraded to LV2020 + Visual Studio 2022; code is in the attachment, 64-bit only.

Message 3 of 10

@altenbach wrote:

Don't forget that you can parallelize your "adding loop".

 

Even though there is a shift register, the compiler recognizes the pattern and parallelizes it just fine. Faster than any of your alternatives on my laptop (similar for U32, I32, or DBL).

 

altenbach_0-1747942740721.png

 

This is efficient because there is no new allocation of the entire input array in another datatype, like in some of your other attempts.

 


I learned about parallelization of FOR loops a long time ago and I could have sworn that a shift register completely disabled parallelization, unconditionally.  Always good to be learning that there are exceptions.

 

This actually does work really well (reliably under 1 ms on my system).  I can't remember if I tried this at any point, but it has two key differences: it uses a DBL instead of a U64 and it adds the parallelism.  Using just one or the other does not help.

 

I still don't know why I got the occasional lower values on the other attempted methods, but with this method cutting the time down much farther than those other ones ever did, I'm no longer invested in caring.

 

Andrey_Dmitriev, that is an impressive block of code to drop, but I think Altenbach's solution is what I'll go with to avoid unnecessarily bringing Visual Studio and DLLs into this.

Message 4 of 10

Inspired by Andrey_Dmitriev, the version below is slightly faster on my computer (~0.2-0.3 ms) than the parallel add: break the array into parallel chunks, add in U32, then convert to double.

 

snip.png
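
In rough C terms the idea is the sketch below (assuming 12-bit data, so a U32 partial sum over a 65536-element chunk cannot overflow):

#include <stddef.h>
#include <stdint.h>
#include <omp.h>    /* compile with OpenMP enabled, e.g. -fopenmp */

/* Chunked variant: each chunk is summed in U32 (65536 * 4095 < 2^32),
   then the chunk totals are converted to double and combined. */
double mean_u16_chunked(const uint16_t *data, size_t n)
{
    const size_t chunk = 65536;
    size_t nchunks = (n + chunk - 1) / chunk;
    double total = 0.0;

    #pragma omp parallel for reduction(+:total)
    for (ptrdiff_t c = 0; c < (ptrdiff_t)nchunks; c++) {
        size_t start = (size_t)c * chunk;
        size_t end   = (start + chunk < n) ? start + chunk : n;
        uint32_t partial = 0;
        for (size_t i = start; i < end; i++)
            partial += data[i];         /* 32-bit adds are cheaper than widening to 64-bit */
        total += (double)partial;
    }
    return n ? total / (double)n : 0.0;
}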

Message 5 of 10

Is there a reason that the array is in a shift register? A plain tunnel seems equivalent.

 

(Posting by phone, cannot test).

Message 6 of 10

@Kyle97330 wrote:

 

Andrey_Dmitriev, that is an impressive block of code to drop, but I think Altenbach's solution is what I'll go with to avoid unnecessarily bringing Visual Studio and DLLs into this.


Thank you! 

From what I see, the single-threaded DLL is still faster than the parallel loop (which, of course, depends on the number of cores used). However, I can also put it into a parallel loop, and in that case, the LabVIEW code stands no chance of competing.

 

This is a common issue — not just limited to LabVIEW: the poor performance of modern software is often masked by multithreading on modern multi-core CPUs. Typically, I focus on optimizing the single-threaded version first. Once it's performing well, I consider running it in parallel — if really necessary and feasible.

 

When intensive computations are performed using native LabVIEW code on native LabVIEW arrays, the efficiency is quite low (depending on the algorithm). It might only utilize 10-20% of the CPU's potential: not in terms of CPU load, which may be close to 100%, but in terms of the CPU's actual instruction throughput per loop iteration.

 

On the other hand, premature, unnecessary, or overly aggressive optimization is the root of all evil.

Message 7 of 10

@altenbach wrote:

Is there a reason that the array is in a shift register? A plain tunnel seems equivalent.

 

(Posting by phone, cannot test).


It was faster on my computer with the shift register; I don't understand why. I was testing on an old computer.

Message 8 of 10

I gained 20% just by changing my Windows power setting from "Balanced" to "Best performance". 😄

Message 9 of 10

This one reduces the adding-loop time a bit more. On my computer the parallel adding loop averaged ~1.4 ms; this one is typically just under 1 ms, ~0.9 ms. No shift registers needed.

 

snip.png

Message 10 of 10