Got some interesting benchmarks on To Upper vs. "Branchless Programming"

Here's a version using the smallest possible map (60 entries), returning the input unchanged where no uppercase equivalent exists.

 

altenbach_2-1762527826575.png

 

Even though the map generation itself gets constant folded, the array LUT is faster.

 

 

Message 11 of 19

You can't predict whether your code will be branchless or not, because in LabVIEW you're too far away from the machine-code level. Modern optimizing compilers — like Intel oneAPI, Visual Studio, or gcc — can automatically generate branchless code in certain cases, but this doesn't apply to LabVIEW.
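As a side note, here is the kind of C code such compilers typically make branchless (a generic illustration, not taken from this thread): with gcc or clang at -O2, a ternary like this usually compiles to CMOV/SETcc rather than a conditional jump (never guaranteed, so always check the generated assembly).

// Hypothetical example: an optimizing compiler will usually emit a
// conditional move (CMOV) here instead of a branch.
int clamp_low(int x, int lo)
{
    return (x < lo) ? lo : x;
}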

 

Let's start with this LUT-based implementation. On my PC (Xeon w5-2445), it achieves a conversion speed of around 760 MB/s:

 

03.png

It's a major misconception to believe this code is truly branchless: it is not. Try building a Windows application from the code above and running it under a binary debugger like x64dbg. You'll see it immediately in this loop:

image-20251107170015240.png

As you can see, there are three branches and one jump. In total, this loop contains 27 assembly instructions!

 

In a "normal" programming language like C, you'd write the following code, which is fully equivalent to the LabVIEW implementation shown above:

 

image-20251107170251959.png
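(Since the code is shown as a screenshot, here is a minimal sketch of what such a C loop might look like; this is my reconstruction, not necessarily the exact code from the image:)

#include <stddef.h>
#include <stdint.h>

// Branchless ASCII to-upper over a byte buffer: the range check yields
// 0 or 1; shifted left by 5 it becomes 0 or 32, which is subtracted
// unconditionally, so there is no conditional jump inside the loop.
void to_upper(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t c = buf[i];
        buf[i] = c - (((c >= 'a') & (c <= 'z')) << 5);
    }
}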

Only six instructions and truly branchless code; no wonder it runs 3-4 times faster than the native LabVIEW For Loop:

04.png

There are other subtle differences as well; for example, the sub/jne instruction pair will be fused into a single micro-op by the CPU, while the inc/cmp/jl sequence emitted by LabVIEW will not, and so on.

 


@altenbach wrote: It was interesting that it was considerably faster to loop, convert to strings, then convert back to U8's rather than go straight 2D U8's into the function, even setting parallelization aside.

There’s no magic here either. We can use the Intel VTune Profiler to identify performance hotspots in the code.

 

This code is slower:

01.png

because internally it calls the CharUpperA library function for every single character:

image-20251107152434578.png

but this code is much faster:

02.png

because it uses the CharUpperBuffA function instead, which is much more efficient:

image-20251107153111220.png

That's it.

Message 12 of 19

Those are some interesting finds, @andrey! The compiler team should take a look and learn from this; there's no reason LV code should compile to 4x the number of instructions.

G# - Award winning reference based OOP for LV, for free! - Qestit VIPM GitHub

Qestit Systems
Certified-LabVIEW-Developer
Message 13 of 19

@Yamaeda wrote:

Those are some interesting finds, @andrey! The compiler team should take a look and learn from this; there's no reason LV code should compile to 4x the number of instructions.


No, unfortunately you can't simply avoid this overhead; it is there by design. Working with native LabVIEW arrays will always be slow (not extremely slow, but noticeably so).

Are you aware of what happens in these situations below?

image-20251111181733296.png

In the first snippet, we’re reading from an empty array; in the second, we’re reading outside the array bounds. What would you get in a “traditional” programming language? An exception! (Whether handled or not is another question.) And what do we get in LabVIEW? Silence — and a default value returned. Luxurious behavior.

The first two situations are guarded by conditions (1) and (2) below. The RCX register contains a pointer to the data; if it's zero, the code jumps (JE) over the read operation, avoiding an exception. If the index is outside the array, the next guard (JBE) protects you. In both cases, execution jumps to a "safe" area where RCX is filled with a "safe" address containing the default value. Only when everything is safe does LabVIEW load the required value (LEA, Load Effective Address) and jump over the default handling (JMP), shown as the solid blue line below.

Every time you use Index Array, Replace Array Subset, or similar primitives, you'll have a pair of JE/JBE checks on every iteration, which reduces overall performance (on top of the large overhead caused by relative addressing — everything lives in memory variables instead of registers).

image-20251111181706713.png

The third jump (JE) is triggered when you hit the red Abort button on the toolbar. There is a hidden global variable in memory: when the loop needs to be aborted, this variable is reset to zero, allowing the loop to exit; otherwise the loop runs until all iterations are complete.

Pseudocode — this is how LabVIEW works internally:

int arr[8] = {1};   // remaining elements are zero-initialized
int index = 10;     // out-of-bounds index

while (true) {      // this is how every LabVIEW loop looks inside
    if (arr && (index >= 0 && index < 8)) {   // pointer and bounds guards
        // Do something with arr[index]
    } else {
        // Fall back to the Default Value
    }
    if (abortFlag) {   // hidden global, cleared when Abort is pressed
        // Abort
        break;
    }
}

This is why hitting Abort is useless while execution is inside one of our own DLLs with a tight loop: that loop lacks any checks or code to exit on an external event (see the six-instruction assembly listing above).

 

I think I can provide some more "distilled" examples for different situations, but I’ll need a few hours to draw diagrams and capture the corresponding assembly code. Stay tuned for updates on this topic.

Message 14 of 19

This is incredibly interesting, thanks Andrey! I'm waaaay out of my depth here, but my very brief Googling indicates there are branchless ways to do array index bounds checking. If I understand right, you can use min/max functions (which can be branchless) to check on the array index. The problem is returning a known-good default value, rather than the first or last element of the array.

 

The "simple/naive" solution would be to stick a default value at the end of each array, and clamp the index to the actual length plus one, but changing how long ALL arrays are would likely be a massive issue. I do wonder if there's a clever way to get around that using some chained expressions, though.

 

I'm not sure how to handle the empty array and Abort button jumps...

 

Message 15 of 19

@BertMcMahan wrote:

my very brief Googling indicates there are branchless ways to do array index bounds checking. If I understand right, you can use min/max functions (which can be branchless) to check on the array index. The problem is returning a known-good default value, rather than the first or last element of the array.

...


In general, you don't need to check bounds on every iteration, nor do you need to check exit conditions inside high-intensity computation loops (which are usually relatively short in terms of execution time). Such checks should be done outside the loop, before entering it. However, you're right in general: some branchless design patterns are applicable here as well.
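For illustration, here is a minimal C sketch of the "sentinel default" idea from the previous post (hypothetical; this is not how LabVIEW actually works): the buffer owns one extra slot holding the default value, and an out-of-range index is clamped to that slot without a conditional jump.

#include <stdint.h>

// Hypothetical: buf has len real elements plus one sentinel slot at
// index len that holds the default value.
uint8_t read_or_default(const uint8_t *buf, int32_t len, int32_t idx)
{
    int32_t ok = (idx >= 0) & (idx < len); // 0 or 1, typically SETcc
    return buf[ok ? idx : len];            // typically CMOV, not a jump
}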

 

This is why I love the Rust programming language: it is extremely safe. Code like this simply will not compile:

fn main() {
    let arr: [u8; 8] = [1; 8];
    let _value = arr[8];

    let empty_arr: [u8; 0] = [];
    let _val_empty = empty_arr[0];
}

and the result:

C:\Users\Andrey\Desktop\rusty_arr>cargo build
   Compiling rusty_arr v0.1.0 (C:\Users\Andrey\Desktop\rusty_arr)
error: this operation will panic at runtime
 --> src\main.rs:3:18
  |
3 |     let _value = arr[8];
  |                  ^^^^^^ index out of bounds: the length is 8 but the index is 8
  |
  = note: `#[deny(unconditional_panic)]` on by default

error: this operation will panic at runtime
 --> src\main.rs:6:22
  |
6 |     let _val_empty = empty_arr[0];
  |                      ^^^^^^^^^^^^ index out of bounds: the length is 0

error: could not compile `rusty_arr` due to 2 previous errors

This is how it should work. It doesn't mean you'll always get completely bug-free code in Rust (logical errors can still lead to panics), but the compiler — with its strict compile-time checks and very strong static typing — will protect you by preventing potentially unsafe code from compiling. It does this not only by checking array bounds and preventing null-pointer dereferences, but also by avoiding double frees and use-after-free errors, ensuring overall memory safety (without a garbage collector, by the way), enforcing ownership and borrowing rules, preventing data races in concurrent code, etc.

Message 16 of 19

Thanks for sharing this information.

Message 17 of 19

@BertMcMahan wrote:

This is incredibly interesting, thanks Andrey!

 


You're welcome, and thank you for your interest and positive feedback! I'll provide some additional examples; these should also be interesting for other colleagues, and for me as well. (In general, most of the results are fairly obvious, but there are some interesting observations below.)

 

Ten very simple benchmark experiments (all tests performed on an i7-13850HX), spoiler:

 

Cover Screenshot 2025-11-12 14.34.17.png

Sidenote — all these experiments and benchmarks were performed on built applications, not within the Development Environment (and with debugging disabled as well).

 

Experiment 1 — Decrement with Auto Indexing

1 GiB data:

01.png

around 2.5 GB/s — not too bad. Here is the assembly code behind this loop (with my comments):

L0: mov rcx, qword ptr ds:[rsi+590]     ; Load pointer to data into RCX
    lea rdx, qword ptr ds:[rcx+1]       ; Increment index
    mov qword ptr ds:[rsi+590], rdx     ; Store incremented index back to [rsi+590]
    dec byte ptr ds:[rcx+1]             ; Decrement the byte at address RCX+1
    mov rcx, qword ptr ds:[rsi+B8]      ; Load pointer for exit check
    cmp dword ptr ds:[rcx+28], 0        ; Compare flag at offset [rcx+28] against 0
    je EXIT                             ; Jump to EXIT if flag is zero (loop abort)
    inc eax                             ; Increment for-loop counter
    cmp eax, dword ptr ds:[rsi+588]     ; Compare loop counter with iteration count
    jl L0                               ; Jump if less - Loop if not all iterations done

It's not a very efficient way to decrement values in a byte array, but honestly, it's not bad. However, it uses roughly twice the number of instructions you'd normally expect. Typically, you only need an incrementing pointer into the array, a decrement operation, and a loop-condition check. Memory should ideally be touched only twice — read the array element and write it back; the rest should be done in registers. Here, we're touching five different memory locations, almost one per instruction. Using LEA plus two moves just to increment the index is especially funny — three instructions instead of one.
On the positive side, this code is almost branchless (except for the single JE EXIT, which you can't avoid in any LabVIEW loop by design).
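For comparison, here is a sketch of the "ideal" scalar version described above: one moving pointer, one decrement, one loop check, with memory touched only for the element itself.

#include <stddef.h>
#include <stdint.h>

// Ideal scalar decrement loop: everything except the element access
// stays in registers.
void dec_bytes(uint8_t *p, size_t n)
{
    for (uint8_t *end = p + n; p != end; p++)
        (*p)--;   // read element, decrement, write back
}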

 

Experiment 2 — Index Array / Replace Array Subset

The approach shown below is not recommended. This code is not equivalent to the previous example and is more than twice as slow:

02.png

Here's why this loop is slow (this is the longest assembly listing in this comment, but don't worry):

L0: movsxd rcx,eax                  ; EAX - For loop counter
 	mov rdx,qword ptr ds:[rsi+588]	; |
 	test rdx,rdx                	; |
+---je L1                       	; | FIRST Check   
| 	mov rbx,qword ptr ds:[rdx]      ; | Index Array Check
| 	cmp dword ptr ds:[rbx],eax      ; |
+---jbe L1                          ; |
| 	lea rbx,qword ptr ds:[rcx+rbx+4]; |
| 	jmp L2                          ; |
L1:	lea rbx,qword ptr ds:[rsi+598]  ; Load default if out of range
L2:	mov qword ptr ds:[rsi+590],rbx 	; Here source is OK  
 	mov bl,byte ptr ds:[rbx]        ; get array value to be decremented
 	dec bl                          ; THIS IS OUR DECREMENT
 	test rdx,rdx                    
 	mov byte ptr ds:[rsi+599],bl    ; Write decremented value to the wire
+---je L3                           ; |
| 	mov rdx,qword ptr ds:[rdx]      ; | SECOND Check
| 	cmp dword ptr ds:[rdx],eax      ; | Same guard as above but for Replace Array Subset
+---jbe L3                          ; |
| 	mov bl,byte ptr ds:[rsi+599]    ; |
| 	mov byte ptr ds:[rcx+rdx+4],bl  ; Write decremented value to the array
L3: mov rcx,qword ptr ds:[rsi+B8]	;   
 	cmp dword ptr ds:[rcx+28],0  	; Check exit conditions   
 	je EXIT                       	;  
 	inc eax                         
 	cmp eax,dword ptr ds:[rsi+580]  
 	jl L0
EXIT:

I may be slightly off or wrong on some fine details of this "reverse engineering", but you can clearly see two JE/JBE pairs — one for Index Array and another for Replace Array Subset. And just look at the amount of code — 27 instructions!

By the way, I’m observing minor performance deviations between these two approaches (which are in theory semantically equivalent):

Screenshot 2025-11-12 12.40.03.png

I haven't found any differences in the assembly (at least between the timestamps), so if there is a penalty, it must come from somewhere else (every observed and reproducible behavior usually has a rational explanation). Maybe I'll investigate this later. I'll keep N connected for the moment.

 

Experiment 3 — In Place Structure

Now we can replace the previous combination with an In Place Element Structure. There are many discussions about whether In Place is faster and what its advantages are, so let's check. Indeed, it is faster — 1.49 GiB/s vs. 1.14 GiB/s in the previous example:

03.png

And here’s why it’s faster:

	L0:	mov dword ptr ds:[rsi+584],ebx	;|      
		mov rax,qword ptr ds:[rsi+588]	;|       
		test rax,rax                   	;| SINGLE Check     
	+-- je L1                          	;| Out of bounds check      
	|	mov rax,qword ptr ds:[rax]     	;| for Index Array     
	|	cmp dword ptr ds:[rax],ebx     	;|      
	+--	jbe L1                         	;|      
	|	movsxd rcx,ebx                  ;|     
	|	lea rdi,qword ptr ds:[rax+rcx+4]     
+--<--- jmp L2                          ;| The code always follows this path
|	L1:	mov rcx,qword ptr ds:[rsi+258]       
| 		lea rdi,qword ptr ds:[rsi+598]       
| 		mov rax,<lvrt.ResetToDefaultDataDSTM> ; Set Default value
| 		mov edx,11                           
| 		mov r8,rdi                           
| 		call rax                      ; Call ResetToDefaultDataDSTM function       
+-->L2:	mov qword ptr ds:[rsi+590],rdi       
 		dec byte ptr ds:[rdi] ; THIS IS OUR DECREMENT               
 		mov rax,qword ptr ds:[rsi+B8] ;       
 		cmp dword ptr ds:[rax+28],0   ; Abort Check      
 		je EXIT                       ;        
		inc ebx                              
 		cmp ebx,dword ptr ds:[rsi+580]       
 		jl L0   

The improvement comes from having only a single bounds check instead of two. An interesting detail: when retrieving a default value inside an In Place Element Structure, LabVIEW calls the function ResetToDefaultDataDSTM() from the Run-Time Engine. Normally this code is skipped, but if you force it by wiring -1 to the index, you'll incur a huge penalty — about 10× slower — because the function call introduces significant overhead:

image-20251112142642740.png

On the other hand, if you wire -1 to Index Array/Replace Array Subset, performance actually doubles — from 1 GB/s to over 2 GB/s — because only half the checks are executed (see the assembly from the previous experiment). This is not realistic code, just an illustration of how jumps affect performance:

image-20251112142317975.png

In summary, the single array check is a clear advantage of the In Place Element Structure.

 

Experiment 4 — Empty Loop with Shift Register

You've probably noticed that I've been using a Shift Register. Let's run it in an empty loop — it reaches nearly 20 GB/s:

04.png

And here's the assembly code behind it: just an empty loop with no data copy, and this is a normal speed for such a loop on this PC:

L0: mov rcx,qword ptr ds:[rsi+B8]   
	cmp dword ptr ds:[rcx+28],0     
	je EXIT                         
	inc eax                         
	cmp eax,dword ptr ds:[rsi+580]  
	jl L0
EXIT:     

Experiment 5 — Empty Loop with Tunnel + Buffer allocation

What happens if we replace the Shift Register with a Tunnel? Let’s try it (better with Buffer Allocation visualization enabled):

05.png

Performance drops from 20 GB/s to 5 GB/s due to data copying. The copy occurs just before the loop using rep movsb, which is really not very efficient:

mov rax, r11       ; Save r11 in rax for later restoration
mov r11, rdi       ; Move first argument (rdi) to r11 - save source pointer
mov rdi, rcx       ; Move third argument (rcx) to rdi - destination pointer for movsb
mov rcx, r8        ; Move fourth argument (r8) to rcx - count (number of bytes to move)
mov r8, rsi        ; Save rsi to r8 - preserve original rsi
mov rsi, r10       ; Move second argument (rsi) to rsi - source pointer for movsb
rep movsb          ; COPY HERE - Repeat move byte from [rsi] to [rdi] rcx times
mov rsi, r8        ; Restore original rsi
mov rdi, r11       ; Restore original rdi
ret                ; Return from function

A much faster way to copy buffers would use AVX SIMD instructions, moving 128 bytes per loop iteration:

; rdx = source pointer
; rdi = destination pointer
; Copies 128 bytes from [rdx] to [rdi] using AVX (YMM registers)

vmovdqa ymm1, [rdx]        ; Load 32 bytes from source into ymm1
vmovdqa ymm2, [rdx+32]     ; Load next 32 bytes into ymm2
vmovdqa ymm3, [rdx+64]     ; Load next 32 bytes into ymm3
vmovdqa ymm4, [rdx+96]     ; Load next 32 bytes into ymm4

vmovdqa [rdi], ymm1        ; Store 32 bytes from ymm1 to destination
vmovdqa [rdi+32], ymm2     ; Store next 32 bytes
vmovdqa [rdi+64], ymm3     ; Store next 32 bytes
vmovdqa [rdi+96], ymm4     ; Store next 32 bytes
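For reference, the same idea expressed with AVX intrinsics in C (a sketch; it assumes 32-byte-aligned buffers and a length divisible by 128, so a real implementation would need head/tail handling):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// 128 bytes per iteration via four 32-byte YMM loads and stores.
void copy_avx(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 128) {
        __m256i a = _mm256_load_si256((const __m256i *)(src + i));
        __m256i b = _mm256_load_si256((const __m256i *)(src + i + 32));
        __m256i c = _mm256_load_si256((const __m256i *)(src + i + 64));
        __m256i d = _mm256_load_si256((const __m256i *)(src + i + 96));
        _mm256_store_si256((__m256i *)(dst + i), a);
        _mm256_store_si256((__m256i *)(dst + i + 32), b);
        _mm256_store_si256((__m256i *)(dst + i + 64), c);
        _mm256_store_si256((__m256i *)(dst + i + 96), d);
    }
}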

I was unable to avoid this copy, even with "Always Copy"; therefore, the Shift Register is always preferred (but this is well known).

Conversely, here is the opposite situation — adding auto-indexing slightly increases performance (compare with the image from Experiment 2):

image-20251112144241078.png

I have no rational explanation.

What happens if we enable and use only auto-indexing?

05a.png

Performance is slightly lower than before.

By the way, in both cases of an "empty" For Loop, the loop could be eliminated entirely by optimization (a single iteration would suffice, or, in the case of a Shift Register, the loop could simply be bypassed). However, these loops remain in the executable and are executed.

So, here are two improvement points for NI: implement a more efficient buffer copy (e.g., AVX) and apply more aggressive loop optimization (such as folding).

Experiment 6 — Decrement on Array wire

Every LabVIEW engineer will tell you that the fastest way is to do it like this:

06.png

Indeed — close to 12 GB/s! Why so fast? Because this decrement is not implemented as LabVIEW diagram code; it's a library function called from the Run-Time Engine, and it looks like this:

	movdqa xmm2, xmmword ptr ds:[6251D790] ; Load xmm2 with 16 bytes, each byte set to 1
	mov eax, r9d                           ; Load loop counter into eax
	nop dword ptr ds:[rax+rax], eax        ; Align instruction boundary to 16 bytes !
L0:
  	movdqa xmm0, xmmword ptr ds:[rcx]      ; Load 16 bytes from src (rcx) into xmm0
  	movdqa xmm1, xmmword ptr ds:[rcx+10h]  ; Load next 16 (+0x10) from src into xmm1
  	psubb xmm0, xmm2                       ; Subtract 1 from each byte in xmm0 (dec)
  	movdqa xmmword ptr ds:[r8], xmm0       ; Store decremented 16 bytes back to dst
  	psubb xmm1, xmm2                       ; Subtract 1 from each byte in xmm1 (next)
  	movdqa xmmword ptr ds:[r8+10h], xmm1   ; Store decremented next 16 bytes back
  	add rcx, 20h                           ; Advance source pointer by 32 bytes (0x20)
  	add r8, 20h                            ; Advance dst pointer by 32 bytes (0x20)
  	sub eax, 1 ; Decrement loop counter
  	jne L0     ; Loop if not zero | These two instructions are fused into a single CPU micro-op

This is what high-performance code should look like: 32 bytes processed per iteration, no unnecessary bounds checks, no abort checks, minimal memory access, an aligned loop start, a fused SUB/JNE at the end, and interleaved psubb/movdqa for better pipeline utilization. No magic in why it achieves close to 12 GB/s.
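In C, the core of such a routine might look like this with SSE2 intrinsics (a sketch under the same assumptions: aligned data, length divisible by 16; the actual Run-Time code is the assembly above):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

// Decrement 16 bytes per iteration; _mm_sub_epi8 compiles to psubb.
void dec_bytes_simd(uint8_t *p, size_t n)
{
    const __m128i ones = _mm_set1_epi8(1);
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_load_si128((__m128i *)(p + i));
        _mm_store_si128((__m128i *)(p + i), _mm_sub_epi8(v, ones));
    }
}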

So, the final takeaway: use such primitives whenever possible.

Experiment 7 — Array Rotate

This is a quote from the discussion: "A question from the audience wondered about the efficiency of the Rotate Array function… <…> 'Rotate Array' could just mark the array as rotated." Let's check this as well:

07.png

When you see around 12 GB/s, it's clear that this is a copy operation (not just "mark as rotated"), but it's implemented quite efficiently (the maximum memory throughput on this PC is around 22 GB/s in the case of cache misses, which is what we have with a 1 GiB array). Take a look under the hood:

	mov r9,r8 ; Number of elements (without tail)
	shr r9,7 ; Divide by 128 
	nop word ptr ds:[rax+rax],ax ; Loop alignment - great !               
LOOP:
	movaps xmmword ptr ds:[rcx+10],xmm0 ; 16 bytes per copy    
	movaps xmmword ptr ds:[rcx],xmm1    ; and unrolled 8 times    
	movups xmm0,xmmword ptr ds:[rdx+rcx-10] ; rdx holds -1 (the rotate shift)
	movups xmm1,xmmword ptr ds:[rdx+rcx-20] ; unaligned copy with shift 
	sub rcx,80  ; Decrement address by 128 (0x80)                            
	movaps xmmword ptr ds:[rcx+70],xmm0     
	movaps xmmword ptr ds:[rcx+60],xmm1     
	movups xmm0,xmmword ptr ds:[rdx+rcx+50] 
	movups xmm1,xmmword ptr ds:[rdx+rcx+40] 
	dec r9 ; decrement loop counter                                 
	movaps xmmword ptr ds:[rcx+50],xmm0     
	movaps xmmword ptr ds:[rcx+40],xmm1     
	movups xmm0,xmmword ptr ds:[rdx+rcx+30] 
	movups xmm1,xmmword ptr ds:[rdx+rcx+20] 
	movaps xmmword ptr ds:[rcx+30],xmm0     
	movaps xmmword ptr ds:[rcx+20],xmm1     
	movups xmm0,xmmword ptr ds:[rdx+rcx+10] 
	movups xmm1,xmmword ptr ds:[rdx+rcx]    
jne LOOP ; branch based on the result of dec r9 above (target lvrt.62201A10)

So, it's essentially a shifted copy, but implemented in a very professional way: an 8× unrolled loop, SIMD instructions for parallel processing, and loop alignment for better performance.
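Conceptually, a rotate by one is nothing more than this scalar sketch; the Run-Time just performs the same shifted copy with SIMD and unrolling:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Rotate src by +1 into dst: the last element wraps around to the
// front, the rest is a shifted bulk copy.
void rotate_by_one(uint8_t *dst, const uint8_t *src, size_t n)
{
    if (n == 0) return;
    dst[0] = src[n - 1];
    memcpy(dst + 1, src, n - 1);
}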

The only minor point is the use of MOVAPS/MOVUPS (Move Aligned/Unaligned Packed Single-Precision Floating-Point Values), which are fast but not ideal for integer data. There’s still room for improvement using VMOVDQA (aligned integer moves).

 

Experiment 8 — Array Reverse

Let’s check this statement too: “For example, Reverse Array can just mark the array to be indexed from the back.”:

08a.png

Yes, that's correct — "Reverse Array" really does just mark the array as reversed, because its execution time is effectively zero (the resolution of LabVIEW's high-resolution timer is 100 ns). This demonstrates how smart LabVIEW's memory management is for arrays: instead of moving data, it sets internal Reverse/Transpose (but not Rotate!) flags, which is perfectly efficient.
However, not every subsequent primitive can handle a "reversed" array. If you insert Rotate Array afterward (or any other function that touches the raw elements, such as Increment/Decrement), performance immediately drops by a factor of 20! As shown above, Rotate Array normally runs at ~6 GB/s — but not in this case:

08b.png

Or, in other words:

Inverse+Rotate Screenshot 2025-11-12 14.59.33.png

Now, most of the time is spent in Reverse (or more precisely, somewhere between Reverse and Rotate). Here’s the machine code behind Reverse, which becomes the bottleneck:

LOOP:
	mov rax,qword ptr ss:[rsp+34]    
	mov rcx,qword ptr ss:[rsp+44]    
	movzx ecx,byte ptr ds:[rcx]      
	mov byte ptr ds:[rax],cl         
	movsxd rax,dword ptr ss:[rsp+40] ; load -1 into RAX from memory!
	add qword ptr ss:[rsp+44],rax    ; strange way to decrement the address: add -1
	inc qword ptr ss:[rsp+34]  ; Increment address      
	mov eax,dword ptr ss:[rsp+28]    
^	dec eax  ; DECREMENT Loop counter                        
|	mov dword ptr ss:[rsp+28],eax    
|	test eax,eax                     
+-<-jg LOOP   

And I could implement a reversed copy slightly faster — believe me.
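For instance, even a naive scalar reversed copy keeps both pointers and the counter in registers, unlike the stack-heavy listing above (a sketch):

#include <stddef.h>
#include <stdint.h>

// Straightforward reversed copy; a compiler keeps i, n, and both
// pointers in registers here.
void reverse_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[n - 1 - i];
}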

 

Experiment 9 — Array Transpose

In general, the same principle applies here: Array Transpose simply sets an internal Transpose Flag indicating that the array is transposed — at zero cost:

09.png

However, this magic disappears if we insert a Decrement afterward. Now performance drops to less than 100 MB/s, because the transpose operation physically moves the data byte by byte prior to decrementing (even though that’s completely unnecessary for a decrement):

09a.png

If you think decrement is inherently slower for 2D arrays — it’s not. Let’s check:

09b.png

It can operate at 12 GB/s, but combined with Transpose, performance falls by a factor of 100 — from 11.8 GB/s to 0.118 GB/s:

Transpose+Decrement Screenshot 2025-11-12 15.05.32.png

Intermediate Summary:

Transpose and Reverse work well and fast as long as all subsequent primitives support these flags.
Once a primitive that doesn’t recognize these flags touches the wire, the optimization disappears.

As a result, you cannot reliably predict performance or decide which LabVIEW code will give you a boost. The only way is to try, benchmark different approaches, and choose the best one. Effective use of cache memory is also important, but that’s outside the scope for now (hence the choice of a 100 MB array for these experiments).

Back to the original message — just one last thing:

 

Experiment 10 — Branched vs Branchless code

In general, this approach applies to the code from the first message. A clear indication that the code is branched is when execution time varies for different input data, depending on whether branches are taken. Let’s start with a simple benchmark (I fixed a small issue with swapped inputs in the original code):

10a.png

On my i7-13850HX, this code produces the following performance results:

 

Mixed data (interleaved: 8 letters followed by 8 numbers): ~1069 MiB/s
All letters (all branches taken): ~1304 MiB/s
All numbers (no branches taken, no subtraction): ~1696 MiB/s

 

This demonstrates how branch prediction works. In the first test, the predictor is disrupted by switching branches every 8 iterations. In the next two tests, the same branch is consistently taken (for All Letters) or not taken (for All Numbers). Since more computation requires more CPU time, there’s a difference of about 300 MiB/s between the tests.
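In C terms, the branchy variant corresponds to a loop like this (my sketch, mirroring the diagram above); the if statement is exactly the conditional branch the predictor has to guess on every iteration:

#include <stddef.h>
#include <stdint.h>

// Branchy to-upper: the branch is taken for letters and not taken for
// digits, which is why the input data pattern changes the throughput.
void to_upper_branchy(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 'a' && buf[i] <= 'z')
            buf[i] -= 32;
    }
}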

Assembly code with branch (JE L3):

L0:	mov rcx,qword ptr ds:[rsi+500]   
 	lea rdx,qword ptr ds:[rcx+1]     
 	mov qword ptr ds:[rsi+500],rdx   
 	mov cl,byte ptr ds:[rcx+1]       
 	cmp cl,byte ptr ds:[rsi+27F9]    
 	jbe L1                           
 	xor ecx,ecx                      
 	jmp L2                           
L1:	cmp cl,byte ptr ds:[rsi+27F8]
 	setae cl                         
L2:	mov byte ptr ds:[rsi+4F8],cl     
 	test cl,cl                       
+---je L3 ; THIS IS YOUR CONDITIONAL BRANCH                            
| 	mov rcx,qword ptr ds:[rsi+500]   
| 	add byte ptr ds:[rcx],E0 ; add -32 (0xE0): convert to uppercase
L3:	mov rcx,qword ptr ds:[rsi+B8]    
 	cmp dword ptr ds:[rcx+28],0      
 	je EXIT                          
 	inc eax                          
 	cmp eax,dword ptr ds:[rsi+4EC]   
 	jl L0
EXIT:

Now, your branchless version, ~1343 MiB/s:

10b.png

Assembly code behind it — no branch anymore:
L0:	mov rcx,qword ptr ds:[rsi+590]
	lea rdx,qword ptr ds:[rcx+1]     
	mov qword ptr ds:[rsi+590],rdx   
	mov cl,byte ptr ds:[rcx+1]       
	cmp cl,byte ptr ds:[rsi+6FA]     
	jbe L1                           
	xor ecx,ecx                      
	jmp L2                           
L1:	cmp cl,byte ptr ds:[rsi+6F9]  
	setae cl                         
L2:	mov byte ptr ds:[rsi+598],cl  
	shl cl,5 ; Multiply by 32 - good job, NI                         
	mov rdx,qword ptr ds:[rsi+590]   
	sub byte ptr ds:[rdx],cl ; Subtract 0 or 32        
	mov rcx,qword ptr ds:[rsi+B8] ; NO BRANCH    
	cmp dword ptr ds:[rcx+28],0      
	je EXIT                          
	inc eax                          
	cmp eax,dword ptr ds:[rsi+588]   
	jl L0
EXIT:

This branchless version demonstrates stable but average performance and is independent of the input data.

br.png

Summary — branchless programming is also beneficial in LabVIEW; however, for very high-performance computation, use custom DLLs (C code compiled with Visual Studio, Intel oneAPI, gcc, or maybe Rust) instead of native LabVIEW code.

The code used in this comment is attached (LV2017).

Andrey.

PS

I might be wrong on some points; please don’t hesitate to correct me.

Message 18 of 19

Very thorough investigations. Different parts of the results touch on a lot of different aspects of LV compilation. I'm not going to try to cover everything but I will touch on a couple.

 

Parallel for loops

You are correct that the amount of work in each iteration of a parallel for loop is a factor in the performance gain that is provided. However, other factors are bigger.

 

One consideration is oversubscription. CPUs are not infinitely parallel; they contain a finite number of cores. If there are more things to do than there are cores, then the cores bounce back and forth between all those tasks, creating overhead. A single parallel for loop is designed to avoid getting into that situation: by default, it never creates more parallel tasks than there are cores. However, we don't look for situations where multiple parallel for loops are running at the same time. If you call 2 subVIs in parallel that both contain parallel for loops, then they both may try to consume every core, so there are twice as many tasks to do as there are cores. By nesting parallel for loops, you've magnified that. The initial VI generates the data with 2 nested parallel for loops, which takes about 24 seconds for me. If I make the inner loop not parallel, then it takes less than 6.

 

If you make only the inner loop parallel, performance actually goes down. This isn't because of the size of the work in each iteration, though; it's because of the overhead of starting up and shutting down the parallel tasks. When you have nested loops, you pretty much always want the outer one to be parallel: that way you pay the start/stop synchronization overhead only once. When the inner loop is parallel, every iteration of the outer loop goes through all this overhead, which is a much bigger factor. I said before that with only the outer loop parallel, the data can be generated in about 6 seconds. If I swap the inputs to the 2 loops, the performance is about the same. So whether I'm doing 1000000 parallel runs of 1024 iterations or 1024 parallel runs of 1000000 iterations has very little impact.
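The same principle shows up in text-based languages; as an analogy (my illustration, not LabVIEW internals), in OpenMP you parallelize the outer loop so the fork/join overhead is paid once, whereas parallelizing the inner loop pays it on every outer iteration, and nesting both oversubscribes the cores:

#include <omp.h>
#include <stddef.h>

// Outer-loop parallelism: one fork/join for the whole computation.
void fill(double *data, int rows, int cols)
{
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            data[(size_t)i * cols + j] = (double)i * j;
}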

 

To Upper

Obviously our code generation here is less than optimal. There is no reason this operation should be slower on a byte array than on a string. I don't think we ever considered this node being used in time-critical situations, so it wasn't heavily scrutinized. I'll make sure this is noted and see if we can improve it in the future.

Message 19 of 19