LabVIEW

cancel
Showing results for 
Search instead for 
Did you mean: 

Got some interesting benchmarks on To Upper vs. "Branchless Programming"

So I was watching this video on branchless programming in C and decided to try it out in LabVIEW just for fun. I think I have benchmarked things decently well, but I recognize that's always a spicy topic 🙂

 

Anyway, the claim is that conditional operations like If statements cause the processor to not be able to "guess ahead" of what upcoming steps will be, and it slows things down. With careful algorithm tweaks (and a MASSIVE hit to readability!), you can speed things up by basically letting the processor leverage quicker assembly-level commands. A conditional Jump, the video claims, is slower than a conditional Move Value in Assembly land.

 

So, the setup is 1 million rows of 1024 characters to be run through a To Uppercase function. I made this array out of U8's, then sent it through a variety of different "To Upper" algorithms. Some interesting results:

 

1- For loops benefit by about 10x by parallelizing the outer loop (the one operating on 1 million rows). You take a significant penalty if you parallelize the inner loop as well, since it's only 1024 elements. (Pretty common knowledge.)

2- Using In Range? to check for a capital letter ASCII code, followed by a Case Structure to subtract 32 from the number if it's lowercase, took 5.3 seconds without parallelization/0.47 seconds with parallelization.

 

BertMcMahan_0-1762455392249.png

 

3- Using In Range? to again check for capital ASCII, converting that to a Boolean, multiplying that by 32, and subtracting that number from ALL items got me over 2x speed improvement, down to 2.5 seconds (0.23 seconds parallel).

BertMcMahan_1-1762455431056.png

4- Using the built-in To Uppercase? function, which can operate directly on a 2D U8 array, took a whopping 26 seconds to execute.

BertMcMahan_2-1762455465953.png

5- Pre-creating an array to use as a lookup table was about the same as the "Branchless" code above, at 2.5 seconds.

BertMcMahan_3-1762455517789.png

 

6- Doing the same with a Map was the worst case at 40.6 seconds.

BertMcMahan_4-1762455542614.png

7- Running a For loop, casting each row to a string, running the built-in To Upper, then casting back, all in parallel, took 0.70 seconds (assuming the non-parallel version had a similar gain to above, that puts the single-loop version at about 7 seconds).

BertMcMahan_5-1762455584718.png

 

 

tl,dr: The built-in To Upper function isn't particularly well optimized for U8's, according to my benchmarks. It's better with strings, but still isn't that great, being slower than an If statement. The fastest method is to either use a lookup table or use the cryptic "branchless" style of coding.

 

I'd have thought To Upper was probably already a LUT, but it's not acting like one.

 

Ugly code attached, backsaved to 2021. I welcome criticism of my benchmarking techniques. Debugging was disabled. No changes to execution priority.

0 Kudos
Message 1 of 12
(285 Views)

Hi Bert,

 


@BertMcMahan wrote:

tl,dr: The built-in To Upper function isn't particularly well optimized for U8's, according to my benchmarks. It's better with strings, but still isn't that great, being slower than an If statement. The fastest method is to either use a lookup table or use the cryptic "branchless" style of coding.

 

I'd have thought To Upper was probably already a LUT, but it's not acting like one.


ToUpper does a lot more than just subtracting 0x20 for U8 values between 0x61…0x7A, as shown with German umlauts:

So you cannot expect it to run "fastest"…

 

I typically use the "cryptic branchless code" when converting hex-formatted strings into U8 values. (I still expect uppercase A…F in such strings.)

Best regards,
GerdW


using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019
0 Kudos
Message 2 of 12
(277 Views)

Here's what I might do 😄

altenbach_0-1762460095968.png

 

...and for some parallelization, retain the loop:

 

altenbach_0-1762460347129.png

 

(TuUpper understands U8 arrays directly)

 

(Cannot test with your default inputs, my VM is running out of memory... 😮 )

0 Kudos
Message 3 of 12
(243 Views)

@GerdW wrote:

ToUpper does a lot more than just subtracting 0x20 for U8 values between 0x61…0x7A, as shown with German umlauts:


Here's the correct way to do the LUT (Yes, you can make it shorter but it seems better to return the eight bit codes unchanged instead of replacing them with zero). Creating the LUT is peanuts (and probably gets constant folded).

 

Parallelizing the outer loop of course depends on the number of cores available. But here's a fast solution:

 

altenbach_0-1762461272354.png

 

A map is no advantage, because the LUT is already sorted by index.

 

0 Kudos
Message 4 of 12
(233 Views)

@altenbach wrote:

Here's what I might do 😄

altenbach_0-1762460095968.png

 

...and for some parallelization, retain the loop:

 

altenbach_0-1762460347129.png

 

(TuUpper understands U8 arrays directly)

 

(Cannot test with your default inputs, my VM is running out of memory... 😮 )


Yeah, I tried that one (#4). It wasn't very fast, but of course I forgot about other languages! :facepalm:

 

It was interesting that it was considerably faster to loop, convert to strings, then convert back to U8's rather than go straight 2D U8's into the function, even setting parallelization aside.

 

 

Apologies on the tremendous memory usage... I have a boatload of RAM, so I added an Always Copy before each trial to try to prevent the compiler from doing any shenanigans.

0 Kudos
Message 5 of 12
(226 Views)

One thing that puzzles me is why the LUT is (very slightly) faster than the ToUpper primitive.

 

I would assume that it also incorporates a simple LUT, but maybe there is a slight calling overhead.

0 Kudos
Message 6 of 12
(217 Views)

@GerdW wrote:

ToUpper does a lot more than just subtracting 0x20 for U8 values between 0x61…0x7A, as shown with German umlauts:


Here's the full LUT:

 

altenbach_1-1762464216729.png

 

 

I assume there are some industry conventions, e.g. xFF is x9F in "uppercase". 😄

0 Kudos
Message 7 of 12
(206 Views)

Hi Christian,

 

0x9F = Ÿ is uppercase of 0xFF = ÿ

 

Seems totally ok for me…

 

(No, this is not a German umlaut. 🙂 )

Best regards,
GerdW


using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019
Message 8 of 12
(180 Views)

The "branchless" hex-formatted text to ASCII routine:

Prerequisites:

  • uppercase chars A…F
  • no error handling for invalid chars (only 0…9, A…F is expected)
  • input should contain an even number of chars…
Best regards,
GerdW


using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019
0 Kudos
Message 9 of 12
(160 Views)

@GerdW wrote:

Hi Christian,

 

0x9F = Ÿ is uppercase of 0xFF = ÿ

 

Seems totally ok for me…

 

(No, this is not a German umlaut. 🙂 )


Yes, I tested this and noticed. Here are some of the countries where it is used. The curiosity was that the distance between upper and lowercase is significantly higher than for any other pair. They must have run out of slots. 😄

 

Lettercase conversion will probably be more complicated with unicode... Haven't looked into it.

 

 

0 Kudos
Message 10 of 12
(132 Views)