04-08-2025 03:00 AM - edited 04-08-2025 03:01 AM
@altenbach wrote:
Performance is fantastic, with a linear speedup with the number of CPU cores.😄
To be honest, I slightly disagree about the performance, with all due respect to LabVIEW. The multithreading in LabVIEW is fantastic, that is true. It is really simple to drop multiple while loops or parallel for loops on the BD, and they execute in parallel with moderate overhead. I don't know of any other programming language that allows us to do this so simply, by default, out of the box (without adding OpenMP or rayon, etc.).
However, the performance of the native LabVIEW code is far from "fantastic" when we are talking about computation-intensive operations on large arrays or complicated structures, for example. I can beat almost any LabVIEW code by rewriting it in C, compiled with a good modern compiler like Visual Studio or Intel OneAPI. Additionally, I can utilize modern CPU instructions like AVX2/AVX512, either with intrinsics or directly in Assembly. I have demonstrated this many times — 1, 2, 3, 4, 5, 6 — and I am ready to demonstrate it again. For example, I will compare the native LabVIEW SHA-256 implementation against simple Rust code, which is just the 10 lines below:
use std::ffi::c_void;
use std::slice;
use sha2::{Digest, Sha256}; // the sha2 crate provides the SHA-256 implementation

#[unsafe(no_mangle)]
pub extern "C" fn compute_sha256(input_ptr: *const c_void, input_len: usize, output_ptr: *mut u8) {
    // View the caller's input buffer as a byte slice (the caller guarantees pointer/length validity).
    let input = unsafe { slice::from_raw_parts(input_ptr as *const u8, input_len) };
    let mut hasher = Sha256::new();
    hasher.update(input);
    let result = hasher.finalize();
    // Copy the 32-byte digest into the caller-provided output buffer.
    unsafe {
        let output_slice = slice::from_raw_parts_mut(output_ptr, 32);
        output_slice.copy_from_slice(&result[..]);
    }
}
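Before benchmarking, such a wrapper can be sanity-checked on the Rust side with a quick test like the sketch below (just an illustration I am adding here, not part of the benchmark; it compares the digest of "abc" against the standard FIPS test vector):

#[cfg(test)]
mod tests {
    use super::*;
    use std::ffi::c_void;

    #[test]
    fn abc_digest_matches_known_vector() {
        let input = b"abc";
        let mut out = [0u8; 32];
        compute_sha256(input.as_ptr() as *const c_void, input.len(), out.as_mut_ptr());
        // SHA-256("abc") = ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
        assert_eq!(out[0], 0xba);
        assert_eq!(out[31], 0xad);
    }
}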
The result on my Xeon W5-2445, which natively supports SHA instructions, is a 28x speedup:
(Without an SHA-compatible CPU, you will get a factor of 5-6, I guess). The relatively poor performance of the native LabVIEW code is compensated by multithreading in some cases, but if you need really fast computation and to utilize full CPU power, then the "bottlenecks" need to be rewritten and replaced by DLL calls.
I still have a lot of fun using LabVIEW (this year is an anniversary for me — I've been with LabVIEW for 25 years, since v6.0i). Nowadays, I have even more fun with Rust and AI, and combining them with LabVIEW. It is an awesome time for every curious software engineer!
04-08-2025 06:39 AM
@Andrey_Dmitriev wrote:
@altenbach wrote:
Performance is fantastic, with a linear speedup with the number of CPU cores.😄
(Without an SHA-compatible CPU, you will get a factor of 5-6, I guess). The relatively poor performance of the native LabVIEW code is compensated by multithreading in some cases, but if you need really fast computation and to utilize full CPU power, then the "bottlenecks" need to be rewritten and replaced by DLL calls.
I still have a lot of fun using LabVIEW (this year is an anniversary for me — I've been with LabVIEW for 25 years, since v6.0i). Nowadays, I have even more fun with Rust and AI, and combining them with LabVIEW. It is an awesome time for every curious software engineer!
If you compile LV code to a DLL, will it improve performance?
04-08-2025 07:00 AM
@Yamaeda wrote: If you compile LV code to a DLL, will it improve performance?
Basically, no. Moreover, you will get additional overhead from the DLL call and the associated penalties, but the machine code will be the same for equivalent LabVIEW code in an executable or in a DLL. The reason LabVIEW-generated code is slower is additional checks, extra memory allocations, and some other factors. For example, if you have an array with 10 elements and try to access the 11th element, which is out of range, nothing happens in LabVIEW (you will get a default value). In C, however, such an access is undefined behavior (often a crash), and in Rust safe indexing will panic (only unsafe code lets you skip the check). It is the same for Python or C#: high-level or managed languages add some luxury, but not for free. Therefore, we use numpy or high-performance math libraries and must be very careful when mixing managed and unmanaged code. Also, LabVIEW (which is itself written in C++) is still compiled with VS2015, which is quite old. But overall, the performance is not so bad for such a graphical environment. The easiest way to check is to compile more or less equivalent code into a DLL from LabVIEW and from C, then disassemble both using Ghidra or IDA and compare them side by side; you will immediately see a huge overhead, just in the number of CPU instructions used to achieve the "same goal".
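To illustrate the bounds-checking difference with a quick Rust sketch (this only shows the language semantics, not what the LabVIEW compiler generates):

fn main() {
    let a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];

    // Safe indexing is bounds-checked at runtime; a[10] would panic here,
    // while LabVIEW's Index Array silently returns the type's default value.
    // let x = a[10]; // panics: index out of bounds

    // The checked accessor makes the "default value" behavior explicit:
    let y = *a.get(10).unwrap_or(&0); // y == 0, like LabVIEW's default
    println!("{y}");

    // unsafe { a.get_unchecked(10) } would skip the check entirely,
    // which is undefined behavior, just like an out-of-bounds read in C.
}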
04-08-2025 03:23 PM
@Andrey_Dmitriev wrote:
@altenbach wrote:
Performance is fantastic, with a linear speedup with the number of CPU cores.😄
However, the performance of the native LabVIEW code is far from "fantastic" when we are talking about computation-intensive operations on large arrays or complicated structures, for example. I can beat almost any LabVIEW code by rewriting it in C, compiled with a good modern compiler like Visual Studio or Intel OneAPI.
The core computation in this particular case is actually a DLL built from 30-year-old Fortran code, and rewriting it from scratch in any modern language, even in native LabVIEW, would take me years. (In the nineties, I was actually running an early version on a VAX-11/780 using a serial terminal with graphics capabilities. One spectrum took about a minute and now it is milliseconds. 😄 )
Sure, I can spend months debugging and optimizing text code to shave a few ms, or I can write it in LabVIEW in a day, fully debugged, and I even get a UI for free. 😄
OTOH, my LongDistances program is pure LabVIEW and runs circles around any Matlab competitors, especially for direct user interaction and instant recalculations. Even a non-negative Tikhonov regularization takes milliseconds for any reasonably sized problem (~100k matrix elements). A 10x speed increase would not even be noticeable 😄
I am curious what performance improvements you actually see. 10%? 2x? 10x?
04-09-2025 03:30 AM
@altenbach wrote: I was actually running an early version on a VAX-11/780 using a serial terminal with graphics capabilities. One spectrum took about a minute and now it is milliseconds. 😄
Wow, VAX-11/780! This machine was installed next door in our lab where I wrote control and analysis software for an x-ray diffractometer on a DEC PDP-11 clone (love this) using the RT-11 FB OS, more than 30 years ago. I think it had an 8 MHz CPU and 56 kB RAM, equipped with a 1 MB external RAM drive and a 5 MB HDD. It was not fast, but I was able to use Foreground/Background functionality, so the operator was able to perform analysis of the diffraction curves in parallel with the experiment.
Well, optimization is obviously not necessary everywhere; as Donald Knuth said, "premature optimization is the root of all evil," and he is perfectly right. In my usual cases, I need to perform some computations on gigabytes of x-ray images, and here working with classical LabVIEW arrays is a no-go. I use IMAQ, sometimes Intel IPP, sometimes OpenCV, and sometimes develop my own algorithms in C using SIMD instructions. Sometimes I use the GPU. Nowadays it makes no sense to develop directly in Assembly; modern compilers are efficient enough, but I still can.
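Just to give a flavor of what such a hand-written SIMD kernel looks like, here is a hypothetical sketch (in Rust with std::arch intrinsics rather than C, and not my production code; the names add_offset_avx2 and add_offset are only for the example). It adds a constant offset to every 8-bit pixel with saturation, 32 pixels per AVX2 iteration:

// Illustrative only; x86_64 targets only.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_offset_avx2(pixels: &mut [u8], offset: u8) {
    let v_off = _mm256_set1_epi8(offset as i8);
    let mut chunks = pixels.chunks_exact_mut(32);
    for chunk in &mut chunks {
        let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        let sum = _mm256_adds_epu8(v, v_off); // saturating unsigned byte add
        _mm256_storeu_si256(chunk.as_mut_ptr() as *mut __m256i, sum);
    }
    // Scalar tail for the last len % 32 pixels.
    for p in chunks.into_remainder() {
        *p = p.saturating_add(offset);
    }
}

#[cfg(target_arch = "x86_64")]
fn add_offset(pixels: &mut [u8], offset: u8) {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: the AVX2 path is only taken when the CPU reports AVX2.
        unsafe { add_offset_avx2(pixels, offset) }
    } else {
        // Plain scalar fallback; a good compiler often auto-vectorizes this anyway.
        for p in pixels.iter_mut() {
            *p = p.saturating_add(offset);
        }
    }
}

The runtime dispatch is the important part: the vector path runs only when the CPU actually reports the feature, otherwise the scalar fallback is used.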
The downside is time consumption — developing complicated math from scratch can take a huge amount of time for programming and debugging. LabVIEW code is very helpful for prototyping in this case. Additionally, highly optimized code is hard to support and modify, plus we get additional dependencies (if I use Intel OneAPI, then I'll need an additional runtime, etc.), which increases the cost of long-term support.
The expected performance improvement depends on the algorithm and the amount of data. DLL calls add overhead, and for very short "few milliseconds" computations, LabVIEW code inlined into the caller can be comparable to or even faster than a DLL. For example, with SHA-256 running over only 1 kB of data, the improvement is not so impressive — about 5-7x. But the absolute execution time of the LabVIEW code is around 70 µs, and it makes absolutely no sense to reduce this to 10 µs — that is not noticeable, as you said. Additionally, if the LabVIEW code uses a lot of math functions (like mean, median, fitting, etc.), which are already wrapped in a DLL (I mean lvanlys.*), then it will be hard to improve, obviously.
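For reference, this is roughly how such a per-call number can be measured on the Rust/DLL side (a sketch only, reusing the compute_sha256 wrapper from above; it deliberately averages over many iterations, and it does not include the Call Library Function Node overhead on the LabVIEW side):

use std::ffi::c_void;
use std::time::Instant;

fn main() {
    let data = vec![0u8; 1024]; // 1 kB of input, as in the example above
    let mut out = [0u8; 32];
    let iters: u32 = 100_000;

    // A single call in the microsecond range is too short to time reliably
    // in one shot, so average the elapsed time over many iterations.
    let t0 = Instant::now();
    for _ in 0..iters {
        compute_sha256(data.as_ptr() as *const c_void, data.len(), out.as_mut_ptr());
    }
    println!("~{:?} per 1 kB hash", t0.elapsed() / iters);
}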
Usually, I do not participate in optimization contests, but last summer was an interesting exception. In the "Word Search" challenge, I noticed that C code was not prohibited. I grabbed suitable open-source code from GitHub, wrapped it into a DLL, and was immediately in first place. But then Derrick wrote to me that an external DLL is not acceptable (for safety reasons), yet I was still able to reach second place with pure LabVIEW code and a series of "dirty tricks." At the last minute, Greg McKaskle beat me; he was less than 1 ms ahead. But then I took revenge in the second challenge, Wordle, where I won first place (although that challenge was not about performance).
But in general, I can accept almost any optimization challenge with pleasure. If you share pure LabVIEW code as a benchmarked snippet, then I can try to estimate the possible improvement and optimize it (or at least try to), but only in more or less trivial cases. Last winter, I optimized Gauss-Lobatto and got around a 70x improvement on an 8-core CPU, but I was very lucky because the C source code was already developed and shared on the internet; otherwise, this task would simply not have fit into a coffee break. Developing such code from scratch is probably an interesting exercise, but as a student, I had more than enough such exercises in numerical methods.
04-09-2025 10:57 AM
@Andrey_Dmitriev wrote:
@altenbach wrote: I was actually running an early version on a VAX-11/780 using a serial terminal with graphics capabilities. One spectrum took about a minute and now it is milliseconds. 😄
Wow, VAX-11/780! This machine was installed next door in our lab where I wrote control and analysis software for an x-ray diffractometer on a DEC PDP-11 clone (love this) using the RT-11 FB OS, more than 30 years ago. I think it had an 8 MHz CPU and 56 kB RAM, equipped with a 1 MB external RAM drive and a 5 MB HDD. It was not fast, but I was able to use Foreground/Background functionality, so the operator was able to perform analysis of the diffraction curves in parallel with the experiment.
For my thesis work in Switzerland, I actually used an LSI-11 version of the PDP-11 with a Tektronix vector terminal, where text editing was so much "fun". Basically, you stared at the text, counted where you needed to make an edit, gave special one-letter commands to insert or delete some characters, then refreshed the document to see if you did it right. Pressing <esc><esc> to go back to command mode became almost second nature; maybe you remember those times. (I don't remember most of the details, though.)
Way before that, in high school, things were even less interactive. We programmed with a #2 pencil on specially lined paper, later waited in line for a punch-card machine (one card per line), took the card stack to the city and fed it into a CDC 3200 to run it (a room full of cabinets, 92 kB(!), or 32k of 24-bit words), then waited for the output from a gigantic, noisy printer. If you made a mistake, a retry would have to wait for another week. 😄
In comparison, the absence of an "undo" in LabVIEW 4.0 was not really an issue. The motto was always "think twice...wire once!".
Yes, I remember the word search challenge and actually participated. Was Greg's code ever made public?
04-09-2025 11:46 PM - edited 04-10-2025 12:11 AM
@altenbach wrote:
Yes, I remember the word search challenge and actually participated. Was Greg's code ever made public?
We exchanged some code on Discord after the contest, but I haven’t seen Greg’s code. As for my solution (which I don’t have on hand at the moment), as far as I remember, I first rearranged the input data to achieve sequential memory access to vertical and diagonal elements. Then, I performed the search using parallel loops. The rest of the work was essentially «code balancing» based on time feedback (a kind of «reinforcement machine learning»).
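To give a rough idea of the rearrangement step, here is an illustrative sketch in Rust (not the contest code, which was pure LabVIEW; the function name is only for the example): copy each column and each diagonal of the row-major character grid into its own contiguous buffer once, so that every later substring search walks memory sequentially.

// Build contiguous lines for the vertical and one diagonal direction of a
// row-major rows x cols grid; reversed lines and the other diagonal
// direction follow the same pattern.
fn columns_and_diagonals(grid: &[u8], rows: usize, cols: usize) -> Vec<Vec<u8>> {
    let mut lines = Vec::new();

    // Columns: element (r, c) lives at index r * cols + c.
    for c in 0..cols {
        lines.push((0..rows).map(|r| grid[r * cols + c]).collect());
    }

    // "\" diagonals: cells where c - r is constant.
    for d in 0..(rows + cols - 1) {
        let shift = d as isize - (rows as isize - 1);
        let diag: Vec<u8> = (0..rows)
            .filter_map(|r| {
                let c = r as isize + shift;
                (c >= 0 && (c as usize) < cols).then(|| grid[r * cols + c as usize])
            })
            .collect();
        lines.push(diag);
    }
    lines
}

Each of these buffers (plus the original rows) can then be searched independently, which is what makes the parallel loops effective.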
However, in real-life scenarios, the LabVIEW code would be the last thing I’d optimize. The obvious and more or less unavoidable bottleneck was the string search primitive itself. If it were replaced with a faster version (e.g., StringZilla)
#include "extcode.h"                  // LabVIEW CIN/DLL types: LStrHandle, int32_t, ...
#include <stringzilla/stringzilla.h>  // StringZilla: sz_string_view_t, sz_find_avx2, ...
// SEARCHSTR_API (export macro) and GUARD (handle sanity checks) are defined in the wrapper's own header.

SEARCHSTR_API int32_t fnSearchStrAVX2(LStrHandle str, LStrHandle pattern)
{
    GUARD(str, pattern)
    // Wrap the LabVIEW string handles as haystack and needle views.
    sz_string_view_t haystack = { (sz_cptr_t)((*str)->str), (sz_size_t)(*str)->cnt };
    sz_string_view_t needle = { (sz_cptr_t)((*pattern)->str), (sz_size_t)(*pattern)->cnt };

    // AVX2-accelerated substring search; returns a pointer into the haystack, or NULL if not found.
    sz_cptr_t substring_position = sz_find_avx2(haystack.start, haystack.length, needle.start, needle.length);

    if (substring_position) return (int32_t)(substring_position - haystack.start);
    else return -1;
}
we could achieve a 25x boost in performance on a reasonable amount of data. For example, on my i7-4940MX execution time could drop from 2 seconds to just 80 milliseconds, as shown in the example below:
Of course, I tried implementing different approaches (from this perspective, the challenge was an excellent refresher on Rabin-Karp, Knuth–Morris–Pratt, and Boyer–Moore algorithms), but throughout the process, I couldn’t avoid the feeling that this challenge was like «heavy trucks racing» while I had a Ferrari sitting in my garage.
04-10-2025 07:41 AM
@Andrey_Dmitriev wrote:
we could achieve a 25x boost in performance on a reasonable amount of data. For example, on my i7-4940MX execution time could drop from 2 seconds to just 80 milliseconds, as shown in the example below:
Of course, I tried implementing different approaches (from this perspective, the challenge was an excellent refresher on Rabin-Karp, Knuth–Morris–Pratt, and Boyer–Moore algorithms), but throughout the process, I couldn’t avoid the feeling that this challenge was like «heavy trucks racing» while I had a Ferrari sitting in my garage.
I did some tests, and Search and Replace is twice as fast on my computer. I've always felt string functions are fairly slow, so I tried keeping it as a U8 array and doing the same, but the results were about as bad.
04-10-2025 09:53 AM
@Yamaeda wrote:
I did some tests, and Search and Replace is twice as fast on my computer. I've always felt string functions are fairly slow, so I tried keeping it as a U8 array and doing the same, but the results were about as bad.
One of my first cheap optimizations was to replace the plain string search with "Search/Split String", and performance improved dramatically.
In the end, I was #7, and at 0.0688s just about 10x slower than Greg (0.0064s). Code attached below.
OTOH, the lowest score was 203s, and the scores spanned more than four orders of magnitude! This also means that well-written LabVIEW code will be many orders of magnitude faster than inefficient LabVIEW code, and any additional text-code effort is relatively marginal. 😄
All that said, personally I dislike challenges with pink wires, and I did not even participate in the Wordle. I loved the Reversi challenge and I am still frustrated that it has not even been scored yet. 😞 My dumb-but-fast implementation was able to generate a quarter million moves on my laptop. 😮
04-10-2025 12:43 PM
@altenbach wrote: All that said, personally I dislike challenges with pink wires
One of the first things I did in that challenge was convert to U8 arrays and then Search 1D Array did the heavy lifting for me. Got me to 4th place.