LabVIEW Idea Exchange

Petru_Tarabuta

Many native VIs that are currently Non-reentrant should be Preallocated clone

Status: New

Problem: Many native VIs use the Non-reentrant execution setting.

 

Solution: The vast majority of native VIs should use the Preallocated clone reentrancy setting.

  • The native VIs that need to use Non-reentrant or Shared clone are few and far between - they should be identified on a case-by-case basis. Their Context Help and/or Detailed Help should explain why they need to be set to Non-reentrant or Shared clone.

The following is a selection of vi.lib VIs that should use Preallocated clone. This selection is meant to serve as a starting point and is not comprehensive.

 

[Three screenshots: a selection of vi.lib VIs that should be set to Preallocated clone]

 

Notes:

  • This idea is related to: The reentrancy of new VIs should be "Preallocated clone". Both argue in favour of using the Preallocated clone setting more.
  • A significant number of native VIs are already configured to use Preallocated clone, which is great.
  • There are curious cases where closely related VIs are set to different reentrancy settings. For example, Color to RGB.vi is rightly using Preallocated clone, while RGB to Color.vi is Non-reentrant. Similarly, Trim Whitespace.vi is rightly Preallocated clone, while Normalize End Of Line.vi - which lives next to it on the String palette - is Non-reentrant.
    • This suggests that the reentrancy setting of some native VIs was chosen haphazardly. This needs to be rectified.
  • The fact that so many native VIs are non-reentrant partly defeats LabVIEW's remarkable ability to create parallel code easily. Loops that are supposed to be parallel and independent become dependent on each other when they call multiple instances of these non-reentrant native VIs. It is as if "hidden semaphores" were added between the various call sites, leading to less performant applications (more CPU cycles, longer execution time, larger compiled EXE code size). See the sketch below.
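
To make the "hidden semaphores" picture concrete, here is a minimal sketch in Go (an analogy only, with invented names; LabVIEW's scheduler is not literally a Go mutex): the non-reentrant case is modelled as a single shared data space behind one lock, and the preallocated-clone case as one data space per parallel loop.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const iters = 2000

var (
	mu     sync.Mutex
	shared [1 << 15]byte // the single data space every caller shares
)

// nonReentrantWork: all callers queue on one mutex, like parallel loops
// queueing on the single data space of a Non-reentrant VI.
func nonReentrantWork() {
	mu.Lock()
	defer mu.Unlock()
	for i := range shared {
		shared[i]++
	}
}

// cloneWork: each call site owns its own data space, like a
// Preallocated clone VI, so parallel loops never wait on each other.
func cloneWork(buf *[1 << 15]byte) {
	for i := range buf {
		buf[i]++
	}
}

func run(work func(id int)) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for n := 0; n < iters; n++ {
				work(id)
			}
		}(g)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	var clones [4][1 << 15]byte // one preallocated data space per loop
	fmt.Println("non-reentrant:", run(func(int) { nonReentrantWork() }))
	fmt.Println("clones:       ", run(func(id int) { cloneWork(&clones[id]) }))
}
```

On a multi-core machine the first timing is typically several times larger than the second; that gap is the cost of the "hidden semaphore".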
12 Comments
fefepeto_kb
Member

Nice idea, but it might be rushing to conclusions.

Having preallocated clones seems like all benefit, but it also has downsides. Let's begin with the obvious one: memory consumption. If clones were preallocated for 100 call sites, they would consume a lot of memory, which might not be desirable in applications that scale large.

The other downside might be a bit less obvious: dynamic allocation cost. Yes, it does happen, especially with the factory pattern. Every time a new instance is created, everything it needs has to be opened, and that obviously takes time.

 

More importantly, some code executes faster than a new instance can be created, e.g. the Is Path and Not Empty VI.

 

But, to be realistic, the first two points are not how it happens in real life. Even with preallocated clones, LabVIEW tries to reduce the hardware resource footprint, so it estimates how many concurrent executions might be needed. It then opens a number of clones, which you can expect to top out at the number of available logical cores. More often than not, fewer clones are enough, about 2-3 depending on how long the VI executes.
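
As a rough analogy in Go (my reading of the behaviour described above, not documented LabVIEW internals; clonePool and its methods are invented names), a pool that allocates instances lazily and caps them at the logical-core count could look like this:

```go
package main

import (
	"fmt"
	"runtime"
)

// clone stands in for one clone's data space.
type clone struct{ buf []byte }

// clonePool hands out at most NumCPU live clones and allocates each one
// lazily, on first use.
type clonePool struct {
	slots chan *clone
}

func newClonePool() *clonePool {
	p := &clonePool{slots: make(chan *clone, runtime.NumCPU())}
	for i := 0; i < cap(p.slots); i++ {
		p.slots <- nil // nil marks "slot free, clone not yet allocated"
	}
	return p
}

func (p *clonePool) acquire() *clone {
	c := <-p.slots // blocks while every clone is busy
	if c == nil {
		c = &clone{buf: make([]byte, 4096)} // allocated on first use
	}
	return c
}

func (p *clonePool) release(c *clone) { p.slots <- c }

func main() {
	p := newClonePool()
	c := p.acquire()
	fmt.Printf("at most %d clones, this one has a %d-byte data space\n",
		cap(p.slots), len(c.buf))
	p.release(c)
}
```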

 

Another aspect: VIs that actually act on files are restricted by the OS's ability (and the underlying hardware's ability) to parallelize those operations, which is usually not very parallel.

 

The other thing to consider is whether the access is to the same memory address or not. If we try to access the same INI file in parallel loops, the access will effectively always be non-reentrant, since the content of the INI file is stored in a queue, which won't allow parallel reads as far as I know. The only reentrant application of the config file VIs would be to use two separate files in the two separate loops. I'm not saying it cannot happen, but an INI file is supposed to be the initialization file of the application, so there would usually be a single file, or maybe an additional file for user-specific settings. If further reads are done, parallelizing them wouldn't help.
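
A minimal Go sketch of that point, assuming the one-lock-per-file model described above (the configStore type and its API are invented for illustration): two loops on the same INI file still queue on its lock, while loops on different files proceed in parallel.

```go
package main

import (
	"fmt"
	"sync"
)

// configStore keeps one lock per INI file: loops reading the same file
// serialize on that file's lock; loops on different files run in parallel.
type configStore struct {
	mu    sync.Mutex             // guards the two maps below
	locks map[string]*sync.Mutex // one lock per file path
	data  map[string]map[string]string
}

func newConfigStore() *configStore {
	return &configStore{
		locks: map[string]*sync.Mutex{},
		data:  map[string]map[string]string{},
	}
}

func (s *configStore) read(path, key string) string {
	s.mu.Lock()
	l, ok := s.locks[path]
	if !ok {
		l = &sync.Mutex{}
		s.locks[path] = l
	}
	s.mu.Unlock()

	l.Lock() // callers touching the same file queue here
	defer l.Unlock()
	return s.data[path][key]
}

func main() {
	s := newConfigStore()
	s.data["app.ini"] = map[string]string{"key": "application setting"}
	s.data["user.ini"] = map[string]string{"key": "user setting"}

	var wg sync.WaitGroup
	for _, p := range []string{"app.ini", "user.ini"} {
		wg.Add(1)
		go func(path string) { // different files, different locks
			defer wg.Done()
			fmt.Println(path, "->", s.read(path, "key"))
		}(p)
	}
	wg.Wait()
}
```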

 

So the bottom line is that certain applications might benefit from parallel execution, but for some functions or applications parallelizing is not possible or not worth it.

Mads
Active Participant

If a VI:

a) Takes a noticeable time to execute (relative to the execution-time expectations of the scenarios it might be part of, including the risk of a time increase due to parallel use and the lack of reentrancy...),
b) Is natural to run in multiple parallel instances, and
c) Does not need (or can be redesigned to not need) access to values from previous runs...

 

then I agree... Based on this I would say quite a few of the VIs you mention should remain non-reentrant, but a VI that has to run potentially time-consuming searches, for example, would be better off reentrant.

Petru_Tarabuta
Active Participant

Hi fefepeto_kb and Mads. Thanks for your replies.

By way of replying, I am copying below several screenshots that illustrate a detailed conversation that took place on this topic in the #water-cooler channel of the LabVIEW Discord Server between 16 and 19 March 2025.

[Screenshots 1-3 of the Discord conversation]

Petru_Tarabuta
Active Participant

[Screenshots 4-6 of the Discord conversation]

Petru_Tarabuta
Active Participant

[Screenshots 7-8 of the Discord conversation]

fefepeto_kb
Member

OK. Here comes some very low-level, running-on-the-hardware stuff:

The way processors operate is very rudimentary: each core accesses two kinds of input, program code and program data. By the way, these are kept separate for this very reason in the L1 cache of modern CPUs.

The processor starts with a program counter value of 0, then enters a long-running cycle: fetch one instruction from the program code cache, execute it, increase the counter. If the instruction says so, it grabs data from the program data cache and performs the operation the instruction specifies.

So far so good.

Any low-level operation (moving data into a working register, adding values, moving the result back to memory) is defined in the CPU's datasheet or instruction set, such as x86.

(A + operation takes ~4 clock cycles in traditional execution.)

 

Great, but what happens if there are loops? Now we have to introduce the ability to jump within the code by operating on the program counter. If I have to skip ahead 10 instructions, I jump ahead 10 units (bytes, words, long integers, whatever the CPU is designed for) in the program cache by adding 10 to the program counter.

 

Even better: now we can have if-else statements, and believe it or not, CPUs still do loops this way even today.
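
A toy Go illustration of that point (the explicit goto is there only to mirror what compilers emit for an ordinary loop): the label is a position in the instruction stream, and the conditional jump is an operation on the program counter.

```go
package main

import "fmt"

func main() {
	sum, i := 0, 0
loop: // a position in the instruction stream
	if i < 10 { // compare, then fall through when false
		sum += i
		i++
		goto loop // jump back by adjusting the program counter
	}
	fmt.Println(sum) // prints 45
}
```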

 

But what about creating maintainable code? Instead of repeating every complex operation by copying it or looping through, we could call it as a function, procedure, or method.

Well, CPUs handle these operations with another hardware feature called the stack. The stack is a limited-size LIFO buffer where the processor stores the program counter every time it has to halt execution of the current function and jump into another one. This is also what happens for recursion. Every time a function "returns", execution jumps back into the previous function at the program counter stored as the last element on the stack.
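
A tiny Go sketch of the same idea, assuming nothing beyond the call-stack behaviour just described:

```go
package main

import "fmt"

// Each call pushes a return address (and locals) onto the stack; each
// return pops it and resumes the caller at the saved program counter.
func factorial(n int) int {
	if n <= 1 {
		return 1 // pop: resume the caller where it left off
	}
	return n * factorial(n-1) // push: save where to resume, jump in
}

func main() {
	fmt.Println(factorial(5)) // prints 120
}
```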

 

Great, we can run the code easily now, but what happens when we load code into memory? Well, it comes from the HDD, SSD, USB, or another memory space, but in general it gets loaded into RAM, then moved into the L3 cache, then propagated down through the L2 cache and finally into the L1 cache for execution.

Great, so this surely takes loads of time. Well, it depends. If the code is already in memory, it can be duplicated for execution using memcpy, a fast copy routine that duplicates things in RAM. The problem: it's a CPU operation too, and power hungry. Although it is fast, RAM is not a zero-consumption device. And copying it in RAM is not enough for execution; then begins the sequence of pushing it down through the caches. Thankfully, the processor's own scheduling hardware takes care of this. Yay, another hardware module.

 

But when data has to be shared between CPU cores, it has to travel from the L1 cache of one core to the L3 cache and then back up to the L1 of the other core. Although this takes no CPU instruction per the instruction set, it still takes time to move the data.

 

And to confuse things even further, there are the operating system's own functions, which might handle other hardware such as the storage devices mentioned before.

 

Now, putting this all together: many things on the palette are just operators, native ones that might take 3-4 CPU cycles to execute without the overhead of calling another function.

Calling functions, or in our case subVIs or OS libraries, takes time for allocation if the code is not already in memory (a dynamic call, or even conditional execution like a case structure), plus the overhead of the function call itself on the stack, plus data transfer between cores when required.

 

Although LabVIEW is inherently parallel, this is also a dangerous proposition: it can create far more threads, prone to data-allocation and sharing overhead, than careful thread management that enables parallel execution only where it is needed.

 

By the way, since the OS manages memory allocation, the most CPU-efficient apps, and the ones that execute quickest, are those that can avoid memory reallocations. This is why VIs with short execution times, like the string end-of-line normalization, perform better when non-reentrant or inlined. Preallocating them actually decreases performance.

 

Also, very large preallocated code behind conditional execution can end up dynamically allocated, which, again, might make the operation slower than a non-reentrant mutexing solution.

 

Bottom line: as others have already said, if there is a problem, measure it and fix it. If there is no problem, no fix is needed.

Petru_Tarabuta
Active Participant

"Many thing on the palette are just operators, native ones that might be 3-4 CPU cycles to execute without the overhead of calling another function." - I agree that a sizeable subset of primitives are probably mapped directly to CPU instructions. Basic math primitives such as Add, Subtract, Multiply, Divide, perhaps Square and Square Root translate directly to a single CPU instruction each. Basic Boolean primitives such as And, Or, Exclusive Or translate to a single CPU instruction each. Some comparison functions such when used with primitive data types, such as comparing two I32's. Perhaps a few memory-management primitives such as Swap Values or Request Deallocation translate to a single CPU instruction each.

 

But many primitives perform complex operations and do not map to a single CPU instruction. For example, performing an Equal? operation on two complex data structures (1D array of typedef clusters). Or string manipulation primitives such as Replace Substring, Search and Replace String, Match Pattern, etc. Or collection primitives such as Insert Into Map, Remove From Map, Insert Into Set, Remove From Set, etc. Or primitives that require OS resources such as TCP Read, TCP Write, Read from Text File, Write to Text File.

"Although LabVIEW is inherently parallel, this is also a dangerous proposition: it can inherently create way more threads prone for data allocation and sharing overhead than managing the threads carefully and setting parallel execution only where it is needed." - VIs that are set to preallocated clone DO NOT force the creation of new threads. The compiler/scheduler still has control of how many threads are created. The compiler/scheduler can decide if enough threads have been created. For example, the same code may execute using 8 threads on a CPU that has 8 logical cores, and may execute using 16 threads on a CPU that has 16 logical cores. It's up to the compiler/scheduler to make this decision. What a non-reentrant VI does though is it forces parallel threads that require that VI to WAIT ON ONE ANOTHER. This introduces an interaction or a dependency between two threads that could otherwise be completely independent of one another. It introduces more work for the scheduler - it now needs to ensure that the two threads are using the same shared resource one at a time.

"BTW, since the OS manages the memory allocation the most CPU effective apps, and the ones that execute quickest are which can avoid memory reallocations." - The point of preallocated clones is that they are pre-allocated. There is no memory reallocation necessary. They are pre-allocated when the code is compiled. When the code is running the runtime engine doesn't bump into a preallocated VI and says "Oh, I didn't know you were here, I need to allocate you some memory". That's the point of pre-allocated (as opposed to shared clone) - the memory allocation is done in advance, so there are no memory reallocations needed.

"This is why VIs with short execution time, like the string end normalization are performing better when non-reentrant or in-lined. Making preallocations for it actually decreases performance." - I disagree. I don't think Normalize End of Line.vi executes better when it is non-reentrant. I think if you had two loops that both were calling that VI repeatedly you will notice that the two loops run slower when the VI is non-reentrant vs. when it is preallocated clone. I have not benchmarked this, so my opinion is based purely on deductive reasoning. Have you got benchmarks to support the statement quoted above?

fefepeto_kb
Member

"VIs that are set to preallocated clone DO NOT force the creation of new threads." I agree. Programmers who misunderstand the consequences of parallelism do create those threads. But the following is not true: "the same code may execute using 8 threads on a CPU that has 8 logical cores, and may execute using 16 threads on a CPU that has 16 logical cores", simply because the scheduler in the OS doesn't care about the number of cores. Even in times of windows XP an application might have had over 100 threads easily. With DQMH or Actor framework a LabVIEW application can easily scale to 1000 threads. It is not about how many can run in parallel on the CPU, rather about how many data sets have to be maintained separately, and unfortunately each actor will create at least 2 threads.

This also comes from the perspective that the number of threads is limited to the CPU cores: "What a non-reentrant VI does though is it forces parallel threads that require that VI to WAIT ON ONE ANOTHER". Assuming there are more threads than CPU cores (and there more than likely will be), they have to wait on one another anyway. It's just hidden by the OS.

"It introduces more work for the scheduler - it now needs to ensure that the two threads are using the same shared resource one at a time." Here we are discussing on who lifts the weight, the CPU scheduler, or the OS scheduler. Even if an application runs only on 6 threads, there is no guarantee that no other application will make it wait. There won't be any determinism and the potential gains by allocating a new clone for each place in code are not proportional to the number of parallel calls. If deterministic execution is a goal, then we shall move to RT.

 

From this perspective it pretty much seems that you have a particular use case in mind: a heavily parallelized architecture based on the queued message handler pattern. And yes, there, some VIs might gain from preallocation, but this is a niche use case compared to the other architectural patterns available.

 

"The point of preallocated clones is that they are pre-allocated. There is no memory reallocation necessary. They are pre-allocated when the code is compiled." This simply cannot be true. If I have an encoded string sent between the actors, and I have an encoder that dynamically selects the algorithm based on who sent the message, and have 16 encryptions implemented, then all 16 algorithms will be loaded in the memory, with their data space in all 100 or more actors? No, it would prefill the memory with garbage, so they won't. LabVIEW will try to open clones up the the amount of CPU cores, and if needed even above it because of the data separation more instances will be created, dynamically. Here is an article that exactly describes the preallocated clones being created when the calling VI is loaded: Differences Between Reentrant, Template, and Dynamic VIs - NI. Unless you have all the VIs loaded at the startup of the application, which would make it very slow and all the cases preloaded, there will be dynamic allocations.

 

"I think if you had two loops that both were calling that VI repeatedly you will notice that the two loops run slower when the VI is non-reentrant vs. when it is preallocated clone." If strictly fixed for two loops, no subVI calls and the Normalize End of Line.VI is the only thing the loops call maybe. To be honest, depending on the size of the loops it can arguably be faster with a single loop. If there is only one iteration for each loop single loop will be faster. Maybe even up to 4-4 iterations. At 10-10 iterations they will be similar. Above 10 potentially the parallel loops will be faster.

 

I'm not saying preallocation doesn't make sense. I'm trying to say that we should use the right tool for the job. But if this VI is preallocated, then people who call it in ordinary code gain nothing from it; indeed, their performance might deteriorate. Now they have to deal with the allocation time, because loading the calling VI loads all the clones of this subVI into memory. What if they have sequential code that calls it in 5 places? That is 5 times the subVI load overhead for zero benefit.

 

"Have you got benchmarks to support the statement quoted above?" I made benchmarks back when I worked at NI, and also the performance oriented training contained counter arguments for preallocation, based on measurement. Unfortunately that training is gone. But I sustain that every use case is different, and have the feeling of a strong bias of a heavily parallelized architecture here. I have no problems with those architectures but not everyone works in such projects and not everyone benefits from the same settings.

 

Mads
Active Participant

On a side note: when we have a preallocated reentrant VI on a diagram in edit mode and you double-click to open it, it should show the original/prototype, whereas if you are running the code, the double-click should open the clone instance of that call... but this is not what happens (LV 2024); it always opens a clone and you have to hit Ctrl+M to get to the original... tedious.

 

I was sure there was a shortcut to avoid this and go straight to the original... Ctrl+double-click, for example, very conveniently gets you straight to the block diagram, but I seem to have imagined that(?). If I am right, there are two helper ideas here: 1) always go to the original on double-click while not running, and/or 2) add a key+double-click shortcut to get to the original in any case...

wiebe@CARYA
Knight of NI

>On a side note; when we do have a prealloced reentrant VI on a diagram in edit mode and you double-click to open it it should show the original/prototype, wheras if you are running the code the double-click should open the clone instance of that call...but this is not what happens (LV 2024); it always open a clone and you have to hit Ctrl+M to get to the original...tedious.

 

Opening the clone has its use cases.

 

The clone does retain state, so it is helpful when debugging after the main VI has finished execution.

 

There is a way (tedious, indeed) to switch from the clone to the original, but if the original opened instead, there would be no way to get back to the clone.

 

I think this is by design, and I think there would be complaints if it got 'fixed'.