LabVIEW


How to control parallel threads?

Since "parallel" execution is naturally baked into LabVIEW, we are supposed "not to worry about it" and let the compiler do the job of distributing pieces of code onto actual OS threads, which are then handled by physical CPU cores. However, there are some performance-critical situations where I would like more control, or at least some consistent predictability, over the "threading" of a particular piece of code.

I assumed that code sections without data dependencies between them always execute on different threads in order to benefit as much as possible from multi-core CPUs. This is apparently not the case. It is true that these sections execute "in parallel", but only in the sense that it's not possible to predict which section would win a race condition; they may still be executed sequentially in a single thread (albeit in a "random" order). Here is a simple benchmark:

multithread1.png

The exponentiation of a large array is computationally demanding, so I would expect LabVIEW to use four threads, but the execution of this code takes exactly four times as long as the execution of a single instance of the exponentiation function. I tested it on a quad-core CPU, of course, and the CPU usage in the Task Manager also showed just one core being used. Now let's test the same example, but with explicit For loops instead:

multithread2.png

This time it's actually multi-threaded: on a quad-core CPU the execution time is almost as low as for a single exponentiation instance (and the Task Manager shows a correspondingly higher CPU load).

 

Is there a way to control this behavior and force sections of code onto separate threads when the iteration-parallelism feature of the For loop is impractical or inefficient to use?
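For readers coming from text-based languages, here is what "explicitly assigning independent sections to their own threads" looks like outside of LabVIEW. This is a hedged analogy in Python, not LabVIEW code: the function name `expensive` and the data are made up for illustration, and CPython's GIL serializes pure-Python work, so this sketch demonstrates the structure of explicit task placement rather than an actual four-core speedup.

```python
# A text-language analogy (Python, not LabVIEW) of handing independent
# code sections to separate worker threads explicitly, instead of hoping
# the compiler/scheduler parallelizes them automatically.
from concurrent.futures import ThreadPoolExecutor
import math

def expensive(data):
    # Stand-in for the exponentiation of a large array.
    return [math.exp(x) for x in data]

data = [0.001 * i for i in range(10_000)]

# Four independent "sections", each explicitly submitted to the pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(expensive, data) for _ in range(4)]
    results = [f.result() for f in futures]

# All four sections compute the same thing, just on separate threads.
assert all(r == results[0] for r in results)
```

In LabVIEW the scheduler makes this placement decision for you; the point of the thread above is that the decision is not always the one you would make by hand.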

Message 1 of 10

Try wrapping code in a single-iteration FOR loop.  I think that will guide the compiler to multi-thread.

 

As a note to readers, this is something you might do when you know two pieces of code will both take a long time; multi-threading fast code can actually cost you because of the threading overhead.

Message 2 of 10

@drjdpowell wrote:

Try wrapping code in a single-iteration FOR loop.  I think that will guide the compiler to multi-thread.


An In-place Element Structure might do the same thing.  The goal here is to make sure the "clumping" is separate.


Message 3 of 10

@crossrulz wrote:

@drjdpowell wrote:

Try wrapping code in a single-iteration FOR loop.  I think that will guide the compiler to multi-thread.


An In-place Element Structure might do the same thing.  The goal here is to make sure the "clumping" is separate.


I just tested it with the In-place Element Structure (replacing the For loops in the second example), and it doesn't work; the execution times just stack. I tried both normal tunnels for the array in/out and the "in place in/out element" terminals, with the same result. In contrast, a single-iteration For loop (like in the second example, but with normal tunnels instead of auto-indexing and with N = 1) successfully forces multi-threading. I also tried replacing the For loops with While loops (with True wired to the stop terminals), and they multi-thread only if the in/out array goes through shift registers, not through simple tunnels...

 

The question is: what are the predictable criteria for clumps to actually use different threads? It's definitely not memory size or execution time. In my first example above, each exponentiation primitive takes about 50 ms for a 1-million-element array, but the behavior is the same even for 100 million elements, where execution takes several seconds and a gigabyte or so of memory. What exactly forces multi-threading, and is it reliable, or does it depend on magical circumstances of the compiler?

Message 4 of 10

I note that the compiler doesn't know how large the arrays will be, thus its rules can't depend on array size.

Message 5 of 10

Just guessing here as always.

 

Maybe for a parallel loop the compiler uses different cores, while for arrays with elementary operations (add, subtract, math functions, etc.) it may instead be using the CPU's vectorized (SIMD) instruction set.
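The guess above, that a whole-array primitive may be handled by optimized low-level (possibly SIMD) code rather than by multiple threads, can be illustrated by a loose analogy in Python. To be clear, this is an assumption-laden sketch: Python's built-in `sum()` runs a compiled C loop, and whether actual SIMD instructions get used is up to the compiler, just as the LabVIEW case here is unconfirmed.

```python
# Loose analogy for the SIMD guess: one "whole-array" primitive executed
# by optimized low-level code vs. an explicit element-by-element loop.
# Both are single-threaded, yet the low-level primitive is much faster.
import time

data = list(range(1_000_000))

t0 = time.perf_counter()
total_builtin = sum(data)      # single low-level (C) primitive
t1 = time.perf_counter()

total_loop = 0
for x in data:                 # explicit scalar loop, same result
    total_loop += x
t2 = time.perf_counter()

assert total_builtin == total_loop
# Compare t1 - t0 against t2 - t1 to see the gap; exact timings vary
# by machine, so no specific numbers are claimed here.
```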

 

mcduff

Message 6 of 10

Instead of all that duplicate code, just use a parallel FOR loop.

 

pfor.png
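Conceptually, a parallel For loop splits the iterations (here, chunks of the array) across workers, runs the loop body on each chunk, and recombines the results in order. A minimal sketch of that split/process/recombine pattern in Python, with the chunk size and worker count chosen arbitrarily for illustration:

```python
# Sketch of what a parallel For loop does conceptually: split the data
# into chunks, map the loop body over the chunks in parallel, then
# recombine the partial results in the original order.
from concurrent.futures import ThreadPoolExecutor
import math

def body(chunk):
    # The "loop body": exponentiate one chunk of the array.
    return [math.exp(x) for x in chunk]

data = [0.0001 * i for i in range(8_000)]
n_workers = 4

# Split: one chunk per worker (LabVIEW does this behind the tunnels).
size = math.ceil(len(data) / n_workers)
chunks = [data[i:i + size] for i in range(0, len(data), size)]

# Map the body over the chunks, then flatten back into one array.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    parts = pool.map(body, chunks)
result = [y for part in parts for y in part]

assert result == [math.exp(x) for x in data]
```

The splitting and recombining steps are exactly the overhead discussed later in this thread; whether that overhead matters depends on how expensive the loop body is compared to the copying.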

Message 7 of 10

@drjdpowell wrote:

I note that the compiler doesn't know how large the arrays will be, thus its rules can't depend on array size.


Good point. So will a bunch of (independent) primitives on the block diagram always execute in the same thread? Is there any official documentation on the multithreading rules?

 


@altenbach wrote:

Instead of all that duplicate code, just use a parallel FOR loop.


Yes, that would work; my example was just a simple benchmark for "parallel" execution of independent parts of the block diagram, not real-world code. However, a For loop with the parallelization feature would create extra overhead, because the arrays have to be split and recombined, on top of the inherent overhead of splitting the iterations and assigning them to different threads. The latter might be the same if you have explicit "parallel" code on the block diagram anyway, but that's just guessing.

Message 8 of 10

@Novgorod wrote:

@drjdpowell wrote:

I note that the compiler doesn't know how large the arrays will be, thus its rules can't depend on array size.


Good point. So will a bunch of (independent) primitives on the block diagram always execute in the same thread? Is there any official documentation on the multithreading rules?


It depends on whether the node is marked as asynchronous or not. There's a (read-only) brown scripting property node for that. Four parallel Wait (ms) nodes execute in parallel, because Wait (ms) is asynchronous.

 

I think/expect this creates parallel clumps, and those clumps are divided evenly over threads. So how work is divided over threads might be decided at run time. But if there's only one clump, it's one thread for sure.

Message 9 of 10

btw, see this. It's the same situation, but for .NET nodes...

Message 10 of 10