02-03-2019 01:28 PM - edited 02-03-2019 01:34 PM
Hello,
Do you have any suggestions for improving the code below, which computes unit vectors?
i.e. unit_V = V/norm(V)
In this code, each row of the array (with random values) represents the 3 components x-y-z of a vector to normalize.
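For readers without LabVIEW, the operation can be sketched in text form. This is only a minimal Python illustration of unit_V = V/norm(V) applied row by row, not the attached VI; the function name `normalize_rows` and the handling of zero-length rows are assumptions made for the example.

```python
import math

def normalize_rows(vectors):
    """Return a new list in which each [x, y, z] row is scaled to unit length."""
    result = []
    for x, y, z in vectors:
        norm = math.sqrt(x * x + y * y + z * z)
        if norm == 0.0:
            # Assumption: a zero-length row is passed through unchanged
            # to avoid dividing by zero.
            result.append([x, y, z])
        else:
            inv = 1.0 / norm  # one division, then three multiplications
            result.append([x * inv, y * inv, z * inv])
    return result

unit = normalize_rows([[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]])
```

Computing the reciprocal once and multiplying three times is the same idea as the reciprocal square root mentioned in this post.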
My first idea was to use another implementation of the reciprocal square root (https://forums.ni.com/t5/LabVIEW/Fast-Reciprocal-Square-Root-with-Labview/m-p/3889359), but it is not faster in LabVIEW.
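For context, the widely known single-precision bit trick for the reciprocal square root looks roughly like the sketch below. It is written in Python purely for illustration and is not necessarily the implementation from the linked thread; the function name `fast_rsqrt` is an assumption. It only makes sense for positive, normal 32-bit floats.

```python
import struct

def fast_rsqrt(x):
    """Approximate 1/sqrt(x) for positive x using the 32-bit float bit trick."""
    # Reinterpret the single-precision bit pattern of x as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The magic constant minus half the bit pattern gives a rough first guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess.
    return y * (1.5 - 0.5 * x * y * y)

approx = fast_rsqrt(4.0)  # close to 0.5, but not exact
```

The result is only an approximation (a fraction of a percent of error after one refinement step), so it trades accuracy for speed, and as noted above it brought no speedup in LabVIEW.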
02-03-2019 01:35 PM
Attached is the algorithm that I am trying to improve.
02-03-2019 03:21 PM
Looks pretty good.
(One thing I would change: don't wire the P terminal and don't count the CPUs. Same thing, less code ;))
02-04-2019 03:58 AM
02-04-2019 07:24 AM - edited 02-04-2019 07:40 AM
The In Place Element Structure inside the parallelized loop is 10 times faster. That's not surprising.
If I don't connect the P terminal, the performance is significantly degraded. I don't understand why, although I configured the loop parallelism on "Automatically partition iterations". I noted this behaviour many times, so I always connect the terminal.
Thanks a lot for your help. I had hoped for a better solution, because I have to process around 2*10^7 vectors for vision software with a "Live" mode.
02-04-2019 11:29 AM
@Ubik) wrote:
The In Place Element Structure inside the parallelized loop is 10 times faster. That's not surprising.
Yes, the loop-free version without the IPE always seems much slower. Another slow solution is "unit vector", which has the advantage of very simple code (it does much more, so it will be slow).
@Ubik) wrote:
If I don't connect the P terminal, the performance is significantly degraded. I don't understand why, although I configured the loop parallelism on "Automatically partition iterations". I noted this behaviour many times, so I always connect the terminal.
This has not been my experience. Unwired should be identical according to the help.
@Ubik) wrote:
Thanks a lot for your help. I had hoped for a better solution, because I have to process around 2*10^7 vectors for vision software with a "Live" mode.
Well, calculate how fast it is per vector. Does it really need to be DBL?
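To make both questions concrete, here is a small back-of-the-envelope sketch in Python. The vector count comes from the thread; the 100 ms processing time is purely a hypothetical placeholder (no timing is given here), and the names are made up for the example.

```python
N_VECTORS = 2 * 10**7   # vector count quoted in the thread
COMPONENTS = 3          # x, y, z
BYTES_DBL = 8           # double precision (DBL)
BYTES_SGL = 4           # single precision (SGL)

def array_megabytes(bytes_per_value):
    """Size of the whole N x 3 array in MB (10^6 bytes)."""
    return N_VECTORS * COMPONENTS * bytes_per_value / 1e6

def ns_per_vector(total_seconds):
    """Average time per vector in nanoseconds for a given total run time."""
    return total_seconds / N_VECTORS * 1e9

dbl_mb = array_megabytes(BYTES_DBL)   # 480 MB as DBL
sgl_mb = array_megabytes(BYTES_SGL)   # 240 MB as SGL
budget_ns = ns_per_vector(0.1)        # hypothetical 100 ms total -> about 5 ns per vector
```

Switching from DBL to SGL halves the amount of memory that has to be read and written, which matters when the per-vector budget is only a few nanoseconds.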
04-04-2019 06:53 PM - last edited on 04-05-2019 10:47 AM by Kristi_Martinez
The performance discrepancy described here between the ParFor implementation and the "vectorized" (or "polymorphic primitive") implementation was brought to my attention and here is my explanation about what is going on:
TL;DR: the ParFor is faster because it manages to avoid copying the contents of the 2-D array. The vectorized implementation suffers from two copies: first stripping the columns out into three 1-D arrays (contiguous pieces of memory), then reconstructing the 2-D array after the vectorized operations.
Details: I hope the annotations on these pictures will suffice for the details. You will see in the vectorized implementation that I have made the first data copies explicit with the "Always Copy" primitives. Removing these primitives does not prevent the copies; LabVIEW will just put implicit copies just before the vectorized operations. I made them explicit in frame 2 so that I could measure the cost.
I also attached the VI (v 2018).
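As a rough text analogy of the two data-access patterns described above (not a model of LabVIEW's actual memory management), the following Python sketch contrasts overwriting an interleaved buffer in place with stripping it into three column arrays and rebuilding it afterwards. The function names and the flat x0,y0,z0,x1,... layout are assumptions made for the example.

```python
from array import array
import math

def normalize_in_place(flat):
    """Overwrite each x,y,z triple of a flat buffer with its unit vector."""
    for i in range(0, len(flat), 3):
        x, y, z = flat[i], flat[i + 1], flat[i + 2]
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 0.0:
            flat[i], flat[i + 1], flat[i + 2] = x / norm, y / norm, z / norm

def normalize_via_columns(flat):
    """Same result, but with a column split first and a rebuild at the end."""
    # Copy 1: strip the columns out into three separate contiguous arrays.
    xs, ys, zs = flat[0::3], flat[1::3], flat[2::3]
    # Whole-column operations on the copies.
    norms = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(xs, ys, zs)]
    # Copy 2: rebuild the interleaved buffer from the column results.
    out = array("d")
    for x, y, z, n in zip(xs, ys, zs, norms):
        n = n if n > 0.0 else 1.0  # assumption: leave zero vectors unchanged
        out.extend((x / n, y / n, z / n))
    return out

data = array("d", [3.0, 0.0, 4.0, 0.0, 2.0, 0.0])
in_place = array("d", data)          # working copy, so data stays intact for comparison
normalize_in_place(in_place)
rebuilt = normalize_via_columns(data)
```

Both functions produce the same values; the second one simply touches every element several extra times, which is the kind of overhead the explanation above attributes to the vectorized implementation.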
04-07-2019 04:57 PM - edited 04-07-2019 05:15 PM
Since NI is looking at this, I thought I'd chime in with the "worst" solution and ask for an explanation.
It is a hybrid of the In Place Element Structure and the vector implementation. It suffers from the most buffer copies. The question is why?
Look at the screenshot for the buffer allocations: they are everywhere! Why is that? In this case, in-place is terrible; maybe pass this along to the compiler team.
2015 version attached.
A buffer dot on everything!
EDIT: Sometimes there are two buffer dots on operations, like multiply!!