04-13-2018 12:46 PM
How is an array of clusters of arrays smaller than a single array of values and an array of lengths?
The recommendation isn't only about the total memory consumed by the data. There are other advantages to the "array of clusters containing arrays of data."
Internally, this will probably be stored as something like a single array of pointers, with each pointer pointing to a distinct individual array. (Possibly there's another pointer layer for the cluster container, but I'm guessing LabVIEW can optimize that away.) As a result, you no longer need a single big block of *contiguous* memory to hold all the data. Instead you have a whole bunch of smallish, independent chunks. *THIS* factor is quite likely to be a *big* help for the problem where you say, "if I am not careful I crash the program because the computer can’t allocate enough to Labview."
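As a loose analogy outside LabVIEW (a Python sketch of the memory-layout idea, not anything from the actual VIs): the "flat array plus lengths" form needs one large contiguous buffer, while the "array of clusters of arrays" form is a list of small independent allocations.

```python
from array import array

# Flat form: one contiguous block of doubles plus a separate lengths array.
# This is the form that needs a single large allocation for all the data.
lengths = [3, 2, 4]
flat = array('d', [1, 2, 3, 10, 20, 100, 200, 300, 400])

# "Array of clusters of arrays" analogue: a list of independent small
# arrays. Each layer lives in its own allocation, so no single huge
# contiguous block is ever required.
jagged = []
start = 0
for n in lengths:
    jagged.append(array('d', flat[start:start + n]))
    start += n
```

Each element of `jagged` can now grow, shrink, or be processed on its own without touching the others.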
I can't speak for the LabVIEW compiler internals, but the processing you show *seems* sufficiently clear for LabVIEW to work in place on the arrays already, so I wouldn't count on an efficiency improvement from in-placeness alone. However, it's at least conceivable to me that an "array of clusters of arrays" might more easily benefit from For Loop parallelization (... and reviewing the thread, it looks like GerdW had the same thought first in msg #2).
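To make the parallelization point concrete with a Python analogy (again, just an illustration, not LabVIEW internals): once each layer is an independent array, per-layer work can be handed to separate workers with no shared state, which is essentially what a parallel For Loop does with an array of clusters.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list is an independent "layer"; workers can process
# them concurrently because no layer touches another's data.
layers = [[3.0, 1.0, 2.0], [9.0, 7.0], [5.0, 4.0, 6.0]]

with ThreadPoolExecutor() as pool:
    sorted_layers = list(pool.map(sorted, layers))
```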
-Kevin P
04-13-2018 12:54 PM
@altenbach wrote:
(I have not benchmarked if parallelization really gives a speedup. Not sure about the overhead of that concatenating tunnel ;))
I get about a 3-4x speedup comparing a parallel FOR loop with a regular FOR loop using the otherwise identical above code.
04-13-2018 01:30 PM - edited 04-13-2018 01:38 PM
altenbach, the real data is not an array of doubles, but rather an array of x-y points (clusters of 2 doubles each), and to make things more complex they are actually point pairs, alternating point a1, point b1, point a2, point b2, etc. Each sub-array is a layer of points and I want to sort these only within a layer. The internals of the loop are not a simple sort but a rather complex one: I start with b1 and find the next closest a point in x-y space (so min(sqrt((Axn-Bx1)^2+(Ayn-By1)^2))). I loop through starting with an input point pair (a, b) and look for the next point pair whose a point is the shortest distance from the current b point. I remove the point pair found from possible future solutions, then repeat until I run out of point pairs. All of the data goes back into the outermost array in the same memory locations, just rearranged.
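The greedy ordering described above could be sketched like this in Python (my own names and structure, not the poster's VI; just to pin down the algorithm): seed with one pair, then repeatedly pull the remaining pair whose a point is nearest the current b point.

```python
import math

def order_pairs(pairs):
    """pairs: list of ((ax, ay), (bx, by)) tuples.
    Returns the pairs reordered by greedy nearest-neighbor chaining."""
    remaining = list(pairs)
    chain = [remaining.pop(0)]          # seed with the first pair
    while remaining:
        bx, by = chain[-1][1]           # b point of the last chosen pair
        # index of the pair whose 'a' point is nearest to (bx, by)
        i = min(range(len(remaining)),
                key=lambda k: math.hypot(remaining[k][0][0] - bx,
                                         remaining[k][0][1] - by))
        chain.append(remaining.pop(i))  # remove it from future solutions
    return chain
```

Note the `min(...)` over the shrinking `remaining` pool makes this O(n^2) per layer, which is part of why hundreds of layers with thousands of points take minutes.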
I think there will be a benefit from parallelization because I am running hundreds of these layers with thousands of points per layer; it takes a few minutes to run normally. I have 4 cores on some platforms, 8 on others.
This is not the only large set of data that I am hauling around in my program and not the only thing pushing me over memory limits. But I am often skirting the limits with any of these large data sets.
I have included all of the code I am actually running in this process so you can see the whole thing and tell me what you think. It's the outer loop I was specifically looking to make parallel, but any improvements would be welcome.
I am running LV 2017, by the way, in case that changes which tools I might be able to use.
Thanks again
04-13-2018 03:37 PM
Just to make sure we are on the same page: your complicated code with the pyramid of IPEs (with parallelism removed) performs nearly identically (within a few percent) to my much simpler version (also with parallelism removed).
04-13-2018 04:02 PM
@Altenbach - I believe you are correct.
I modified my original code to get rid of parallelization and remove some of the IPE structures.
See below.
AnotherTry.vi
I also modified Altenbach's Code to include a sort array.
See below.
ParallelTest.vi
I did not extensively test or benchmark, but attached are the results from the VI Profiler. The IPE saves memory, if that is important to the OP.
mcduff
04-13-2018 04:08 PM
@mcduff wrote:
I also modified Altenbach's Code to include a sort array.
What was wrong with the "sort array" that was already there??? 😮
04-13-2018 04:13 PM
@Altenbach - Sorry, I am quite tired and did not see the sort array in your original image, much like missing the gorilla walking past the kids bouncing the ball.
mcduff
04-13-2018 04:28 PM
Also, your "another try" performs nearly identically to my sequential version. My parallel version is still 3-4x faster.
04-13-2018 04:35 PM
What about memory usage? You've got to throw me some table scraps.
mcduff
04-13-2018 05:05 PM
Well, I don't know, but it seems the IPE is not as optimized as you think. For example, if you unroll the loop completely as follows (still the same result), it does not speed up over your code, even though all sorting operations can execute in parallel (... and I am currently on a dual Xeon with 16 hyperthreaded cores, i.e. 32 virtual cores).
I am not sure about memory use but the LabVIEW compiler is typically good at keeping things in-place as much as possible, even without the IPE structure.
(Of course this is not really scalable, just wired up as a test scenario)