04-18-2018 10:39 AM
It would help to design an artificial dataset with only a few dozen points that completes in an instant but allows testing of correct functionality.
One thing you constantly do is deal with distances in 2D, so instead of arrays of clusters of two DBLs, you could use plain 1D complex arrays (CDB) for the points. Now the distance calculation is trivial: just subtract two complex values, then take the absolute value to get r. (No squares, no roots, simpler datatypes!) I did a very quick literal rewrite yesterday and got about a 2-3x speedup. (Sorry, I don't currently have access to that computer.)
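In textual form (a quick Python sketch standing in for the LabVIEW diagram, since complex math behaves the same way):

```python
# Each 2D point is stored as a single complex number: x + y*1j.
p = complex(1.0, 2.0)   # point at (1, 2)
q = complex(4.0, 6.0)   # point at (4, 6)

# Euclidean distance is just the magnitude of the difference:
# no explicit squaring or square roots needed.
r = abs(q - p)          # 5.0
```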
04-18-2018 03:20 PM
@altenbach wrote:
I did a very quick literal rewrite yesterday and got about a 2-3x speedup. (Sorry, I don't currently have access to that computer).
OK, here's my code (LabVIEW 2017, sequential and parallel versions).
I don't know how the memory use differs.
Except for the conversion to complex, I stayed close to your code. I am sure the inner FOR loop could be optimized further to operate more in-place, but it would need more code.
Here's a picture of the parallel version
04-18-2018 04:53 PM
@altenbach wrote:
I am sure the inner FOR loop could be optimized further to operate more in-place, but it would need more code.
This is actually quite simple to do in-place. Down to 4 seconds: 10x faster than before and >100x faster than the original code. I will post it tomorrow to give others a stab at it. 😧
04-18-2018 07:50 PM - edited 04-18-2018 08:09 PM
Thanks a ton. This is great.
I worked earlier on some of these suggestions and here is what I got.
On my machine my code runs in ~17 seconds with the random data, and about 14 seconds with the real data. I am not sure why the real data is faster (it is highly ordered to begin with, so maybe it doesn't need to swap as many values?).
Your parallel code runs in ~30 seconds on my machine, so my attempt at in-placeness was about 2x faster than your old method, but still a lot slower than your promised 4 seconds. I can't wait to see how you did it. I love the divide-by-two and multiply-by-two you have, by the way.
I ran the profiler on our code, and according to it your version used about 18% more memory than I was seeing before, so not huge. Mine was just over 50% of the previous. I hope your 4-second version is similarly light on memory, but 18% more RAM isn't bad for the speedup.
I changed my algorithm a bit in a way that doesn't affect run time much but I think gives a slightly better solution for real data. Instead of starting from (0,0), I start with the first array element. Also, I realized that with the in-place method I don't have to run the first or last inner loop iterations: the first because I am starting with that element, so I just leave it; the last because when I run the second-to-last iteration I actually fix the last two points' order. Roughly, it works like the sketch below.
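Here is roughly what my version does, as a Python-style sketch (hypothetical names; the real code is a LabVIEW diagram with the A and B points in two complex arrays):

```python
def greedy_sort_inplace(a, b):
    """Reorder (A, B) point pairs in place: repeatedly swap the pair
    whose A point is closest to the current B point into the next slot.
    a and b are parallel lists of complex points (starts and ends)."""
    n = len(a)
    # Skip the first search (pair 0 is the fixed starting element) and the
    # last one (placing the second-to-last pair also fixes the last pair).
    for i in range(n - 2):
        # Find the nearest remaining A point in the unsorted tail.
        j = min(range(i + 1, n), key=lambda k: abs(a[k] - b[i]))
        # Swap pair j into position i + 1; no array copies needed.
        a[i + 1], a[j] = a[j], a[i + 1]
        b[i + 1], b[j] = b[j], b[i + 1]
```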
One significant improvement in speed was turning off debug mode: a 3-4x speedup. I was not completely surprised by this, but it got me thinking. Does built code behave differently with debugging allowed or not? Does it slow down significantly as well? If so, what is the added benefit of building that way? How do you utilize run-time debugging on built LabVIEW code?
Thanks again!
04-19-2018 08:33 AM - edited 04-19-2018 08:37 AM
It was a real eye-opener that so much speedup resulted from the Complex # representation of coordinates instead of a cluster of x and y Reals.
I also eagerly await altenbach's big reveal. He got to about a 90x improvement, while I topped out at just over 20x after using his Complex # representation suggestion (~325 seconds down to 15). Our gains from For Loop parallelization were pretty similar; if anything, mine might have been a bit better, meaning that the *rest* of my algorithm was worse by a greater margin.
Meanwhile, you may want to look at a more sophisticated, less brute-force method. (I haven't actually modified it to fit your problem yet, and I don't know whether it was designed to scale to the much larger # of points you're dealing with. But I know the guy who writes the blog is the real deal, so I'm optimistic that it's at least worth a look.)
-Kevin P
P.S. Your random initialization isn't actually right, though since all the coords are random anyway it probably doesn't matter a whole lot. Instead of building up "previous pt B to next pt A" distances, you're building up "pt B to pt B" distances.
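In Python-style shorthand (with a and b as hypothetical arrays of each pair's start and end points, and n their length):

```python
# Intended: accumulate the jump from each pair's end to the next pair's start.
init_dist = sum(abs(a[i + 1] - b[i]) for i in range(n - 1))   # prev B -> next A

# What the code builds up instead:
wrong_dist = sum(abs(b[i + 1] - b[i]) for i in range(n - 1))  # pt B -> pt B
```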
04-19-2018 09:42 AM
@Kevin_Price wrote:
P.S. Your random initialization isn't actually right, though since all the coords are random anyway it probably doesn't matter a whole lot. Instead of building up "previous pt B to next pt A" distances, you're building up "pt B to pt B" distances.
Good catch. It would only make a difference if the algorithm decided not to replace because the candidate was not any better, which is highly unlikely in this case.
04-19-2018 10:46 AM
Sorry, I don't have the code on my current laptop, but you have now changed the actual data representation to two complex arrays, which makes it directly parallelizable. I was still operating on the interlaced single complex array, as in the original code. Of course, the fast time was on a 16-core Xeon.
The new data format might also help my algorithm. I was mostly focused on doing the inner search steps in-place.
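For illustration (Python slicing as shorthand; names hypothetical), the relationship between the two layouts is just:

```python
# Interlaced single complex array, as in the original code:
# pts = [A0, B0, A1, B1, A2, B2, ...]
a = pts[0::2]   # every other element starting at 0: all A points
b = pts[1::2]   # every other element starting at 1: all B points
```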
And yes, the first thing I did was create a small constant input array to verify the algorithm. It is easy to accidentally change the outcome with a coding mistake, and you would never notice with the random data.
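For example, something like this (reusing the hypothetical greedy_sort_inplace sketch posted above) fails loudly the moment a coding mistake changes the outcome:

```python
# Three collinear pairs whose correct order is obvious by inspection:
a = [0+0j, 5+0j, 2+0j]   # start points
b = [1+0j, 6+0j, 3+0j]   # end points

greedy_sort_inplace(a, b)
assert a == [0+0j, 2+0j, 5+0j]   # pairs must come out left-to-right
assert b == [1+0j, 3+0j, 6+0j]
```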
04-19-2018 11:24 AM
I would be more interested in maintaining the original data type if it runs fast. I am surprised that your parallel code seemed to run a bit faster on my computer, which has only 2 real cores and 4 virtual (i7-4600M) and 16 GB of RAM.
I am assuming your reported times are the indicator output and not including the random number generation portion?
04-19-2018 11:30 AM
OK, got the very rough code draft.
I recently got a Ryzen 1700X and it completes it in slightly more than 5 seconds. These new AMDs are great! 😄
I kept it relatively basic for simplicity, and some things could be tweaked and optimized a little. It may need a +1 or -1 here or there.... Note that there is probably quite a bit of slack left, but this should get you started.
The way it does the in-place search/sort is: for each position, search the remaining tail of the array for the closest point, then swap that element into the next slot, so the reordering happens entirely within the original array.
I have a few ideas to optimize further... 😄
04-19-2018 11:40 AM
@tshurtz wrote:
I am assuming your reported times are the indicator output and not including the random number generation portion?
I can try testing again later. My timing is from the indicator and does not include the data generation. It does include the final conversion from complex array to array of points, which it probably should not. 😄