04-18-2018 10:39 AM
It would help to design an artificial dataset with only a few dozen points that completes in an instant but allows testing of correct functionality.
One thing you constantly do is deal with distances in 2D, so instead of arrays of clusters of two DBLs, you could use plain 1D complex arrays (CDB) for the points. Now the distance calculation is trivial: just subtract two complex values, then take the absolute value to get r. (No squares, no roots, simpler datatypes!) I did a very quick literal rewrite yesterday and got about a 2-3x speedup. (Sorry, I don't currently have access to that computer.)
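In textual form (a quick Python sketch standing in for the LabVIEW diagram, since complex math behaves the same way):

```python
# Each 2D point is stored as a single complex number: x + y*1j.
p = complex(1.0, 2.0)   # point at (1, 2)
q = complex(4.0, 6.0)   # point at (4, 6)

# Euclidean distance is just the magnitude of the difference:
# no explicit squaring or square roots needed.
r = abs(q - p)          # 5.0
```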
04-18-2018 03:20 PM
@altenbach wrote:
I did a very quick literal rewrite yesterday and got about a 2-3x speedup. (Sorry, I don't currently have access to that computer).
OK, here's my code (LabVIEW 2017, sequential and parallel versions).
I don't know how the memory use differs.
Except for the conversion to complex, I stayed close to your code. I am sure the inner FOR loop could be optimized further to operate more in-place, but it would need more code.
Here's a picture of the parallel version
04-18-2018 04:53 PM
@altenbach wrote:
I am sure the inner FOR loop could be optimized further to operate more in-place, but it would need more code.
This is actually quite simple to do in-place. Down to 4 seconds: 10x faster than before and >100x faster than the original code. I will post it tomorrow to give others a stab at it. 😧
04-18-2018 07:50 PM - edited 04-18-2018 08:09 PM
Thanks a ton. This is great.
I worked earlier on some of these suggestions and here is what I got.
On my machine my code runs in ~17 seconds with the random data, and about 14 seconds with the real data. I am not sure why the real data is faster (it is highly ordered to begin with, so maybe it doesn't need to swap as many values?).
Your parallel code runs in ~30 seconds on my machine, so my attempt at in-placeness was about 2x faster than your old method, but still a lot slower than your promised 4 seconds. I can't wait to see how you did it. I love the divide-by-two and multiply-by-two you have, by the way.
I ran the profiler on our code, and according to it your version used about 18% more memory than I was seeing before, so not huge. Mine was just over 50% of the previous. I hope your 4-second version is similarly light on memory, but 18% more RAM isn't bad for the speedup.
I changed my algorithm a bit in a way that doesn't affect run time much but I think gives a slightly better solution for real data. Instead of starting from (0,0), I start with the first array element. Also, I realized that with the in-place method I don't have to run the first or last inner loop iterations: the first because I am starting with that element, so I just leave it; the last because when I run the second-to-last iteration I actually fix the last two points' order. Roughly, it works like the sketch below.
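Here is roughly what my version does, as a Python-style sketch (hypothetical names; the real code is a LabVIEW diagram with the A and B points in two complex arrays):

```python
def greedy_sort_inplace(a, b):
    """Reorder (A, B) point pairs in place: repeatedly swap the pair
    whose A point is closest to the current B point into the next slot.
    a and b are parallel lists of complex points (starts and ends)."""
    n = len(a)
    # Skip the first search (pair 0 is the fixed starting element) and the
    # last one (placing the second-to-last pair also fixes the last pair).
    for i in range(n - 2):
        # Find the nearest remaining A point in the unsorted tail.
        j = min(range(i + 1, n), key=lambda k: abs(a[k] - b[i]))
        # Swap pair j into position i + 1; no array copies needed.
        a[i + 1], a[j] = a[j], a[i + 1]
        b[i + 1], b[j] = b[j], b[i + 1]
```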
One significant improvement in speed was turning off debug mode: a 3-4x speedup. I was not completely surprised by this, but it got me thinking. Does built code behave differently with debugging allowed or not? Does it slow down significantly as well? If so, what is the added benefit of building that way? How do you utilize run-time debugging on built LabVIEW code?
Thanks again!
04-19-2018 08:33 AM - edited 04-19-2018 08:37 AM
It was a real eye-opener that so much speedup resulted from the Complex # representation of coordinates instead of a cluster of x and y Reals.
I also eagerly await altenbach's big reveal. He got to about a 90x improvement, while I topped out at just over 20x after using his Complex # representation suggestion (~325 seconds down to 15). Our gains from For Loop parallelization were pretty similar; if anything, mine might have been a bit better, meaning that the *rest* of my algorithm was worse by a greater margin.
Meanwhile, you may want to look at a more sophisticated, less brute-force method. (I haven't actually modified it to fit your problem yet, and I don't know whether it was designed to scale to the much larger # of points you're dealing with. But I know the guy who writes the blog is the real deal, so I'm optimistic that it's at least worth a look.)
-Kevin P
P.S. Your random initialization isn't actually right, though since all the coords are random anyway it probably doesn't matter a whole lot. Instead of building up "previous pt B to next pt A" distances, you're building up "pt B to pt B" distances.
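In Python-style shorthand (with a and b as hypothetical arrays of each pair's start and end points, and n their length):

```python
# Intended: accumulate the jump from each pair's end to the next pair's start.
init_dist = sum(abs(a[i + 1] - b[i]) for i in range(n - 1))   # prev B -> next A

# What the code builds up instead:
wrong_dist = sum(abs(b[i + 1] - b[i]) for i in range(n - 1))  # pt B -> pt B
```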
04-19-2018 09:42 AM
@Kevin_Price wrote:
P.S. Your random initialization isn't actually right, though since all the coords are random anyway it probably doesn't matter a whole lot. Instead of building up "previous pt B to next pt A" distances, you're building up "pt B to pt B" distances.
Good catch. It would only make a difference if the algorithm decided not to replace because the candidate was not any better, which is highly unlikely in this case.
04-19-2018 10:46 AM
Sorry, I don't have the code on my current laptop, but you have now changed the actual data representation to two complex arrays, which makes it directly parallelizable. I was still operating on the interlaced single complex array, as in the original code. Of course, the fast time was on a 16-core Xeon.
The new data format might also help my algorithm. I was mostly focused on doing the inner search steps in-place.
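For illustration (Python slicing as shorthand; names hypothetical), the relationship between the two layouts is just:

```python
# Interlaced single complex array, as in the original code:
# pts = [A0, B0, A1, B1, A2, B2, ...]
a = pts[0::2]   # every other element starting at 0: all A points
b = pts[1::2]   # every other element starting at 1: all B points
```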
And yes, the first thing I did was create a small constant input array to verify the algorithm. It is easy to accidentally change the outcome with a coding mistake, and you would never notice with the random data.
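For example, something like this (reusing the hypothetical greedy_sort_inplace sketch posted above) fails loudly the moment a coding mistake changes the outcome:

```python
# Three collinear pairs whose correct order is obvious by inspection:
a = [0+0j, 5+0j, 2+0j]   # start points
b = [1+0j, 6+0j, 3+0j]   # end points

greedy_sort_inplace(a, b)
assert a == [0+0j, 2+0j, 5+0j]   # pairs must come out left-to-right
assert b == [1+0j, 3+0j, 6+0j]
```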
04-19-2018 11:24 AM
I would be more interested in maintaining the original data type if it runs fast. I am surprised that your parallel code seemed to run a bit faster on my computer, which has only 2 real cores and 4 virtual (i7-4600M) and 16 GB of RAM.
I am assuming your reported times are the indicator output and not including the random number generation portion?
04-19-2018 11:30 AM
OK, got the very rough code draft.
I recently got a Ryzen 1700X and it completes it in slightly more than 5 seconds. These new AMDs are great! 😄
I kept it relatively basic for simplicity, and some things could be tweaked and optimized a little. It may need a +1 or -1 here or there.... Note that there is probably quite a bit of slack left, but this should get you started.
The way it does the in-place search/sort is: for each position, search the remaining tail of the array for the closest point, then swap that element into the next slot, so the reordering happens entirely within the original array.
I have a few ideas to optimize further... 😄
04-19-2018 11:40 AM
@tshurtz wrote:
I am assuming your reported times are the indicator output and not including the random number generation portion?
I can try testing again later. My timing is from the indicator and does not include the data generation. It does include the final conversion from complex array to array of points, which it probably should not. 😄