OK, I quickly looked at it again over my morning coffee and there are two improvements that give at least 20% combined speedup.
(1) Major: Change the indexing in the inner loop to avoid the second indexing of the rotated array(see attached).
(2) A minor improvement seems to be gained if the inverse tangent is done after the transposition (saving one transpose operation).
Since these arrays are only for display, you might ask yourself if it is sufficient to do all calculations in single precision (SGL).
On my rig, it gives another 40% speedup and you still get about 6 good significant digits. This is plenty for e.g. 16bit grayscale!
🙂The attached modification implements the above mentioned two improvements but is also set to SGL. It can easily be switched back to DBL by changing the representation of the single diagram constant (zero) in the outer loop and the representation of the array indicators.
Combined, the attached solution in SGL mode is about 2.25x faster that my earlier solution above. I am sure quite a few more things could be tweaked to improve the speed.