LabWindows/CVI


matrix performance

Solved!
Go to solution

Hello,

 

so far I have used the notation matrix[i][k] in 2D array calculations. Now I came across a different notation, which replaces the 2D array by a 1D array and accordingly addresses the individual elements via matrix[i * j + k].
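Just to make sure the notation is clear, here is a tiny sketch of what I mean (ROWS and COLS are placeholder sizes, not my real dimensions):

    #include <stdio.h>
    #include <stdlib.h>

    #define ROWS 4
    #define COLS 3

    int main (void)
    {
        double a2d[ROWS][COLS];                               /* true 2D array                     */
        double *a1d = malloc (ROWS * COLS * sizeof (double)); /* flat 1D replacement               */
        int i, k;

        for (i = 0; i < ROWS; i++)
            for (k = 0; k < COLS; k++) {
                a2d[i][k]         = i * 10.0 + k;
                a1d[i * COLS + k] = i * 10.0 + k;             /* same element, offset done by hand */
            }

        printf ("%g %g\n", a2d[2][1], a1d[2 * COLS + 1]);     /* both print 21                     */

        free (a1d);
        return 0;
    }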

 

I am wondering whether the second approach is more efficient. Obviously, dynamic memory allocation is simpler (and probably faster, too) in the vector case, but what about accessing the elements? Memory allocation is done only once, while in large arrays there may be millions of individual element accesses...

 

Thanks for sharing your expert knowledge!

 

Wolfgang

0 Kudos
Message 1 of 5
(3,362 Views)
Solution
Accepted by topic author Wolfgang

I'd have to think it depends upon how you use the matrix.  Locality of reference may be more important than the pointer calculation. 

 

The 2D array is stored in row-major order, so references along a row (across the columns, if you will), and to neighboring rows, are fast, with fewer cache misses.  Similarly, the 1D array is in linear order, so nearby references are going to be quicker since the nearby location is more likely to be in cache.

 

If the array access is in general randomly distributed across the matrix, I suspect it won't make much if any difference which matrix form you use; your app will spend more time moving data between main memory and the cache than it will doing the pointer calculations.

 

I implemented a 2D array and accessed it "against the grain" (along columns), and it was dramatically slower than access in row-major order.  But then, I was sequencing across all the matrix elements in every case (no random access), so it was important to get it right.  Compiler selection was important too: the Visual Studio C++ compiler was the best at the job.  The Intel C++ compiler would have been better if I had been able to use the vector instructions, but the CVI linker won't support 16-byte alignment.
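Roughly what I mean by "with" and "against" the grain, for a flat row-major matrix (just a sketch; the size is a placeholder, picked so the data is much larger than the caches):

    #define N 4096   /* placeholder size; the matrix is far larger than the caches */

    double SumWithTheGrain (const double *m)      /* row by row: walks memory sequentially */
    {
        int i, k;
        double s = 0.0;
        for (i = 0; i < N; i++)
            for (k = 0; k < N; k++)
                s += m[i * N + k];
        return s;
    }

    double SumAgainstTheGrain (const double *m)   /* column by column: strides N doubles per access */
    {
        int i, k;
        double s = 0.0;
        for (k = 0; k < N; k++)
            for (i = 0; i < N; i++)
                s += m[i * N + k];
        return s;
    }

Same arithmetic in both functions; only the traversal order, and therefore the cache behaviour, differs.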

 

If you are doing full-matrix operations, sequencing through all the matrix elements, you've got some other options for performance.  If your operation can be partitioned across the matrix, you can use multiple threads to process different sections of the matrix concurrently.
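For example, something along these lines (just a sketch using the CVI Utility Library thread pool; check the exact prototypes in the documentation, and error handling is omitted). Each scheduled function owns its own band of rows, so the workers never touch the same data:

    #include <utility.h>

    #define ROWS    25000
    #define COLS    13
    #define NCHUNKS 4

    typedef struct {
        double *data;       /* flat row-major matrix          */
        int     firstRow;   /* first row this worker owns     */
        int     lastRow;    /* one past the last row it owns  */
    } Chunk;

    static int CVICALLBACK ScaleChunk (void *functionData)
    {
        Chunk *c = (Chunk *)functionData;
        int i, k;
        for (i = c->firstRow; i < c->lastRow; i++)
            for (k = 0; k < COLS; k++)
                c->data[i * COLS + k] *= 2.0;     /* stand-in for the real work */
        return 0;
    }

    void ScaleMatrixInParallel (double *data)
    {
        Chunk chunks[NCHUNKS];
        CmtThreadFunctionID ids[NCHUNKS];
        int n, rowsPerChunk = ROWS / NCHUNKS;

        for (n = 0; n < NCHUNKS; n++) {
            chunks[n].data     = data;
            chunks[n].firstRow = n * rowsPerChunk;
            chunks[n].lastRow  = (n == NCHUNKS - 1) ? ROWS : (n + 1) * rowsPerChunk;
            CmtScheduleThreadPoolFunction (DEFAULT_THREAD_POOL_HANDLE, ScaleChunk,
                                           &chunks[n], &ids[n]);
        }
        for (n = 0; n < NCHUNKS; n++) {
            CmtWaitForThreadPoolFunctionCompletion (DEFAULT_THREAD_POOL_HANDLE, ids[n], 0);
            CmtReleaseThreadPoolFunctionID (DEFAULT_THREAD_POOL_HANDLE, ids[n]);
        }
    }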

 

I'm assuming when you say matrix you mean a C array of some numeric type, rather than some special data structure CVI supports - come to think of it, CVI does support matrices - but these may be simple C arrays under the hood anyway. 

 

 

Message 2 of 5
(3,349 Views)

Thank you for your thoughts!

 

This 'optimization' stuff is pretty new to me, but the 2D arrays I am dealing with may be as large as 25000 x 13 elements, so I considered it worthwhile to spend a few minutes on it... I have to repeatedly transpose such an array and multiply it with another matrix, so it involves quite a few accesses. So far my matrices are stored using the normal C convention, i.e. in row-major order.
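For concreteness, the core operation is roughly C = A' * B. As far as I understand, it can be accumulated row by row, which keeps both matrices walked in row-major order and avoids building the transposed copy at all (a sketch; M is just a placeholder for the other matrix's column count):

    /* C = A' * B for A (ROWS x COLS) and B (ROWS x M), all stored row-major.
       With the row index i outermost, A and B are both walked sequentially,
       so no explicit transposed copy of A is needed. */
    #define ROWS 25000
    #define COLS 13
    #define M    13

    void ATransposeTimesB (const double *A, const double *B, double *C)
    {
        int i, j, k;

        for (j = 0; j < COLS; j++)          /* clear the COLS x M result */
            for (k = 0; k < M; k++)
                C[j * M + k] = 0.0;

        for (i = 0; i < ROWS; i++)          /* one sequential pass over the rows of A and B */
            for (j = 0; j < COLS; j++)
                for (k = 0; k < M; k++)
                    C[j * M + k] += A[i * COLS + j] * B[i * M + k];
    }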

 

Digging into it, I came across optimizations for 'small' and 'large' data sets, for the same reason you mention: cache size. So far, however, I am not sure what counts as 'small': are you talking about the ~4k L1 cache or the ~1M L2 cache? Certainly the data will not fit into L1...

 

Thanks again,

Wolfgang

 

 

0 Kudos
Message 3 of 5
(3,346 Views)

Hello menchar -

 

I just thought that I'd mention that as of LabWindows/CVI 2010, the Intel vector instructions should work without problem for you.  If you're able to try it, I'd love to hear how it works for you.

 

NickB

National Instruments

0 Kudos
Message 4 of 5
(3,332 Views)

Nick -

 

That's good news; the newer Intel uPs can do double-precision operations concurrently with the vector instructions - Intel's contribution to DSP on a general-purpose uP, it seems.  I had asked for 16-byte alignment, thanks for doing it.  That project has since been completed; as it was, I was able to reduce the execution time of one of our algorithms (which runs on 4 GB data sets) from 2 hours to 2 minutes.  Main memory size, system backing file size, and disk performance were also critical.  We were using a RAID but, counterintuitively, I had to reduce the amount of data I read into memory from disk at any one time, to prevent the OS from simply spooling the just-read data right back out to the backing file on disk due to main memory limitations.  I think I would have been able to shave another 30 - 40 seconds off with the vector instructions, and had I used multiple threads to process the data it would have been much faster.
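For anyone curious, the kind of thing the vector instructions buy you is two doubles per operation, roughly like this (a generic SSE2 intrinsics sketch, not CVI-specific; the aligned loads are why the 16-byte alignment matters):

    #include <emmintrin.h>

    /* Add two double arrays two elements at a time. _mm_load_pd / _mm_store_pd
       require 16-byte aligned data, which is where the linker support comes in. */
    void AddArraysSSE2 (const double *a, const double *b, double *out, int n)
    {
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            __m128d va = _mm_load_pd (&a[i]);              /* two doubles per register */
            __m128d vb = _mm_load_pd (&b[i]);
            _mm_store_pd (&out[i], _mm_add_pd (va, vb));
        }
        for (; i < n; i++)                                 /* scalar tail              */
            out[i] = a[i] + b[i];
    }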

 

I had said 2D arrays, but now that I think about it, they were 3D arrays (data cubes, if you will).  And Wolfgang may not have his data on disk but in main memory already; newer MBs can host 100 GB or more of physical memory.

 

The MBs / chipsets / uPs are all a bit different when it comes to cache size and memory access latency.  But the whole scheme is set up to support locality of reference, i.e. it anticipates that the next memory access will be close to the last one most of the time.  This makes more sense perhaps with non-OO languages, but then, that's what CVI is.

 

The pricier Intel uPs have more cores and bigger L1 and L2 caches.  I think the bigger L2 caches are 8 MB or so.  Some Intel uPs have large L3 caches too.

 

I suspect the cost of a main memory access to/from the L2 cache is much higher than that of L1 - L2 accesses, though I suppose they are overlapped to some extent.  I think, Wolfgang, you'd just have to experiment, but there may not be much you can do, unfortunately, if you have essentially random accesses over a large array: caching won't help because you'll be trying to access data that's too far away to have been previously brought into cache.

 

And if you have multiple threads / cores working on the same data, you can get cache coherence issues that will slow everything down too, unless you can keep them out of each other's hair (i.e. the data can be partitioned).
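The classic gotcha there is false sharing; one way to keep the threads apart is to give each one its own accumulator padded out to a full cache line (a sketch; 64 bytes is a typical line size, check the actual hardware):

    /* Per-thread partial sums padded to a full cache line so two cores never
       write into the same line. Each thread t sums only its own band of rows;
       the main thread adds the partials together afterwards. */
    #define CACHE_LINE 64
    #define NTHREADS   4

    typedef struct {
        double sum;
        char   pad[CACHE_LINE - sizeof (double)];   /* keep neighbours on separate lines */
    } PaddedSum;

    static PaddedSum partial[NTHREADS];

    void SumBand (const double *m, int nCols, int firstRow, int lastRow, int t)
    {
        int i, k;
        double s = 0.0;
        for (i = firstRow; i < lastRow; i++)
            for (k = 0; k < nCols; k++)
                s += m[i * nCols + k];
        partial[t].sum = s;     /* one write per thread, each to its own cache line */
    }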

 

The debate of course for the future is general-purpose uPs vs. high-core-count GPUs.  But the GPUs are designed to crunch through data sequentially, I think, as in video rendering where you're doing every pixel in a frame of video.

 

With DSP programming, we used to try to localize the processing and get the relevant data into SRAM cache, which really speeds things up, and flow the data in the same direction across the DSP - but, again, this was all based on locality of reference.

 

 

 

 

0 Kudos
Message 5 of 5
(3,323 Views)