06-06-2025 07:43 PM
Hi all, I'm looking for ways to speed up a pretty large array manipulation. What I need to do is:
1. Input array with 1088 rows and 524,000 columns
2. Re-order the rows within each column. In my example the order is just reversed, but the actual order is 0, 1, 544, 545, 2, 3, 546, 547 ... and so on (see the sketch after this list).
3. Concatenate all columns together.
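For reference, the order could be generated like this in C-style code (a sketch, assuming the pattern simply continues in pairs across the two halves of the 1088 rows):
#include <stdint.h>

// Sketch of the row order described above, assuming the pattern continues
// in pairs across the two halves (2 x 544): 0, 1, 544, 545, 2, 3, 546, 547, ...
void build_reorder(int32_t* reorder) // length 1088
{
    const int half = 544;
    int idx = 0;
    for (int k = 0; k < half; k += 2) {
        reorder[idx++] = k;            // 0, 2, 4, ...
        reorder[idx++] = k + 1;        // 1, 3, 5, ...
        reorder[idx++] = half + k;     // 544, 546, ...
        reorder[idx++] = half + k + 1; // 545, 547, ...
    }
}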
On my PC with LV 2016 64-bit this takes ~2 sec to run, which is only 3.5 ns for each index array call. But I'm wondering if it could be done faster.
Thank you
06-06-2025 09:47 PM - edited 06-06-2025 10:32 PM
I believe this gives the same output and is faster. I am sure there is a lot of slack left...
(never mind, doing some tests)
As a first step, disable debugging and parallelize one of the FOR loops. (The inner loop seems faster in my testing, but no guarantees.)
06-07-2025 01:31 PM - edited 06-07-2025 01:36 PM
EDIT: Strike this; I misread the post. I thought you were just rearranging columns, but then realized the rows get reordered too.
06-07-2025 01:40 PM - edited 06-07-2025 01:45 PM
@Gregory wrote:
On my PC with LV 2016 64-bit this takes ~2 sec to run, which is only 3.5 ns for each index array call. But I'm wondering if it could be done faster.
Not sure why you included array allocation in the benchmark, but anyway, if I were to tackle this problem, I would probably do something like this:
#include "extcode.h"
/* lv_prolog.h and lv_epilog.h set up the correct alignment for LabVIEW data. */
#include "lv_prolog.h"
typedef struct {
int32_t dimSizes[2];
uint16_t elt[1];
} TD1;
typedef TD1** TD1Hdl;
typedef struct {
int32_t dimSize;
int32_t entry[1];
} TD2;
typedef TD2** TD2Hdl;
typedef struct {
int32_t dimSize;
uint16_t elt[1];
} TD3;
typedef TD3** TD3Hdl;
#include "lv_epilog.h"
//==============================================================================
// Sequential writing - single thread
//
extern "C" __declspec(dllexport) void fn_seq_wr_st(TD1Hdl Src, TD2Hdl Reorder, TD3Hdl Dst)
{
    uint16_t* src = (*Src)->elt;
    int32_t* reorder = (*Reorder)->entry;
    int rows = (*Src)->dimSizes[0];
    int cols = (*Src)->dimSizes[1];
    size_t size = (size_t)rows * cols;
    NumericArrayResize(uW, 1, (UHandle*)&Dst, size);
    (*Dst)->dimSize = (int32_t)size;
    uint16_t* dst = (*Dst)->elt;
    int idx = 0;
    for (int i = 0; i < cols; i++) {        // 524000
        for (int j = 0; j < rows; j++) {    // 1088
            dst[idx++] = src[reorder[j] * cols + i];
        }
    }
}
Then calling it from LabVIEW is as simple as this:
In general, there are two major strategies for such reordering: you can read in the 'native' order and then write to the destination sequentially, word by word, as shown above; or, conversely, you can read sequentially and write according to the required order. To do the latter, I need to invert your 'reorder' array first, something like this:
//==============================================================================
// Sequential reading - single thread
//
extern "C" __declspec(dllexport) void fn_seq_rd_st(TD1Hdl Src, TD2Hdl Reorder, TD3Hdl Dst)
{
    uint16_t* src = (*Src)->elt;
    int32_t* reorder = (*Reorder)->entry;
    int32_t* reordered = (int32_t*)malloc(((*Reorder)->dimSize) * sizeof(int32_t));
    if (!reordered) return;
    int rows = (*Src)->dimSizes[0];
    int cols = (*Src)->dimSizes[1];
    size_t size = (size_t)rows * cols;
    // Invert the permutation: reordered[k] tells where source row k lands.
    for (int i = 0; i < rows; i++) reordered[reorder[i]] = i;
    NumericArrayResize(uW, 1, (UHandle*)&Dst, size);
    (*Dst)->dimSize = (int32_t)size;
    uint16_t* dst = (*Dst)->elt;
    int idx = 0;
    for (int j = 0; j < rows; j++) {        // 1088
        int dst_row = reordered[j];
        for (int i = 0; i < cols; i++) {    // 524000
            dst[dst_row + i * rows] = src[idx++];
        }
    }
    free(reordered);
}
From a multithreading perspective, it's much easier to do this inside the DLL rather than in LabVIEW. The code is below:
Sequential writing:
//==============================================================================
// Multithreaded functions for sequential writing and reading
//
// Common thread context structure to pass parameters to threads:
typedef struct {
    int start_col;
    int end_col;
    int start_row;
    int end_row;
    int cols;
    int rows;
    uint16_t* src;
    int32_t* reorder;
    uint16_t* dst;
} ThreadContext;

//==============================================================================
// Sequential writing thread
//
DWORD WINAPI thread_seq_wr(LPVOID param) {
    ThreadContext* ctx = (ThreadContext*)param;
    int idx = ctx->start_col * ctx->rows;   // first output element of this chunk
    int rows = ctx->rows;
    int cols = ctx->cols;
    uint16_t* src = ctx->src;
    uint16_t* dst = ctx->dst;
    int32_t* reorder = ctx->reorder;
    for (int i = ctx->start_col; i < ctx->end_col; i++) {
        for (int j = 0; j < rows; j++) {
            dst[idx++] = src[reorder[j] * cols + i];
        }
    }
    return 0;
}
//==============================================================================
// Sequential writing - multi thread
//
extern "C" __declspec(dllexport) void fn_seq_wr_mt(TD1Hdl Src, TD2Hdl Reorder, TD3Hdl Dst, int num_threads)
{
    uint16_t* src = (*Src)->elt;
    int32_t* reorder = (*Reorder)->entry;
    int rows = (*Src)->dimSizes[0];
    int cols = (*Src)->dimSizes[1];
    size_t size = (size_t)rows * cols;
    NumericArrayResize(uW, 1, (UHandle*)&Dst, size);
    (*Dst)->dimSize = (int32_t)size;
    uint16_t* dst = (*Dst)->elt;
    if (num_threads < 1) num_threads = 1;
    if (num_threads > cols) num_threads = cols;  // columns are partitioned here
    HANDLE* threads = (HANDLE*)malloc(num_threads * sizeof(HANDLE));
    ThreadContext* contexts = (ThreadContext*)malloc(num_threads * sizeof(ThreadContext));
    if (!threads || !contexts) { free(threads); free(contexts); return; }
    // Split the columns into num_threads nearly equal chunks:
    int chunk = cols / num_threads;
    int remainder = cols % num_threads;
    int col = 0;
    for (int t = 0; t < num_threads; t++) {
        int start_col = col;
        int end_col = col + chunk + (t < remainder ? 1 : 0);
        contexts[t].start_col = start_col;
        contexts[t].end_col = end_col;
        contexts[t].cols = cols;
        contexts[t].rows = rows;
        contexts[t].src = src;
        contexts[t].reorder = reorder;
        contexts[t].dst = dst;
        threads[t] = CreateThread(NULL, 0, thread_seq_wr, &contexts[t], 0, NULL);
        col = end_col;
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    for (int t = 0; t < num_threads; t++) CloseHandle(threads[t]);
    free(threads);
    free(contexts);
}
Or sequential reading:
//==============================================================================
// Sequential reading thread
//
DWORD WINAPI thread_seq_rd(LPVOID param) {
    ThreadContext* ctx = (ThreadContext*)param;
    int rows = ctx->rows;
    int cols = ctx->cols;
    int idx = ctx->start_row * cols;        // first source element of this chunk
    for (int j = ctx->start_row; j < ctx->end_row; j++) {
        int dst_row = ctx->reorder[j];      // ctx->reorder holds the inverted permutation
        for (int i = 0; i < cols; i++) {
            // equivalent to: ctx->dst[dst_row + i * rows] = ctx->src[j * cols + i];
            ctx->dst[dst_row + i * rows] = ctx->src[idx++];
        }
    }
    return 0;
}
//==============================================================================
// Sequential reading - multi thread
//
extern "C" __declspec(dllexport) void fn_seq_rd_mt(TD1Hdl Src, TD2Hdl Reorder, TD3Hdl Dst, int num_threads)
{
    uint16_t* src = (*Src)->elt;
    int32_t* reorder = (*Reorder)->entry;
    int32_t* reordered = (int32_t*)malloc(((*Reorder)->dimSize) * sizeof(int32_t));
    if (!reordered) return;
    int rows = (*Src)->dimSizes[0];
    int cols = (*Src)->dimSizes[1];
    size_t size = (size_t)rows * cols;
    // Invert the permutation so each source row knows its destination row:
    for (int i = 0; i < rows; i++) reordered[reorder[i]] = i;
    NumericArrayResize(uW, 1, (UHandle*)&Dst, size);
    (*Dst)->dimSize = (int32_t)size;
    uint16_t* dst = (*Dst)->elt;
    if (num_threads < 1) num_threads = 1;
    if (num_threads > rows) num_threads = rows;  // rows are partitioned here
    HANDLE* threads = (HANDLE*)malloc(num_threads * sizeof(HANDLE));
    ThreadContext* contexts = (ThreadContext*)malloc(num_threads * sizeof(ThreadContext));
    if (!threads || !contexts) { free(threads); free(contexts); free(reordered); return; }
    // Split the rows into num_threads nearly equal chunks:
    int chunk = rows / num_threads;
    int remainder = rows % num_threads;
    int row = 0;
    for (int t = 0; t < num_threads; t++) {
        int start_row = row;
        int end_row = row + chunk + (t < remainder ? 1 : 0);
        contexts[t].start_row = start_row;
        contexts[t].end_row = end_row;
        contexts[t].cols = cols;
        contexts[t].rows = rows;
        contexts[t].src = src;
        contexts[t].reorder = reordered;
        contexts[t].dst = dst;
        threads[t] = CreateThread(NULL, 0, thread_seq_rd, &contexts[t], 0, NULL);
        row = end_row;
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    for (int t = 0; t < num_threads; t++) CloseHandle(threads[t]);
    free(threads);
    free(contexts);
    free(reordered);
}
I tested all four variants (single-threaded and multithreaded, with both sequential read and write). All right, everything works as expected:
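To double-check the index math outside LabVIEW, here is a minimal plain-buffer cross-check (a sketch with tiny hypothetical sizes; it confirms that the sequential-write and sequential-read loops above produce identical output):
#include <stdint.h>
#include <assert.h>

int main(void)
{
    // Tiny stand-ins for the real 1088 x 524000 data:
    enum { rows = 4, cols = 6 };
    const int32_t reorder[rows] = { 2, 0, 3, 1 };   // an arbitrary permutation
    uint16_t src[rows * cols], dst_wr[rows * cols], dst_rd[rows * cols];
    for (int k = 0; k < rows * cols; k++) src[k] = (uint16_t)k;

    // Sequential-write loop (as in fn_seq_wr_st):
    int idx = 0;
    for (int i = 0; i < cols; i++)
        for (int j = 0; j < rows; j++)
            dst_wr[idx++] = src[reorder[j] * cols + i];

    // Sequential-read loop (as in fn_seq_rd_st), via the inverted permutation:
    int32_t reordered[rows];
    for (int j = 0; j < rows; j++) reordered[reorder[j]] = j;
    idx = 0;
    for (int j = 0; j < rows; j++) {
        int dst_row = reordered[j];
        for (int i = 0; i < cols; i++)
            dst_rd[dst_row + i * rows] = src[idx++];
    }

    // Both strategies must agree element for element:
    for (int k = 0; k < rows * cols; k++) assert(dst_wr[k] == dst_rd[k]);
    return 0;
}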
Results: I benchmarked only the multithreaded implementations, on two different PCs.
On the 4-core i7-4940MX, the best result came with four threads: 1023 MB/s, or 1.11 s.
An interesting point is that sequential writing runs significantly faster than reading.
On the more powerful dual Xeon Gold system (2×16 cores, or 64 logical CPUs), the results are slightly better, even at a lower CPU frequency (only 2.30 GHz vs 3.10 GHz on the i7). The best performance is achieved at 31 threads, reaching 1.67 GB/s or 680 ms:
But on the dual Xeon system, I have NUMA, which means the RAM is divided into two banks, each dedicated to a specific CPU. The best results can be achieved only if I allocate memory on the appropriate bank and then run the workload on the corresponding CPU. Accessing the "wrong" bank from a CPU is not prohibited, but it incurs significant penalties. In general, this PC is equipped with 12-channel RAM, so I can expect a little bit more performance (but it depends).
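For completeness, Windows exposes both knobs directly. Here is a sketch (not part of the attached project; error handling omitted) of allocating a buffer on a given NUMA node and pinning the calling thread to that node's processors:
#include <windows.h>
#include <stdint.h>

// Allocate a buffer on a specific NUMA node (free it later with VirtualFree):
uint16_t* alloc_on_node(size_t bytes, DWORD node)
{
    return (uint16_t*)VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                                         MEM_RESERVE | MEM_COMMIT,
                                         PAGE_READWRITE, node);
}

// Pin the calling thread to the processors belonging to that node:
void pin_current_thread_to_node(USHORT node)
{
    GROUP_AFFINITY affinity;
    if (GetNumaNodeProcessorMaskEx(node, &affinity))
        SetThreadGroupAffinity(GetCurrentThread(), &affinity, NULL);
}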
Something like that. The project is in the attachment: LabVIEW 2025 Q1 v25.1.2f2 (saved as 2016) + Visual Studio 2022 v17.14.4. A 32-bit version is provided, but it is out of scope here because it runs out of memory.
06-07-2025 04:46 PM - edited 06-07-2025 04:48 PM
By the way, if array allocations are removed from the benchmarked area (for both LabVIEW and the DLL, keeping only the reordering), the results look much better.
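On the DLL side, that just means starting the clock only after the buffers exist. A sketch (the function-pointer signature here is mine, standing in for any of the reordering loops above):
#include <windows.h>
#include <stdint.h>

// Time only the reordering itself; allocation and resize happen before t0.
typedef void (*reorder_fn)(const uint16_t* src, const int32_t* reorder,
                           uint16_t* dst, int rows, int cols);

double time_reorder(reorder_fn fn, const uint16_t* src, const int32_t* reorder,
                    uint16_t* dst, int rows, int cols)
{
    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    fn(src, reorder, dst, rows, cols);   // the measured region
    QueryPerformanceCounter(&t1);
    return (double)(t1.QuadPart - t0.QuadPart) / (double)f.QuadPart; // seconds
}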
Additionally, some minor optimizations can be made. For example, we can precompute offsets from the reorder array (offsets being an int32_t array of rows elements):
for (int i = 0; i < rows; i++) offsets[i] = cols * reorder[i];
Then use these offsets in the reordering loop to save a multiplication per element:
DWORD WINAPI thread_seq_wr(LPVOID param) {
    ThreadContext* ctx = (ThreadContext*)param;
    int idx = ctx->start_col * ctx->rows;
    int rows = ctx->rows;
    uint16_t* src = ctx->src;
    uint16_t* dst = ctx->dst + idx;     // write pointer for this chunk
    int32_t* offsets = ctx->reorder;    // precomputed offsets passed via the context
    for (int i = ctx->start_col; i < ctx->end_col; i++) {
        for (int j = 0; j < rows; j++) {
            *dst++ = src[offsets[j] + i];
        }
    }
    return 0;
}
This is with 8 threads on the i7-4940MX:
LabVIEW - 1.62 s; DLL - 680 ms.
And this is the 64-thread dual Xeon:
The best LabVIEW time here was 880 ms at 16 threads, but the DLL achieved 47.5 ms, which means a processing speed of more than 20 GB/s. This is a more or less expected result on that PC. There is also an interesting performance drop after 32 threads, but that is another story.
06-08-2025 12:31 AM - edited 06-08-2025 12:35 AM
And finally, just one more thing: from this point, it makes sense to use Intel VTune Profiler and check for hotspots. Just turn the benchmark into a simple app and run it — you will see immediately:
And inside this function (the cool thing is that we have the source here):
Or in Assembly, if preferred:
The result is obvious. The rest of the time is spent in LabVIEW, right here:
This is a copy of the destination array to the output wire (not the fastest method, by the way).
Some further optimizations are possible, for example loop unrolling (the Intel compiler can do that automatically), or maybe AVX2 (_mm256_i32gather_epi32 may give some boost, but you are working with U16, which is less ideal here). Anyway, squeezing out every next millisecond will get harder and harder. Usually it only makes sense to do reasonable optimizations and not overdo it; otherwise, code maintenance will turn into hell.
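For illustration only, a gather-based inner loop could look like this (a sketch, not benchmarked against the project above; the masking and repacking needed to get from 32-bit gathers back to U16 is exactly the overhead that makes U16 less ideal here):
#include <immintrin.h>
#include <stdint.h>

// Gather 16 U16 elements per iteration via 32-bit gathers (scale = 2 treats
// the indices as U16 element offsets). Assumes src is padded by at least one
// element so the 32-bit read at the highest index stays in bounds.
void gather_u16_avx2(const uint16_t* src, const int32_t* indices,
                     uint16_t* dst, int n)
{
    const __m256i mask16 = _mm256_set1_epi32(0xFFFF);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i idx0 = _mm256_loadu_si256((const __m256i*)(indices + i));
        __m256i idx1 = _mm256_loadu_si256((const __m256i*)(indices + i + 8));
        __m256i g0 = _mm256_i32gather_epi32((const int*)src, idx0, 2);
        __m256i g1 = _mm256_i32gather_epi32((const int*)src, idx1, 2);
        g0 = _mm256_and_si256(g0, mask16);                // keep only the low U16
        g1 = _mm256_and_si256(g1, mask16);
        __m256i packed = _mm256_packus_epi32(g0, g1);     // 32 -> 16 bit, per lane
        packed = _mm256_permute4x64_epi64(packed, 0xD8);  // restore element order
        _mm256_storeu_si256((__m256i*)(dst + i), packed);
    }
    for (; i < n; i++) dst[i] = src[indices[i]];          // scalar tail
}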
06-09-2025 11:45 AM
Hi guys, thank you very much for the suggestions.
I took the arrays out of the benchmarked frame and the processing time was about 1.8 sec. Disabling debugging got it to 1.3 sec, and parallelizing the outer loop with 8 instances got it down to 0.8 sec. This may be good enough for now... Parallelizing the inner loop actually raised the time to ~4.5 sec, so I'm not sure why I'm seeing the opposite of altenbach. I think I can continue with this for now, and if it becomes an issue I'll return to Andrey's very thorough DLL examples 🙂
06-09-2025 04:36 PM
@Gregory wrote:
, so I'm not sure why I'm seeing the opposite of altenbach. I think I can continue with this for now, and if it becomes an issue I'll return to Andrey's very thorough DLL examples 🙂
Must have been a fluke (no, not the instrument company!) on my side, because I can no longer reproduce it. 😄 But yes, parallelizing the outer loop gives me about a factor of 3-4x faster. My laptop is set to "balanced", and I might see a factor of two between runs of the same code.