
Creating a DLL to work on 2D Arrays

Solved!

@Andrey_Dmitriev wrote:

 

By the way, may I see the source of AllocateUint16Array()? I've scrolled this topic from top to bottom, but I was unable to find it (maybe I'm blind).

AllocateUint16Array() is exported by the DLL that LabVIEW creates. It is most likely similar to this code:

 

Uint16Array AllocateUint16Array(int32 *dimSizeArr)
{
    Uint16Array handle = NULL;
    MgErr err = NumericArrayResize(uW, 2, (UHandle*)&handle, dimSizeArr[0] * dimSizeArr[1]);
    if (!err)
    {
        /* NumericArrayResize() only (re)sizes the data block;
           the dimension sizes still need to be filled in */
        (*handle)->dimSizes[0] = dimSizeArr[0];
        (*handle)->dimSizes[1] = dimSizeArr[1];
        return handle;
    }
    return NULL;
}
Rolf Kalbermatter
My Blog
Message 31 of 47

@rolfk wrote:

@Andrey_Dmitriev wrote:

 

By the way, may I see the source of AllocateUint16Array()? I've scrolled this topic from top to bottom, but I was unable to find it (maybe I'm blind).

AllocateUint16Array() is exported by the DLL that LabVIEW creates. It is most likely similar to this code:

 

Uint16Array AllocateUint16Array(int32 *dimSizeArr)
{
    Uint16Array handle = NULL;
    MgErr err = NumericArrayResize(uW, 2, (UHandle*)&handle, dimSizeArr[0] * dimSizeArr[1]);
    if (!err)
    {
        /* NumericArrayResize() only (re)sizes the data block;
           the dimension sizes still need to be filled in */
        (*handle)->dimSizes[0] = dimSizeArr[0];
        (*handle)->dimSizes[1] = dimSizeArr[1];
        return handle;
    }
    return NULL;
}

Ah, OK, thank you. But this is just a resize of an already allocated handle. Well, I'll do one experiment within the coming lunch break, if time permits.

Message 32 of 47

There is also an exported function to resize an existing array. It is, without some possible extra checks, similar to this:

 

MgErr ResizeUint16Array (Uint16Array *hdlPtr, int32 *dimSizeArr)
{
    MgErr err = NumericArrayResize(uW, 2, (UHandle*)hdlPtr, dimSizeArr[0] * dimSizeArr[1]);
    return err;
}

Notice that this function can be called with *hdlPtr == NULL (a pointer to a NULL handle) and it will then allocate a new handle instead of resizing an existing one. In that case it is in fact equivalent to a call of the AllocateUint16Array() function.
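
As an illustration, resizing an existing array from the C side could look like this (the variable names are just for the example; since the snippet above only resizes the data block, the dimSizes fields are set explicitly here, which is harmless if the exported function already does that itself):

 

int32 newDims[2] = {480, 640};    /* rows, cols */
MgErr err = ResizeUint16Array(&imageHandle, newDims);
if (!err)
{
    (*imageHandle)->dimSizes[0] = newDims[0];
    (*imageHandle)->dimSizes[1] = newDims[1];
}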

 

One important thing to remember is that LabVIEW internally treats a NULL handle as equivalent to an empty handle. If you pass a handle to a C function by value, the LabVIEW Call Library Node will always make sure to pass an explicit (non-NULL) empty handle, since there is no way to return a newly created handle from the function. But if you pass it by reference as a pointer, LabVIEW will happily pass its internal NULL handle to the function, and if your C code is not prepared to handle that, it will simply crash when trying to access the contents of the handle.
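
As an illustration only (not the generated code, and the function name is made up for the example), a C function that receives such an array handle by reference could guard against the NULL handle along these lines:

 

int32 SumUint16Array(Uint16Array *hdlPtr)
{
    int32 i, n, sum = 0;
    /* a NULL handle is LabVIEW's representation of an empty array */
    if (!hdlPtr || !*hdlPtr)
        return 0;
    n = (**hdlPtr)->dimSizes[0] * (**hdlPtr)->dimSizes[1];
    for (i = 0; i < n; i++)
        sum += (**hdlPtr)->Numeric[i];
    return sum;
}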

Rolf Kalbermatter
My Blog
Message 33 of 47

@rolfk wrote:

There is also an exported function to resize an existing array. It is, without some possible extra checks, similar to this:

 

MgErr ResizeUint16Array (Uint16Array *hdlPtr, int32 *dimSizeArr)
{
    MgErr err = NumericArrayResize(uW, 2, (UHandle*)hdlPtr, dimSizeArr[0] * dimSizeArr[1]);
    return err;
}

Notice that this function can be called with *hdlPtr == NULL ... 


OK, got this. It wasn't so complicated.

Let's say we have a VI which needs to be turned into a DLL using native LabVIEW arrays:

snippet.png

Together with the array, I will output the size and the sum of all "pixels" to be sure that everything works fine.

Now I will create two build specs - one for 32-bit and another one for x64:

Screenshot 2024-05-21 15.02.39.png

In each one I will set the name accordingly, 32 for the 32-bit build and 64 for the 64-bit build:

Screenshot 2024-05-21 15.03.54.png

VI Prototype by default:

Screenshot 2024-05-21 15.05.31.png

This is how this DLL is called from within LabVIEW (the 32-bit or 64-bit version is selected automatically because of the '*'):

snippet2.png

Note that there is no "deallocation" here; LabVIEW will take care of this.

The image arrays are passed as pointers to handles:

Screenshot 2024-05-21 15.24.07.png

I'll also put the call in a while loop to check for memory leaks; everything is just fine:

Screenshot 2024-05-21 15.08.33.png

So far so good.

Now, how is it called from C? From the two build specs I get two headers, and they are different because of struct alignment.

This is the 32-bit version:

 

#include "extcode.h"
#pragma pack(push)
#pragma pack(1)

#ifdef __cplusplus
extern "C" {
#endif
typedef struct {
	int32_t dimSizes[2];
	uint16_t Numeric[1];
} Uint16ArrayBase;
typedef Uint16ArrayBase **Uint16Array;

/*!
 * ImageIncrement
 */
void __cdecl ImageIncrement(Uint16Array *_2DImageU16In, 
	Uint16Array *_2DImageU16Out, int32_t *WidthCols, int32_t *HeightRows, 
	uint16_t *sum);

MgErr __cdecl LVDLLStatus(char *errStr, int errStrLen, void *module);

/*
* Memory Allocation/Resize/Deallocation APIs for type 'Uint16Array'
*/
Uint16Array __cdecl AllocateUint16Array (int32 *dimSizeArr);
MgErr __cdecl ResizeUint16Array (Uint16Array *hdlPtr, int32 *dimSizeArr);
MgErr __cdecl DeAllocateUint16Array (Uint16Array *hdlPtr);

void __cdecl SetExecuteVIsInPrivateExecutionSystem(Bool32 value);

#ifdef __cplusplus
} // extern "C"
#endif

#pragma pack(pop)

 

The 64-bit version doesn't have the #pragma pack.

And LabVIEW kindly created the allocation and deallocation functions.

Here is how they are used:

#include <ansi_c.h>
#ifdef WIN64
#include "include/SharedLib64.h"
#else
#include "include/SharedLib32.h"
#endif

int main (int argc, char *argv[])
{
	Uint16Array srcImage, dstImage;
	int32 dimSizeArr[2] = {2, 3}; //rows, cols
	int Width, Height;
	unsigned short Sum;
	
	srcImage = AllocateUint16Array (dimSizeArr);
	dstImage = AllocateUint16Array (NULL); //will be resized in increment DLL

	ImageIncrement(&srcImage, &dstImage, &Width, &Height, &Sum);
	printf("ImgInc: Width = %d, Height = %d, Sum = %d\n", Width, Height, Sum);
	//Cross check:
	unsigned short test = (*dstImage)->Numeric[0];
	int height = (*dstImage)->dimSizes[0];
	int width = (*dstImage)->dimSizes[1];
	printf("Check: width = %d, height = %d, pix = %d\n", width, height, test);
	
	DeAllocateUint16Array(&srcImage);
	DeAllocateUint16Array(&dstImage);

	return 0;
}

 

 

At the beginning, I select the proper header with #ifdef WIN64.

Then memory needs to be allocated: for the source with the desired size, but for the destination just an (empty) handle, because the array will be properly resized within the DLL automatically.

After that, both are deallocated (I checked with a loop; there is no leakage).
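
A minimal sketch of such a leak check, assuming the same exported functions from the header above, could be:

 

/* repeatedly allocate and release while watching the process memory usage */
int i;
Uint16Array tmp;
for (i = 0; i < 1000000; i++) {
	tmp = AllocateUint16Array(dimSizeArr);
	DeAllocateUint16Array(&tmp);
}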

Result:

 

 

D:\Image Experiment\C src>TestApp64.exe
ImgInc: Width = 3, Height = 2, Sum = 6
Check: width = 3, height = 2, pix = 1

 

 

Each 32-bit and 64-bit app will be linked with its own library; CVI takes care of this "automatically":

Screenshot 2024-05-21 15.15.44.png

Theoretically, we can replace the AllocateUint16Array() and DeAllocateUint16Array() functions with direct calls to LabVIEW's memory manager, but this is out of scope, and I don't think it makes sense.
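
Just as a sketch of the idea (untested; the helper names are made up here, and the program would then need to link against labviewv.lib from LabVIEW's cintools directory, or resolve the symbols from the run-time engine), it could look roughly like this:

 

#include "extcode.h"

/* allocate a rows x cols 2D U16 array handle directly via the memory manager */
static Uint16Array AllocU16_2D(int32 rows, int32 cols)
{
	Uint16Array handle = NULL;
	MgErr err = NumericArrayResize(uW, 2, (UHandle*)&handle, rows * cols);
	if (err)
		return NULL;
	(*handle)->dimSizes[0] = rows;
	(*handle)->dimSizes[1] = cols;
	return handle;
}

/* dispose of the handle again */
static void FreeU16_2D(Uint16Array *hdlPtr)
{
	if (hdlPtr && *hdlPtr) {
		DSDisposeHandle((UHandle)*hdlPtr);
		*hdlPtr = NULL;
	}
}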

The test project is in the attachment. It has been downgraded to LabVIEW 2017, but the DLLs are from LabVIEW 2024, so you may need to recompile them if necessary. I'll leave the exercise with Python for someone else (it will work as well, for sure).

 

Message 34 of 47

You can actually combine both types of headers into a single header by doing one of these things:

 

#include "extcode.h"
#ifdef __cplusplus
extern "C" {
#endif
#if MSWin && (ProcessorType == kX86)
#pragma pack(push, 1)
#endif
typedef struct {
	int32_t dimSizes[2];
	uint16_t Numeric[1];
} Uint16ArrayBase;
typedef Uint16ArrayBase **Uint16Array;
#if MSWin && (ProcessorType == kX86)
#pragma pack(pop)
#endif

.......

 

#include "extcode.h"
#ifdef __cplusplus
extern "C" {
#endif
#include "lv_prolog.h"
typedef struct {
	int32_t dimSizes[2];
	uint16_t Numeric[1];
} Uint16ArrayBase;
typedef Uint16ArrayBase **Uint16Array;
#include "lv_epilog.h"

.......

 

Unfortunately, the LabVIEW DLL Builder does not do this, even though the corresponding lv_prolog.h and lv_epilog.h have been shipped with LabVIEW for many, many moons in the same cintools directory as the also-included extcode.h.

Rolf Kalbermatter
My Blog
Message 35 of 47

@rolfk wrote:

You can actually combine both types of headers into a single header by doing one of these things:

 


Yes, I know this trick with prolog/epilog, but from a build architecture point of view, it may make sense to keep these headers untouched during active development when new functions are being added again and again.

Just one more thing: there is a function called SetExecuteVIsInPrivateExecutionSystem(). Here is some documentation on it: Characteristics of LabVIEW-Built Shared Libraries.
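
For completeness, calling it from C is trivial (see the linked document for what the flag actually controls and when you would want to set it):

 

/* sketch: select whether exported VIs run in a private execution system */
SetExecuteVIsInPrivateExecutionSystem(1);	/* Bool32: 1 = TRUE, 0 = FALSE */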

Message 36 of 47

Thank you for the incredible amounts of detail in this discussion!

Message 37 of 47

@Gregory wrote:

Thank you for the incredible amounts of detail in this discussion!


You're welcome! And thanks for the discussion! Special thanks to Rolf for the valuable notes. I use such small exercises from time to time for my own learning as well. Glad to see it was helpful for you, and hopefully it will be helpful for someone else. Behind the scenes, I also decompiled this DLL, just out of curiosity about what is inside. I'm not sure about your final goal for this project, but if you would like to develop an image processing library based on LabVIEW's arrays and native LabVIEW code, it is possible; just note that this code will be very, very slow, because LabVIEW's compiler is still not very efficient. This is the price we pay for the convenient graphical environment with intelligent memory management. If you do the same in C and then compile the code with an efficient compiler, for example the Intel oneAPI compiler (forget about CVI), then you will get a 3x-10x, sometimes larger, speed improvement.

Message 38 of 47

I wouldn’t say the compiler is actually bad at all. However it comes from a completely different direction than C compilers. LabVIEW’s dataflow programming is inherently parallel in nature and that requires lots of care when treating data to prevent causing race conditions and data corruption. LabVIEW does a great job with that but it requires extra measures: Data is not just a pointer in memory but an inherently managed object with not only memory management rules but also specific data access rules. This management requires resources in the form of extra code that needs to be executed to guarantee consistency in both time and space.

C, on the other hand, comes from the origin that everything is a pointer and anyone can access it whenever they want. CPUs added complicated virtual memory and memory protection mechanisms in hardware to at least allow process isolation, so that a rogue or buggy process can't take down other processes, but inside a process things are still inherently unprotected. In C you need to program in a way that guarantees that concurrent access cannot happen, or things go awry very fast. For many C programs that is not so difficult, since they only use one thread anyway. When you want to do more, you have to make a serious effort to create and manage additional threads, and you have to start to worry about concurrent data access. A C compiler by default assumes that any object passed into a function is the exclusive property of that function for the duration of the function call, and it optimizes the code aggressively based on that assumption. If your calling program doesn't guarantee that this assumption holds, you really are in serious trouble. 
For most applications the extra overhead that LabVIEW incurs to guarantee consistency is relatively small, but for big matrix operations such as image processing this overhead can add up.

But blindly moving routines to C doesn't help. There is always some impedance mismatch between a managed environment like LabVIEW and a different system, be it unmanaged C or the fairly different management contract that OpenCV has. Unmanaged C is pretty easy in that you can simply use the LabVIEW-provided memory management functions and be done with it, but that requires learning all the intricacies of the LabVIEW memory management rules. Interfacing to a different managed environment makes things even more interesting, as you now have to understand both management contracts very well and also translate between the two. Such translation very often destroys any performance advantage that you hoped to gain from placing certain routines in highly optimized external code. If it is not designed from the ground up with all this in mind, your interfacing with an external library for the purpose of performance optimization is at best a proof that it can be done but without real performance gains; at worst it is a crashing construction site that will cost you all your hair and eventually make you abandon it in despair.

Rolf Kalbermatter
My Blog
Message 39 of 47

@rolfk wrote:

I wouldn’t say the compiler is actually bad at all. However it comes from a completely different direction than C compilers. LabVIEW’s dataflow programming is inherently parallel in nature and that requires lots of care when treating data to prevent causing race conditions and data corruption. LabVIEW does a great job ...


No, I won't say that the compiler is bad, and yes, it does an awesome job. However, sometimes it is not very efficient in terms of the "speed" of the generated code. On the other hand, parallelization has never been so easy before. By the way, the code is not always parallel. For example, if I add a scalar like this:

par.png

then the code will be executed sequentially, because LabVIEW is intelligent enough not to create two threads for this due to the overhead. Internally, large code is split into "chunks" which are executed in parallel; this was described somewhere in NI's knowledge base, but I don't have the link at hand. For sure, two independent while loops placed side by side will be executed truly in parallel in two threads.

 

Back to the overall performance — it is quite simple to demonstrate and measure "LabVIEW vs C". Let's continue with the "image increment" example from this topic.
I will simplify the LabVIEW code up to this:

inc1.png

And add "parallel" version:

inc2.png

(I have 16 logical CPUs)

Screenshot 2024-05-22 14.59.47.png

 

Now I'll create a simple increment like this in two versions, multithreaded and not, and will tell the compiler that the memory is aligned and the iterations are independent: 

 

SHAREDLIBINTEL_API int fnIncImage(uint16_t* src, uint16_t* dst, int Width, int Height )
{
#pragma vector always
#pragma ivdep
    for (int i = 0; i < Width * Height; i++) {
        dst[i] = src[i] + 1;
    }
    return 0;
}

SHAREDLIBINTEL_API int fnIncImagePar(uint16_t* src, uint16_t* dst, int Width, int Height)
{
#pragma omp parallel for num_threads(16)
#pragma vector always
#pragma ivdep
    for (int i = 0; i < Width * Height; i++) {
        dst[i] = src[i] + 1;
    }
    return 0;
}

 

 

It is not the best example for benchmarking, because there is not much computation here and memory bandwidth is the bottleneck, but anyway.

 

For src and dst I will allocate memory aligned to a page-size boundary (4096 bytes), exactly like IMAQ Vision does, and call it like this:

 

src=(uint16_t*)_aligned_malloc(WIDTH * HEIGHT * sizeof(uint16_t), 4096);
dst = (uint16_t*)_aligned_malloc(WIDTH * HEIGHT * sizeof(uint16_t), 4096);

fnIncImage(src, dst, WIDTH, HEIGHT);

 

 

Now the benchmark, full code:

 

//==============================================================================
//
// Title:		Intel compiler vs LabVIEW Benchmark
// Created on:	22.05.2024 at 12:02:29 by AD.
//
//==============================================================================

#include <Windows.h>
#include <stdio.h>
#include <malloc.h>
#include "include/SharedLibIntel.h"
#include "include/SharedLibLabVIEW.h"
#define WIDTH 1024
#define HEIGHT 1024

#define BEGIN_MEASURE QueryPerformanceCounter(&StartTime); \
	for(int i = 0; i < 100; i++) //amount of repetitions

#define END_MEASURE(Message) 	QueryPerformanceCounter(&EndTime); \
	ElapsedMicroseconds.QuadPart = EndTime.QuadPart - StartTime.QuadPart; \
	ElapsedMicroseconds.QuadPart *= 1000000; \
	ElapsedMicroseconds.QuadPart /= Frequency.QuadPart; \
	/* total us / (100 repetitions * 1000): the printed value is per iteration and effectively in milliseconds, despite the microsecond suffix in the format string */ \
	ElapsedTime = (double)(ElapsedMicroseconds.QuadPart)/100000.0; \
	printf(#Message " is %.3f \xE6s\n", ElapsedTime);

int main(int argc, char* argv[])
{
	uint16_t* src, * dst;
	Uint16Array srcImage, dstImage;
	int32 dimSizeArr[2] = { HEIGHT, WIDTH }; //rows, cols

	LARGE_INTEGER StartTime, EndTime, ElapsedMicroseconds, Frequency;
	double ElapsedTime;

	printf("Intel vs LabVIEW Benchmark for image %d x %d\n", WIDTH, HEIGHT);
	QueryPerformanceFrequency(&Frequency);

	src=(uint16_t*)_aligned_malloc(WIDTH * HEIGHT * sizeof(uint16_t), 4096);
	dst = (uint16_t*)_aligned_malloc(WIDTH * HEIGHT * sizeof(uint16_t), 4096);
	srcImage = AllocateUint16Array(dimSizeArr);
	dstImage = AllocateUint16Array(dimSizeArr);

	//warm up
	fnIncImage(src, dst, WIDTH, HEIGHT);
	LabVIEWIncImage(&srcImage, &dstImage);

	BEGIN_MEASURE 	//INTEL Benchmark
		fnIncImage(src, dst, WIDTH, HEIGHT);
	END_MEASURE(fnIncImage)

	BEGIN_MEASURE //INTEL Benchmark Parallel
		fnIncImagePar(src, dst, WIDTH, HEIGHT);
	END_MEASURE(fnIncImagePar)

	BEGIN_MEASURE //LabVIEW Benchmark
		LabVIEWIncImage(&srcImage, &dstImage);
	END_MEASURE(LabVIEWIncImage)

	BEGIN_MEASURE //LabVIEW Benchmark Parallel
		LabVIEWIncImagePar(&srcImage, &dstImage);
	END_MEASURE(LabVIEWIncImagePar)

	_aligned_free(src);
	_aligned_free(dst);
	DeAllocateUint16Array(&srcImage);
	DeAllocateUint16Array(&dstImage);

	return 0;
}

Warm-up calls are necessary to avoid page faults on the first calls.
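
An equivalent warm-up, as a sketch, would be to simply touch every page of the buffers once before the measurements (needs <string.h>), for example:

 

/* pre-fault the pages so the first timed run does not pay for first-touch faults */
memset(src, 0, WIDTH * HEIGHT * sizeof(uint16_t));
memset(dst, 0, WIDTH * HEIGHT * sizeof(uint16_t));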

 

Now the results for 1024x1024 image:

 

>Benchmark.exe
Intel vs LabVIEW Benchmark for image 1024 x 1024
fnIncImage is 0.151 µs
fnIncImagePar is 0.050 µs
LabVIEWIncImage is 0.376 µs
LabVIEWIncImagePar is 1.373 µs

 

for 2048x2048:

 

Intel vs LabVIEW Benchmark for image 2048 x 2048
fnIncImage is 0.646 µs
fnIncImagePar is 0.151 µs
LabVIEWIncImage is 1.801 µs
LabVIEWIncImagePar is 4.627 µs

 

4096x4096:

 

Intel vs LabVIEW Benchmark for image 4096 x 4096
fnIncImage is 4.180 µs
fnIncImagePar is 2.374 µs
LabVIEWIncImage is 10.258 µs
LabVIEWIncImagePar is 19.039 µs

 

and finally for huge 32768x32768:

 

Intel vs LabVIEW Benchmark for image 32768 x 32768
fnIncImage is 297.311 µs
fnIncImagePar is 176.542 µs
LabVIEWIncImage is 722.550 µs
LabVIEWIncImagePar is 1218.413 µs

 

As you can see, the Intel-compiled version is consistently faster than LabVIEW, roughly 2-3x in these runs.

The parallel version in LabVIEW is even slower, because behind the increment there is an already optimized library function from LabVIEW's core, so the parallel For Loop only adds overhead.

If you are interested, here are some more benchmark/optimization examples from past NI Forum threads:

Flat Field Correction algorithm implemented with AVX - roughly three times faster than LabVIEW.

Huge 50x performance boost on Gamma-aware image resampling.

And the same experiment on Mac OS - a 100x ratio (but on a virtual machine).

 

So, I'm not saying that all code ultimately needs to be rewritten in C; moreover, "premature optimization is the root of all evil", as Donald Knuth said, and there is no "silver bullet", of course. However, rewriting some critical bottlenecks in C and compiling them with a highly optimizing compiler really makes sense in some particular cases (and usually there are only a few such places across the whole app).

The source code is in the attachment, but the binaries were optimized for my CPU, so they will only work on Cascade Lake, with the Intel oneAPI C++ Compiler Runtime for Windows installed.

Message 40 of 47