Dear LabVIEW fans,
Motivation:
I'm a physics student who uses LabVIEW for measurement and for data evaluation. I've been a fan since version 6.i (around 2005).
My typical experimental set-up looks like this: lots of wires running to every corner of the lab, left overnight to collect gigabytes of measurement data. Sometimes I also run physics simulations in LabVIEW. So I really depend on gigaflops.
I know there is already an idea for adding CUDA support, but not all of us have an NVIDIA GPU. Typically, at least in our lab, we have Intel i5 CPUs, and some machines have a minimalist AMD graphics card (others just have integrated graphics).
So, since I was interested in getting more flops, I wrote an OpenCL DLL wrapper, and (using a naive Mandelbrot-set calculation for testing) I measured a 10x speed-up on the CPU and a 100x speed-up on the gaming GPU of my home PC, compared with a simple multi-threaded LabVIEW implementation using parallel For Loops. I'm now using this wrapper in my projects.
What's my idea:
-Give an option to those who don't have a CUDA-capable device, and/or who want their app to run on any class of computing device.
-It has to be really easy to use (I've been struggling with C++ syntax and the Khronos OpenCL specification in my free time for almost two years to get my DLL working...).
-It has to be easy to debug (for example, it has to give human-readable, meaningful error messages instead of crashing LabVIEW or causing a BSOD).
What I have implemented so far, to test the idea:
-Get information about the DLL (e.g. "compiled with AMD's APP SDK on 7 August 2013, 64-bit", or similar)
-Initialize OpenCL:
1. Select the preferred OpenCL platform and device (falling back to any platform and CL_DEVICE_TYPE_ALL if not found)
2. Query all properties of the device (clGetDeviceInfo)
3. Create a context and a command queue
4. Compile and build the OpenCL kernel source code
5. Return all details to the user as a string (even if everything succeeded...)
-Read and write memory buffers (e.g. GPU memory)
So far only blocking reads and blocking writes are implemented; I had some bugs with the non-blocking calls.
(again, details are reported to the user as a string)
-Execute a kernel on the selected arrays of data
(again, details are reported to the user as a string)
-Close OpenCL:
release everything, free up memory, etc. (again, details are reported to the user as a string)
Approximate results for your motivation (Mandelbrot-set test, single precision only so far):
10 GFLOPS on a Core 2 Duo (my office PC)
16 GFLOPS on a 6-core AMD Phenom II X6 1055T
typically 50 GFLOPS on an Intel i5
180 GFLOPS on an NVIDIA GTS 450 graphics card
70 GFLOPS on an EVGA SR-2 with two Xeon L5638s (that's 24 logical cores)
520 GFLOPS on a Tesla C2050
(The numbers above are my own results; the manufacturers' spec sheets may claim much higher theoretical FLOPS. When selecting your device, also take memory bandwidth into account, as well as the kind of parallelism in your code. Some devices dislike conditional branches, and the Mandelbrot-set test contains conditional branches.)
Sorry for my bad English, I'm Hungarian.
I'm planning to give my code away, but I still have to clean it up and remove the non-English comments...