03-11-2014 03:46 PM
Hello all,
I have an application that acquires large sets of data (arrays of 1000 x 600000 singles or I32s) from a camera via IMAQ, processes it using LabVIEW GPU, and then saves that data for post-acquisition analysis. According to Memory Management for Large Data Sets:
Create large block diagrams. LabVIEW creates copies of data when calling subVIs.
However, this guideline is currently causing major issues with code readability, modularity, and scalability. The main VI has grown so large that navigating it requires constant scrolling, finding portions of the code takes significant effort, and tracing execution, debugging, and merging are a nightmare. It is also causing big problems as I bring other developers onto the project.
I have mitigated this somewhat using the techniques in Managing Large Data Sets in LabVIEW: most of the large arrays are stored in pre-allocated queues, and these queues are used frequently in our post-analysis steps inside subVIs. I would like to encapsulate a number of the most frequently used sections of our acquisition and processing steps in subVIs and inline them, hopefully retaining the same speed as before (this application would ideally be a real-time application using DSPs or FPGAs if not for other constraints), as described in the "SubVI Overhead" section of VI Execution Speed:
"A third way to minimize subVI overhead is to inline subVIs into their calling VIs. When you inline a subVI, LabVIEW inserts the compiled code of the subVI into the compiled code of the calling VI. If you then make changes to the subVI, LabVIEW recompiles all calling VIs of that subVI to include those changes. Essentially, inlining a subVI removes the need to call the subVI at run time. Instead, LabVIEW executes the subVI code inside the compiled code of the calling VI.
Inlining subVIs is most useful for small subVIs, subVIs within a loop, subVIs with unwired outputs, or subVIs you call only once. To inline a subVI, place a checkmark in the Inline subVI into calling VIs checkbox on the Execution page of the VI Properties dialog box. You must also select Preallocated clone reentrant execution on the same page of the dialog box. LabVIEW automatically preallocates clones for each instance when it inlines the subVI."
My question is: if I don't use queues or functional globals within my inlined subVI, will the subVI generate copies of my data? Are there any other bottlenecks I could be introducing by doing this (such as having debugging enabled on the subVI but disabled on the calling VI)?
03-11-2014 04:06 PM
Other than the pain of debugging, I find that inlining is usually a net positive. It seems to help the compiler perform a few more optimizations that it will not do with a subVI.
My experience is that DVRs (data value references) are compatible with good LV style and software practices. I would use them where possible if your code is becoming unmanageable.
My other advice would be that if you find you have to sacrifice good software practices for the sake of performance, it is time to shift some tasks to C/C++, where you can maintain code quality and interface with LV via CLFNs (Call Library Function Nodes).
03-12-2014 08:17 AM
It seems like Data Value References offer some functionality similar to queues for my purposes, and I could likely find some locations where DVRs would reduce the copies I am making compared with queues. Regarding style and software practices, I think DVRs and queues are similar.
Migrating portions of the code to C++ is something we are investigating, and is what prompted this discussion, but of course there are always trade-offs. With Call Library Function Nodes, it seemed the library would have to do everything at the time it is called. It doesn't appear to be possible to call library_initialize.dll to get the GPU initialized with the data we want to use to process our camera data and then call a library_run.dll to run the actual GPU code. This could mean using a separate C++ program running alongside the LabVIEW program: LabVIEW handles the UI and some of the hardware, C++ handles the processing and the rest of the hardware, and we use some kind of communication between the two.
I started looking into the best frameworks for doing this (Actor Framework?) and realized that trying to maintain the code with its current style/practices, which were required for performance, while also investigating new frameworks probably wasn't sustainable. Now seemed like the time to clean house to ensure the transition is as easy as possible.
03-12-2014 11:36 AM
@ColeV wrote:
It seems like Data Value References offer some functionality similar to queues for my purposes, and I could likely find some locations where DVRs would reduce the copies I am making compared with queues. Regarding style and software practices, I think DVRs and queues are similar.
Migrating portions of the code to C++ is something we are investigating, and is what prompted this discussion, but of course there are always trade-offs. With Call Library Function Nodes, it seemed the library would have to do everything at the time it is called. It doesn't appear to be possible to call library_initialize.dll to get the GPU initialized with the data we want to use to process our camera data and then call a library_run.dll to run the actual GPU code. This could mean using a separate C++ program running alongside the LabVIEW program: LabVIEW handles the UI and some of the hardware, C++ handles the processing and the rest of the hardware, and we use some kind of communication between the two.
I started looking into the best frameworks for doing this (Actor Framework?) and realized that trying to maintain the code with its current style/practices, which were required for performance, while also investigating new frameworks probably wasn't sustainable. Now seemed like the time to clean house to ensure the transition is as easy as possible.
I agree that queues and DVRs are very similar. Both get the job done, but DVRs do it while better describing my intentions in the program. At this point I usually will not replace queues with DVRs unless I am refactoring the code, but all new code gets strong consideration for using DVRs instead.
You can easily manipulate objects in heap memory using your C++ code. This gives the objects a lifetime independent of the scope of the creator. You then have separate CLFN functions to create, manipulate, and destroy your objects. There is no need to have a separate process running in parallel. You have some extra overhead managing the objects (you are responsible for deleting them when you are finished), and you will have pointers running around your LV code disguised as integers. There are ways to mitigate this; I usually let the lowest-level VIs (those at the LV-C++ interface) treat the pointers as integers. By the time you reach the API level, the pointers are usually encapsulated inside a VI or inside a class.
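For illustration only, here is a rough sketch of the kind of DLL interface I mean. All of the names (Processor, ProcessorCreate, ProcessorRun, ProcessorDestroy) are hypothetical, and it assumes a Windows build exporting with __declspec(dllexport); treat it as a pattern sketch, not a drop-in implementation.

// processor_dll.cpp -- hypothetical sketch: exposing a heap-allocated C++
// object to LabVIEW through three CLFN entry points (create/run/destroy).
#include <cstdint>
#include <vector>

// Hypothetical processing object; its lifetime is independent of any single VI call.
class Processor {
public:
    Processor(int32_t width, int32_t height)
        : buffer_(static_cast<size_t>(width) * static_cast<size_t>(height)) {}
    void Run(const float* frame, size_t count) {
        // ... the real processing would go here ...
        (void)frame; (void)count;
    }
private:
    std::vector<float> buffer_;
};

extern "C" {

// Called once from an "initialize" VI; the returned pointer travels through
// the LabVIEW diagram as a pointer-sized integer (an opaque refnum).
__declspec(dllexport) uintptr_t ProcessorCreate(int32_t width, int32_t height) {
    return reinterpret_cast<uintptr_t>(new Processor(width, height));
}

// Called from a "run" VI for each frame; configure the CLFN parameter for
// 'frame' as an array data pointer.
__declspec(dllexport) void ProcessorRun(uintptr_t ref, const float* frame, int32_t count) {
    reinterpret_cast<Processor*>(ref)->Run(frame, static_cast<size_t>(count));
}

// Called once from a "cleanup" VI; you are responsible for this delete.
__declspec(dllexport) void ProcessorDestroy(uintptr_t ref) {
    delete reinterpret_cast<Processor*>(ref);
}

} // extern "C"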
03-12-2014 02:54 PM
You can easily manipulate objects in heap memory using your C++ code. This gives the objects a lifetime independent of the scope of the creator. You then have separate CLFN functions to create, manipulate, and destroy your objects. There is no need to have a separate process running in parallel. You have some extra overhead managing the objects (you are responsible for deleting them when you are finished), and you will have pointers running around your LV code disguised as integers. There are ways to mitigate this; I usually let the lowest-level VIs (those at the LV-C++ interface) treat the pointers as integers. By the time you reach the API level, the pointers are usually encapsulated inside a VI or inside a class.
Do you know of any resources describing how exactly to do this? I have experience programming stand-alone applications in C++ but am a complete noob when it comes to integrating DLLs with LabVIEW. I had come across this thread on CUDA pointers, and the comments on this simple example indicated to me that it was much more difficult than what you describe. Of course, both of those examples explicitly talk about CUDA, not C++ in general. Maybe there is a way to do what you describe in C++ on regular arrays? I can think of uses for this even if CUDA is not involved.
I am just now finding the documentation about callbacks...
03-12-2014 09:33 PM
> I have an application that acquires large sets of data (Arrays of 1000x 600000 singles or I32s...)
OK, that is pretty big :-).
Although out of date, the Using External Code in LabVIEW PDF is pretty comprehensive. Use that if the more up-to-date manuals are too skimpy. (Ignore the part about using CINs; you want to use a shared library on all platforms.)
Creating functions in C/C++ is easy if you can get away with either of these two situations:
a) you can allocate memory in LabView and pass it as a pointer to the C/C++ function
b) you need to allocate the memory in the function as a buffer but don't need to pass it back to LabView.
These can be combined; for example, say your C function needs to allocate a 2 GB buffer, but you can get away with LabView only working with 512 KB at a time. In the first VI, call a function in your DLL that malloc()s 2 GB of RAM and passes a pointer (a scalar) back to LabView. This you basically treat as an opaque refnum. (This could be a pointer to a struct of pointers or whatever you need for your application.) Since the DLL is loaded by LabView, the memory is allocated in your process, but it is not directly accessible to LabView.
Later you want to inspect/modify 512 KB of data starting at index 1234567. No problem: in the analysis VI, you allocate 512 KB of singles (using the LV initialize array function), pass the refnum (buffer pointer) and the pointer to the array of singles into a function in your dll that copies data from your internal buffer to the LabView buffer. Output is the same 512 KB buffer, now populated with your data. This is a regular LabView buffer, so you can split the wire, pass to a SubVI, etc.
Finally, when you are done with the 2 GB buffer, you pass the refnum (buffer pointer) to your cleanup VI, which calls the cleanup function in your DLL, which calls free() on the pointer.
If you can live with these limitations, you do not need to link against any LabView-specific libraries and the implementation is straightforward. You can just write regular C/C++ (with C++, turn off name mangling with extern "C").
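As a concrete sketch of that allocate / read-a-chunk / free pattern (the function names are made up; it assumes a Windows DLL build, a pointer-sized integer CLFN parameter for the refnum, and an array data pointer for the chunk):

// buffer_dll.cpp -- sketch of the opaque-buffer pattern described above.
#include <cstdint>
#include <cstdlib>
#include <cstring>

extern "C" {

// Called once: allocate the big internal buffer and hand the pointer back
// to LabView as a pointer-sized integer (the opaque "refnum").
__declspec(dllexport) uintptr_t BufferCreate(uint64_t numSingles) {
    return reinterpret_cast<uintptr_t>(std::malloc(numSingles * sizeof(float)));
}

// Called to inspect a chunk: LabView pre-allocates 'dest' with Initialize
// Array and passes its data pointer; copy 'count' singles starting at
// 'startIndex' from the internal buffer into it.
__declspec(dllexport) void BufferReadChunk(uintptr_t ref, uint64_t startIndex,
                                           float* dest, int32_t count) {
    const float* src = reinterpret_cast<const float*>(ref);
    std::memcpy(dest, src + startIndex, static_cast<size_t>(count) * sizeof(float));
}

// Called once at the end: release the internal buffer.
__declspec(dllexport) void BufferDestroy(uintptr_t ref) {
    std::free(reinterpret_cast<void*>(ref));
}

} // extern "C"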
However, sometimes you do need to allocate memory in the DLL and pass it back to LabView. Many built-in functions work like this, for example "Ramp Pattern" in the signal generation VIs: you tell it "I want n samples" and it passes back a length-n array. To do this you must call the LabView memory manager functions instead of malloc(), so that your C code allocates a buffer the same way LabView itself does. (A related case is where you need to allocate a handle with LabView so you can pass it to a third-party DLL.) These functions are declared in LabView/cintools/extcode.h; you can also browse them by pointing a "Call Library Function" node at LabView.exe itself, which shows all of the exported functions. (In fact you can call DSNewPtr and DSNewHandle this way.) It's not hard, but unless I am calling somebody else's DLL, I usually take the easy way out and allocate memory in LabView. Then LabView takes care of releasing it as well.
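If you do go that route, a Ramp Pattern-style function might look roughly like the sketch below. I'm assuming the usual cintools declarations (extcode.h, NumericArrayResize, linking against labviewv.lib) and a CLFN parameter configured as Adapt to Type with handles passed by value; the function name is made up.

// lv_array_dll.cpp -- sketch: let the DLL size its own output array by
// using the LabView memory manager (NumericArrayResize) on a handle that
// the calling VI passes in.
#include "extcode.h"   // from <LabView>/cintools; link against labviewv.lib

// 1D array of singles as LabView lays it out: dimension size, then data.
typedef struct {
    int32 dimSize;
    float32 elt[1];
} SGLArray, **SGLArrayHdl;

extern "C" __declspec(dllexport)
MgErr RampLike(SGLArrayHdl arr, int32 nSamples, float32 start, float32 step) {
    // Resize the handle with the LabView memory manager; the handle itself
    // stays valid, only the block it refers to grows.
    MgErr err = NumericArrayResize(fS /* single-precision float type code */, 1,
                                   (UHandle *)&arr, nSamples);
    if (err != mgNoErr)
        return err;

    (*arr)->dimSize = nSamples;
    for (int32 i = 0; i < nSamples; ++i)
        (*arr)->elt[i] = start + step * (float32)i;
    return mgNoErr;
}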
Good luck! You can go a long way with 99% of the code in LabView and 1% in C/C++ where you really need it.
Rob
p.s. You will see references to a "pointer-sized integer". The pointer-sized integer holds 32-bit pointers on x86 and 64-bit pointers on x64 LabView. However, it is always 64 bits on the diagram because of LabView's strong typing; that allows the same LabView code to load on both 32-bit and 64-bit LabView. Anyway, that is what you want to use for pointers, although you could also use a U32 on x86 and a U64 on x64 and it would be fine.
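For example (hypothetical function, assuming a buffer pointer created earlier by the same DLL), the C side sees a real pointer while the diagram just carries an integer wire:

// The CLFN parameter for 'ref' is configured as an unsigned pointer-sized
// integer; on the diagram it shows up as a U64, in the DLL it is a pointer.
#include <cstdint>

extern "C" __declspec(dllexport) float PeekFirstSingle(uintptr_t ref) {
    const float* buffer = reinterpret_cast<const float*>(ref);  // created earlier by this DLL
    return buffer[0];
}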
03-13-2014 09:29 AM
SubVIs do not always cause memory copies. If the terminals in the subVI are on the root block diagram (i.e., NOT inside a structure of any sort), the in-placeness algorithm usually does a good job. I have passed huge data sets through subVIs without issue. However, this is very dependent on which version of LabVIEW you are using and on your code; newer versions will be much better than old. Around LabVIEW 2012 (more or less), optimizations were introduced to avoid copies of large arrays going into and out of subVIs.
Be careful with using external DLLs. As mentioned above, it is very easy to make copies of data when going into and out of them.
I second the use of DVRs for handling large data in new code. I used to use single-element queues, but the semantics of DVRs are virtually identical, and DVRs are easier to use and have more obvious effects on the block diagram. The only thing lost in going from single-element queues to DVRs is the concept of a named reference: queues have it, DVRs do not. This may be why DVRs are a touch faster than single-element queues (last time I checked).
If you want more control over your inlining than the LabVIEW automatic inlining, you can do this yourself with VI server calls. Check out this post for details. Unlocked versions of the code are in the last post to the thread.
03-13-2014 09:47 AM
As a small note, it'd probably be easier on the system to use an array of clusters (or variants) with an array of doubles inside each cluster, storing each 600000-element set in one, instead of one massive 1000 x 600000 array. That would also hold true if you change to DVRs.
Using a massive diagram is rarely a good idea. As mentioned, data copies shouldn't be much of a problem if the subVIs are well formed, and especially if you change to DVRs it'll be a non-issue. The benefits of refactoring the program with subVIs far outweigh most other concerns, I'd say.
/Y
03-14-2014 08:59 AM
Thanks everyone for your input! It sounds like I should be safe with inlining. I am going to start investigating more into what my options are with C++ DLLs.
I do agree with you all that large block diagrams are not the way to go. Unfortunately, short project deadlines meant that I wasn't able to test these options to ensure I was still getting the performance I needed while maintaining quality software. I wish the document on managing large data sets had stressed inlining and highlighted examples where copies were and were not made in subVIs, rather than simply saying "build a bigger block diagram". It's definitely time for me to deal with my technical debt!
DFGray: The application has a pretty extensive user interface, so I have a UI loop and a processing loop working off of a queued message handler. Unfortunately, that means all of our subVIs are located in case structures for the queued message handler. Some of them also need to be called inside loops. For instance: one loop streams data off a camera and pushes it into a queue; a parallel loop pulls that data out of the queue, pushes it to a GPU for processing, and puts the result into a second queue; a third parallel loop pulls the data out of the second queue, performs some statistics, and updates the DAQ controlling our system. Additionally, most of this data can't be discarded and is required for offline analysis after the fact.
It's easy to see how 3 parallel loops with lots of processing, saving and queue push/pulling can create a large block diagram. Add error handling for timeouts, and I am constantly scrolling...
Yamaeda: Because this data is pulled from a camera in a particular format, it seemed best to keep that format for as long as possible. In the offline analysis there is a step where we push the data to a new single-element queue, so the format could change there if there were a good reason to. What's your reasoning on that format being easier on the system? Does it allow easier storage because the system doesn't need to find one contiguous block of memory, and would the current layout lead to fragmentation? One of the first steps my code performs is filling the queues with the maximum possible data size for the next run and then flushing them, as I understand this pre-allocates the memory block. I think this should prevent that kind of fragmentation, but I should mention that if the user performs another acquisition, the queue is flushed and released, a new queue is obtained and allocated with dummy data, and then the user is signaled that they are OK to proceed. Perhaps this is the wrong way to do this part?
03-14-2014 10:02 AM
My apologies, my previous post was not clear. SubVIs can be located inside structures in the calling VI and still maintain in-placeness. In the subVI itself, any inputs/outputs you wish to be in place must have their terminals located on the root of the subVI's block diagram, and the inputs must be required inputs.