
Parallel Consumers Maintain Data Order


@RavensFan Yes you described exactly what is going on. You understood correctly.

@Bob_Schor This system is on a literally brand-new computer: quad-core i7 with 8 threads, 16 GB RAM, PCIe SSD.

I'm not saying the processing is slow by any means. It handles each FFT in under 10 ms, but when there are 100k of them, plus other events going on, they pile up in the queue with only one FFT running at a time. I attempted to use the SVFA Zoom FFT, but we need to cover too large a frequency range (on the order of 200 kHz), and it actually takes about 3x as long as the normal FFT.

The DAQ loop is only concerned with DAQ. It reads the data and passes it off into a channel wire.

 

@drjdpowell That is an interesting approach to the futures idea! I will keep it in mind.

Message 11 of 23
Solution
Accepted by topic author ConnerP

Ah, so I realized that I misunderstood how Tag Channels work. I thought they were like normal queues, where the first block to read the data wins and removes the item from the queue, but they actually behave like a global variable. So my current implementation is wrong! Oops.
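(For anyone else who mixed these up, here is the difference sketched in Python terms; purely my illustration, nothing from LabVIEW itself. A queue hands each element to exactly one reader and removes it; a tag behaves like a global variable whose latest value any number of readers can sample without consuming it.)

import queue

# Queue semantics: the first reader to dequeue an item wins and removes it.
q = queue.Queue()
q.put("FFT result")
print(q.get())      # prints "FFT result"; the item is now gone, another reader would block
print(q.empty())    # True

# Tag semantics: a last-value store, like a global variable.
tag = {"value": None}        # stand-in for a Tag Channel
tag["value"] = "FFT result"
print(tag["value"])          # any reader sees the latest value
print(tag["value"])          # still there; reading does not consume it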

With that in mind, I decided to try out the promise method from @drjdpowell. It requires some overhead, but the implementation is really straightforward. I will probably give this a go next week when I am back in the lab.

 

Here is a snippet for the curious. This is just an MWE: I didn't bother with closing the non-promise queues or giving the worker loops stop conditions; it's just a rough example of the promise logic. I chose wait times long enough to see the effect, with an offset to show that the promise does what it needs to. In my actual case, the computation time for both loops is something like 5 ± 3 ms.

 Promises_Example.png
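Since the snippet is an image, here is a rough Python analogue of the same promise logic, with my own names and timings (and, like the snippet, no stop conditions for the worker loops):

import queue
import random
import threading
import time

work_q = queue.Queue()            # (index, promise) pairs for the workers
ordered_promises = queue.Queue()  # promises held in acquisition order

def worker():
    while True:
        i, promise = work_q.get()
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for the FFT
        promise.put(i * i)                      # fulfill the promise

# Two parallel "heavy computation" loops.
for _ in range(2):
    threading.Thread(target=worker, daemon=True).start()

# Producer: create one single-element queue (the "promise") per data
# block and enqueue it in order, before the work is done.
for i in range(8):
    promise = queue.Queue(maxsize=1)
    ordered_promises.put(promise)
    work_q.put((i, promise))

# Consumer: dequeue promises in order; each get() blocks until that
# block's worker has finished, so output order matches input order
# regardless of which worker finishes first.
for i in range(8):
    print(i, ordered_promises.get().get())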

 

Message 12 of 23

I have some admittedly premature skepticism. On average, you'd need to Obtain and Release a "promise queue" every 4 ms. All those obtains and releases have to add some overhead, and I wonder how much that will eat into your overall number-crunching capability. It will probably be important to make them single-element queues (as mentioned by drjdpowell), but 250 new queues per second still just feels like a lot.

 

On the other hand, it's a really clean approach and it's clear that you could expand to any number of "heavy computation" loops operating in parallel.  They could be launched asynchronously at program init, or who knows, maybe you could even get parallel behavior by embedding a heavy computation while loop inside a Parallelized For Loop.

 

I'm definitely keeping this technique in mind for my future work.  In the stuff I do, I can often restrict my packet rates to 20 Hz or less, and I'm rarely dealing with sample rates above tens of kHz, so I'll have plenty of CPU available to absorb the queue obtain/release overhead.

 

Please post back if/when you get observations or benchmarking measurements related to both the overhead and any overall improvement due to parallelization.

 

 

-Kevin P

Message 13 of 23

The "Future Tokens" in Messenger Library, which are single-element queues under the hood, benchmark at 25 microseconds from creation to destruction.  That's including using them to pass a message.   Just creating an array of 100k of them and then destroying them (without use) takes 1 second (10 microseconds each).

Message 14 of 23

Some of these ideas have already been expressed here, but I will reiterate and maybe add to them, maybe not.

 

Your rate-limiting step is the FFT processing.

 

Here are some possible suggestions.

  • Do you need Mag & Phase, or Re & Im? If not, use the Power Spectrum VI. Mag & Phase or Re & Im FFTs result in twice the memory output, whereas the Power Spectrum VI lets you do an in-place memory swap with your input. (This is what I do; see below.) My data gets sent as a 2D array to the data processing loop, I do an in-place swap of the data with the Power Spectrum VI, and then I display only half of the array, since there is no need for the negative frequencies.

Snap6.png

     

  • Set a "Busy" Flag. (This only works for displaying the data.) For my application I am streaming data continuously at 2MSa/s per channel, up to 8 channels. The data is recorded/saved as time domain data. For the display, the user can choose FFT to see the data's frequency spectrum. While the data loop is crunching the FFT, I set a busy flag that gets sent to my instrument loop, if the instrument loop sees the busy flag, it does not send updated data to the data loop for processing. (If you need to save your FFTs then this step is no good.)
  • Decimate your display data. Use a Min/Max type of decimation so your peaks are still visible.
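Since LabVIEW VIs don't paste as text, here is a rough NumPy analogue of the two spectral ideas above: a real-input FFT already returns only the non-negative frequencies ("half the array"), and min/max decimation keeps peaks visible while drawing far fewer points. Names and sizes are mine, not mcduff's actual code.

import numpy as np

fs = 2_000_000                   # 2 MSa/s, as in the streaming case above
x = np.random.randn(100_000)     # stand-in for one acquired block

# rfft returns only the non-negative frequencies (N/2 + 1 bins),
# so the "display half the array" trick is built in.
spectrum = np.abs(np.fft.rfft(x)) ** 2 / len(x)     # power spectrum
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# Min/max decimation: split the spectrum into display bins and keep
# each bin's min and max so narrow peaks survive the decimation.
def minmax_decimate(y, n_bins=1000):
    chunks = np.array_split(y, n_bins)
    return np.concatenate([[c.min(), c.max()] for c in chunks])

display = minmax_decimate(spectrum)
print(len(spectrum), "->", len(display), "points to draw")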

mcduff

 

Message 15 of 23

@Kevin_Price
According to the help page for queues, the max queue size does not influence memory allocation. We shall see when I can try it out on the real system... However, I do agree it makes sense to explicitly set the max size to 1, just for the sake of it.


Note  When not running on an RT target, max queue size only limits the number of elements in the queue and does not preallocate the queue.

 

@drjdpowell

Thanks for the benchmark numbers. That is good to know!

 

@mcduff

Unfortunately, phase will eventually be of interest. However, I will keep this DVR use case in mind for the future. And regarding the busy flag: we only care about the frequency-domain information. Thanks for your input.

Message 16 of 23

@ConnerP wrote:
According to the help page for queues, the max queue size does not influence memory allocation. We shall see when I can try it out on the real system... However, I do agree it makes sense to explicitly set the max size to 1, just for the sake of it.

Note  When not running on an RT target, max queue size only limits the number of elements in the queue and does not preallocate the queue.

 

See https://forums.ni.com/t5/LabVIEW/Queue-Memory-Allocation-Weirdness/td-p/1988609

 

No idea if it is still true for newer versions. But SET IT TO 1.

 

mcduff

Message 17 of 23

In my youth (i.e., 4-5 years ago), I remember reading about how to make Queues "lean, mean, and fast".  I recall the suggestion to (a) create the queue with a finite (but large enough) size, (b) "stuff" it with data to pre-allocate the memory, and (c) flush it, after which it was Ready to Rumble.  I don't know if this advice still applies ...
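In Python terms, the sequence looks like the sketch below, though the hedge here is big: CPython's queue.Queue allocates per element on every put, so stuffing and flushing buys nothing there. The pattern only pays off where the queue implementation reuses its internal storage, as LabVIEW's reportedly does; this just shows the shape of the steps Bob describes.

import queue

CAPACITY = 1024

# (a) create with a finite (but large enough) size
q = queue.Queue(maxsize=CAPACITY)

# (b) "stuff" it with representative data to force allocation up front
template = bytes(8192)           # dummy payload shaped like the real data
for _ in range(CAPACITY):
    q.put(template)

# (c) flush it; the queue is now "Ready to Rumble"
while not q.empty():
    q.get()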

 

Bob Schor

Message 18 of 23

Time goes on and we learn some things along the way. Back when I was moving this project from prototyping to a more final stage, I discarded the promise queues in favor of a thread pool manager, which distributed tasks to worker threads that fed results back to the manager to distribute as it pleased. I was very gung-ho about learning software engineering theory and thought, "yeah, this works great with the 'do one thing well' philosophy"...

 

However, doing one thing well gets trumped by simplicity and by using standard tools for the job. There was in fact a really simple solution to this problem using built-ins and much less hocus-pocus. See the snippet below. The modulo_wait.vi is just a simple demonstrator operation to keep the snippet clean; the actual operation in my case is a ~100,000-sample FFT.

 

MultipleWorkers_MaintainOrder.png
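The snippet is an image, but the launch-in-order / collect-in-order idea maps closely onto a standard thread pool; here is my own Python rendering of it, with modulo_wait standing in for the heavy FFT as in the snippet (timings are my invention):

import random
import time
from concurrent.futures import ThreadPoolExecutor

def modulo_wait(i):
    """Demonstrator operation: random work time, returns its input."""
    time.sleep(random.uniform(0.01, 0.05))
    return i

with ThreadPoolExecutor(max_workers=4) as pool:
    # Launch all the work quickly, keeping the futures in launch order...
    futures = [pool.submit(modulo_wait, i) for i in range(10)]
    # ...then collect in that same order. If i=4 finishes before i=2,
    # its result simply waits in its future until 2 and 3 are consumed.
    for f in futures:
        print(f.result())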

Message 19 of 23

Thanks for sharing your solution, I may need to borrow it. ;)

 

A couple of quick questions or thoughts:

  1. All of the workers can be launched relatively quickly; however, the results will need to wait in the dequeue until the slowest earlier worker is finished. For example, if i=4 finishes before i=2 and i=3, that result cannot be dequeued until 2 and 3 are finished, since the dequeue is FIFO. I assume this doesn't affect anything in the long run, but have you had any problems with it?
  2. Your example VI, modulo_wait, has shared clone reentrant execution. In the real case, I assume you would want preallocated clone reentrant execution. Am I mistaken here?

Cheers,

mcduff

Message 20 of 23