
Parallel Consumers Maintain Data Order


@RavensFan Yes you described exactly what is going on. You understood correctly.

@Bob_Schor This system is on a literally brand-new computer: quad-core i7 with 8 threads, 16 GB RAM, PCIe SSD.

I'm not saying the processing is slow by any means. It handles each FFT in under 10 ms, but when there are 100k of them, plus other events going on, they pile up in the queue with only one FFT running at a time. I attempted to use the SVFA Zoom FFT, but we need to cover too large a frequency range (on the order of 200 kHz), and it actually takes about 3x as long as the normal FFT.

The DAQ loop is only concerned with DAQ. It reads the data and passes it off into a channel wire.

 

@drjdpowell That is an interesting approach to the futures idea! I will keep it in mind.

Message 11 of 23
Solution
Accepted by topic author ConnerP

Ah, so I realized that I misunderstood how Tag Channels work. I thought they were like normal queues, where the first block to read the data wins and removes the item from the queue, but they actually behave like a global variable. So my current implementation is wrong! Oops.
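(For anyone else who mixed these up, here is the difference sketched in Python terms; purely my illustration, nothing from LabVIEW itself. A queue hands each element to exactly one reader and removes it; a tag behaves like a global variable whose latest value any number of readers can sample without consuming it.)

import queue

# Queue semantics: the first reader to dequeue an item wins and removes it.
q = queue.Queue()
q.put("FFT result")
print(q.get())      # prints "FFT result"; the item is now gone, another reader would block
print(q.empty())    # True

# Tag semantics: a last-value store, like a global variable.
tag = {"value": None}        # stand-in for a Tag Channel
tag["value"] = "FFT result"
print(tag["value"])          # any reader sees the latest value
print(tag["value"])          # still there; reading does not consume it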

With that in mind, I decided to try out the promise method from @drjdpowell. It requires some overhead, but the implementation is really straightforward. I will probably give this a go next week when I am back in the lab.

 

Here is a snippet for the curious. This is just an MWE: I didn't bother with closing the non-promise queues or giving the worker loops stop conditions; it's just a rough example of the promise logic. I chose wait times long enough to see the effect, with an offset to show that the promise does what it needs to. In my actual case, the computation time for both loops is something like 5 ± 3 ms.

 Promises_Example.png
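Since the snippet is an image, here is a rough Python analogue of the same promise logic, with my own names and timings (and, like the snippet, no stop conditions for the worker loops):

import queue
import random
import threading
import time

work_q = queue.Queue()            # (index, promise) pairs for the workers
ordered_promises = queue.Queue()  # promises held in acquisition order

def worker():
    while True:
        i, promise = work_q.get()
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for the FFT
        promise.put(i * i)                      # fulfill the promise

# Two parallel "heavy computation" loops.
for _ in range(2):
    threading.Thread(target=worker, daemon=True).start()

# Producer: create one single-element queue (the "promise") per data
# block and enqueue it in order, before the work is done.
for i in range(8):
    promise = queue.Queue(maxsize=1)
    ordered_promises.put(promise)
    work_q.put((i, promise))

# Consumer: dequeue promises in order; each get() blocks until that
# block's worker has finished, so output order matches input order
# regardless of which worker finishes first.
for i in range(8):
    print(i, ordered_promises.get().get())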

 

Message 12 of 23

I have some admittedly premature skepticism. On average, you'd need to Obtain and Release a "promise queue" every 4 ms. All those obtains and releases have to add some overhead, and I wonder how much that will eat into your overall number-crunching capability. It will probably be important to make them single-element queues (as mentioned by drjdpowell), but 250 new queues per second still just feels like a lot.

 

On the other hand, it's a really clean approach and it's clear that you could expand to any number of "heavy computation" loops operating in parallel.  They could be launched asynchronously at program init, or who knows, maybe you could even get parallel behavior by embedding a heavy computation while loop inside a Parallelized For Loop.

 

I'm definitely keeping this technique in mind for my future work.  In the stuff I do, I can often restrict my packet rates to 20 Hz or less, and I'm rarely dealing with sample rates above tens of kHz, so I'll have plenty of CPU available to absorb the queue obtain/release overhead.

 

Please post back if/when you get observations or benchmarking measurements related to both the overhead and any overall improvement due to parallelization.

 

 

-Kevin P

Message 13 of 23

The "Future Tokens" in Messenger Library, which are single-element queues under the hood, benchmark at 25 microseconds from creation to destruction.  That's including using them to pass a message.   Just creating an array of 100k of them and then destroying them (without use) takes 1 second (10 microseconds each).

Message 14 of 23

Some of these ideas have already been expressed here, but I will reiterate and maybe add to them, maybe not.

 

Your rate-limiting step is the FFT processing.

 

Here are some possible suggestions.

  • Do you need Mag & Phase, or Re & Im? If not, use the Power Spectrum VI. Mag & Phase or Re & Im FFTs result in twice the memory output, whereas the Power Spectrum VI lets you do an in-place memory swap with your input. (This is what I do; see below.) My data gets sent as a 2D array to the data processing loop, I do an in-place swap of the data with the Power Spectrum VI, and then I display only half of the array, since there is no need for the negative frequencies.

Snap6.png

     

  • Set a "Busy" Flag. (This only works for displaying the data.) For my application I am streaming data continuously at 2MSa/s per channel, up to 8 channels. The data is recorded/saved as time domain data. For the display, the user can choose FFT to see the data's frequency spectrum. While the data loop is crunching the FFT, I set a busy flag that gets sent to my instrument loop, if the instrument loop sees the busy flag, it does not send updated data to the data loop for processing. (If you need to save your FFTs then this step is no good.)
  • Decimate your display data. Use a Min/Max type of decimation so your peaks are still visible.
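Since LabVIEW VIs don't paste as text, here is a rough NumPy analogue of the two spectral ideas above: a real-input FFT already returns only the non-negative frequencies ("half the array"), and min/max decimation keeps peaks visible while drawing far fewer points. Names and sizes are mine, not mcduff's actual code.

import numpy as np

fs = 2_000_000                   # 2 MSa/s, as in the streaming case above
x = np.random.randn(100_000)     # stand-in for one acquired block

# rfft returns only the non-negative frequencies (N/2 + 1 bins),
# so the "display half the array" trick is built in.
spectrum = np.abs(np.fft.rfft(x)) ** 2 / len(x)     # power spectrum
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# Min/max decimation: split the spectrum into display bins and keep
# each bin's min and max so narrow peaks survive the decimation.
def minmax_decimate(y, n_bins=1000):
    chunks = np.array_split(y, n_bins)
    return np.concatenate([[c.min(), c.max()] for c in chunks])

display = minmax_decimate(spectrum)
print(len(spectrum), "->", len(display), "points to draw")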

mcduff

 

Message 15 of 23

@Kevin_Price
According to the help page for queues, the max queue size does not influence memory allocation. We shall see when I can try it out on the real system... However, I do agree it makes sense to explicitly set the max size to 1, just for the sake of it.


Note  When not running on an RT target, max queue size only limits the number of elements in the queue and does not preallocate the queue.

 

@drjdpowell

Thanks for the benchmark numbers. That is good to know!

 

@mcduff

Unfortunately, phase will eventually be of interest. However, I will keep this DVR use case in mind for the future. And regarding the busy flag: we only care about the frequency-domain information. Thanks for your input.

Message 16 of 23

@ConnerP wrote:
According to the help page for queues, the max queue size does not influence memory allocation. We shall see when I can try it out on the real system... However, I do agree it makes sense to explicitly set the max size to 1, just for the sake of it.

Note  When not running on an RT target, max queue size only limits the number of elements in the queue and does not preallocate the queue.

 

See https://forums.ni.com/t5/LabVIEW/Queue-Memory-Allocation-Weirdness/td-p/1988609

 

No idea if it is still true for newer versions. But SET IT TO 1.

 

mcduff

Message 17 of 23

In my youth (i.e., 4-5 years ago), I remember reading about how to make Queues "lean, mean, and fast".  I recall the suggestion to (a) create the queue with a finite (but large enough) size, (b) "stuff" it with data to pre-allocate the memory, and (c) flush it, after which it was Ready to Rumble.  I don't know if this advice still applies ...
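In Python terms, the sequence looks like the sketch below, though the hedge here is big: CPython's queue.Queue allocates per element on every put, so stuffing and flushing buys nothing there. The pattern only pays off where the queue implementation reuses its internal storage, as LabVIEW's reportedly does; this just shows the shape of the steps Bob describes.

import queue

CAPACITY = 1024

# (a) create with a finite (but large enough) size
q = queue.Queue(maxsize=CAPACITY)

# (b) "stuff" it with representative data to force allocation up front
template = bytes(8192)           # dummy payload shaped like the real data
for _ in range(CAPACITY):
    q.put(template)

# (c) flush it; the queue is now "Ready to Rumble"
while not q.empty():
    q.get()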

 

Bob Schor

Message 18 of 23

Time goes on and we learn some things along the way. Back when I was moving this project from prototyping to a more final stage, I discarded the promise queues in favor of a thread pool manager, which distributed tasks to worker threads that fed results back to the manager to distribute as it pleased. I was very gung-ho about learning software engineering theory and thought, "yeah, this works great with the 'do one thing well' philosophy"...

 

However, doing one thing well gets trumped by simplicity and by using standard tools for the job. There was in fact a really simple solution to this problem using built-ins and much less hocus-pocus. See the snippet below. The modulo_wait.vi is just a simple demonstrator operation to keep the snippet clean; the actual operation in my case is a ~100,000-sample FFT.

 

MultipleWorkers_MaintainOrder.png
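The snippet is an image, but the launch-in-order / collect-in-order idea maps closely onto a standard thread pool; here is my own Python rendering of it, with modulo_wait standing in for the heavy FFT as in the snippet (timings are my invention):

import random
import time
from concurrent.futures import ThreadPoolExecutor

def modulo_wait(i):
    """Demonstrator operation: random work time, returns its input."""
    time.sleep(random.uniform(0.01, 0.05))
    return i

with ThreadPoolExecutor(max_workers=4) as pool:
    # Launch all the work quickly, keeping the futures in launch order...
    futures = [pool.submit(modulo_wait, i) for i in range(10)]
    # ...then collect in that same order. If i=4 finishes before i=2,
    # its result simply waits in its future until 2 and 3 are consumed.
    for f in futures:
        print(f.result())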

Message 19 of 23

Thanks for sharing your solution, I may need to borrow it. ;)

 

A couple of quick questions or thoughts:

  1. All of the workers can be launched relatively quickly; however, the results will need to wait in the dequeue until the slowest earlier worker is finished. For example, if i=4 finishes before i=2 and i=3, that result cannot be dequeued until 2 and 3 are finished, since the dequeue is FIFO. I assume this doesn't affect anything in the long run, but have you had any problems with it?
  2. Your example VI, modulo_wait, has shared clone reentrant execution. In the real case, I assume you would want preallocated clone reentrant execution. Am I mistaken here?

Cheers,

mcduff

Message 20 of 23