07-26-2012 11:03 AM
I am trying to ascertain just how fast LabVIEW RT can perform a simple analog input to analog output loopback. The purpose of this test is to benchmark the system and discover whether I need to go the FPGA route for my application.
First the hardware that I am using:
RMC-8354 with RTOS installed (PCIe to PXIe-8388 card)
PXIe-1078 chassis
PXIe-6358 X-Series MIO card
PXI-6733 AO card
The external physical connections are as follows:
PXI-6733 Output Channel AO0 --> PXIe-6358 Input Channel AI0
PXIe-6358 Output Channel AO0 --> PXIe-6358 Input Channel AI1
In as concise a manner as I can muster, here is a description of the VI I am using (included below):
1. Set up analog inputs and timing (channels AI0 and AI1 on PXIe-6358)
2. Set up analog outputs and timing (channels AO0 on PXIe-6358 and AO0 on PXI-6733)
3. Route analog input sample clock to analog outputs
4. Route analog input sample trigger to analog outputs
5. Create Gaussian White Noise signal to be played out of channel AO0 on PXI-6733
6. Set channel AO0 on PXIe-6358 to disable regeneration and to write relative to the current position in the buffer
7. Start DAQmx task on channel AO0 on PXI-6733
8. Start DAQmx task on channel AO0 on PXIe-6358
9. Create a binary file for saving all of the data for post processing (and write the actual sample rate to the file)
10. Start DAQmx task on channels AI0 and AI1 on PXIe-6358 and set task to read all available samples
11. Begin timed loop
12. Within the timed loop, this is the intended process:
A. Read all available samples
B. Parse out samples due to channel AI1 and immediately send them to AO0 on PXIe-6358
C. Build an array containing the relevant samples from channel AO0 on PXI-6733 and channels AI0 and AI1 on PXIe-6358
D. Write array to binary file for post processing
13. End timed loop after number of iterations based on acquisition time, actual sample rate, and timed loop period
14. Close binary file and end DAQmx tasks
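In case a textual version helps, steps 1 through 10 map roughly onto the following NI-DAQmx C API calls. This is only a sketch: error handling is omitted, the second AO task on the PXI-6733 is left out, and the device names, rate, and buffer sizes are placeholders rather than my actual settings.

    #include <NIDAQmx.h>

    int main(void)
    {
        TaskHandle ai = 0, ao = 0;
        float64    data[2000];              /* 2 channels x up to 1000 samples */
        float64    zeros[1000] = {0};
        int32      read = 0;

        /* Steps 1-2: continuous AI (ai0:1) and AO (ao0) tasks */
        DAQmxCreateTask("", &ai);
        DAQmxCreateAIVoltageChan(ai, "PXI1Slot2/ai0:1", "", DAQmx_Val_RSE,
                                 -10.0, 10.0, DAQmx_Val_Volts, NULL);
        DAQmxCfgSampClkTiming(ai, "", 10000.0, DAQmx_Val_Rising,
                              DAQmx_Val_ContSamps, 1000);

        DAQmxCreateTask("", &ao);
        DAQmxCreateAOVoltageChan(ao, "PXI1Slot2/ao0", "",
                                 -10.0, 10.0, DAQmx_Val_Volts, NULL);
        /* Steps 3-4: slave the AO task to the AI sample clock */
        DAQmxCfgSampClkTiming(ao, "/PXI1Slot2/ai/SampleClock", 10000.0,
                              DAQmx_Val_Rising, DAQmx_Val_ContSamps, 1000);

        /* Step 6: disable regeneration, write relative to current position */
        DAQmxSetWriteRegenMode(ao, DAQmx_Val_DoNotAllowRegen);
        DAQmxSetWriteRelativeTo(ao, DAQmx_Val_CurrWritePos);

        /* Prime the AO buffer, then start AO before AI (steps 7-10) */
        DAQmxWriteAnalogF64(ao, 1000, 0, 1.0, DAQmx_Val_GroupByChannel,
                            zeros, NULL, NULL);
        DAQmxStartTask(ao);
        DAQmxStartTask(ai);

        for (int i = 0; i < 10000; i++) {
            /* Step 12A: read all available samples (DAQmx_Val_Auto) */
            DAQmxReadAnalogF64(ai, DAQmx_Val_Auto, 1.0,
                               DAQmx_Val_GroupByChannel, data, 2000,
                               &read, NULL);
            /* Step 12B: with GroupByChannel, ai1's block starts at
             * data[read]; echo it straight back out of ao0. If this loop
             * falls behind, the no-regen AO buffer starves, which is
             * exactly the flat-lining behavior I describe below. */
            if (read > 0)
                DAQmxWriteAnalogF64(ao, read, 0, 1.0,
                                    DAQmx_Val_GroupByChannel, &data[read],
                                    NULL, NULL);
            /* Steps 12C-D: build the logging array, write to file ... */
        }

        DAQmxStopTask(ai);  DAQmxClearTask(ai);
        DAQmxStopTask(ao);  DAQmxClearTask(ao);
        return 0;
    }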
So here are my questions and concerns:
1. At low sample rates (1kHz and lower) and loop periods above 1ms, the relationship between AI0 and AI1 appears stable. My concern is not that there is a delay between the channels (since that is expected), but rather that the delay is constant during the entire acquisition. For modestly higher sample rates (on the order of several kHz), the process appears relatively stable for 1ms loop periods and short duration tests (10 seconds or less). For anything on the order of a 10 kHz sample rate or above and at nearly any loop period, the acquisition is only good for a second or two, at most. After that it appears that samples cannot be delivered fast enough to the analog output (AO0 on PXIe-6358), and the delay between the two analog input channels worsens over time. The specific behavior observed is that several samples taken on AI1 will appear to be constant, indicating that AO0 on PXIe-6358 is maintaining the last voltage value from its buffer until more samples are available.
2. Since the VI I have tried to set up is conceptually fairly simple, is the way I have constructed everything the ideal case for the RTOS? That is to say, when I find a stumbling block for the VI in terms of maximum sample rate or minimum loop period, should I assume that I have reached the limitation of the hardware/software that I am using?
3. Is it safe to say that an appropriately configured FPGA setup mirroring the tasks I have mentioned here would be able to perform considerably faster?
For reference, the target application requires something on the order of 10 inputs and 5 outputs at a sample rate of no less than 5kHz. Consider the processing between the inputs and outputs to be on the order of an FxLMS routine.
I realize that it is huge (and please excuse the vertical lines where I stitched everything together), but here is the aforementioned VI:
07-29-2012 11:11 PM
Hello Razor,
The primary constraint here is the processor time consumed in each timed loop iteration. In this case we are doing the following operations every time the loop iterates:
?) Acquire whatever samples are currently available
?) Copy these samples into two new memory locations
?) Use one of these copies to determine how many samples we just grabbed
?) Add this size to the shift register
?) Index a pre-built array at the current location
?) Append new data to this array
?) Pull sample #1 out of the input and write to output
?) Append everything to a binary file
?) Convert my index integer into a double, compare to a set value
?) Pull a value from a front panel indicator (which won't exist in the compiled RT application)
?) Repeat or stop depending on the result of the last two steps
I started off with numbers, but replaced them with question marks because there isn't really a way to determine which operation will happen in which order in the sequence as it is currently implemented. For instance, do we write to file before or after we update the output? There's really no way to know. Your testing indicates that this works well enough below 1 kHz, but I'm not surprised that we start getting bogged down past that speed. Array manipulation, in particular, is fairly expensive in terms of processor time, and we are doing a decent amount of it. I feel strongly that you would see a performance increase if the array operations were eliminated, optimized, or (ideally) offloaded to another loop/thread entirely. Additionally, you'll want to do your logging in a parallel loop.
You may also want to look into buffering your inputs, performing the processing in a parallel loop, and outputting a slightly delayed but more consistent signal. With a quad-core controller, the more you can parallelize/pipeline your operations, the better off you'll be in terms of iteration time.
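If it helps to see the shape of that pattern outside of a block diagram, here is a minimal producer/consumer sketch in C with POSIX threads as a stand-in for the two parallel loops: the acquisition side just copies fixed-size blocks into a ring buffer, and the logging/processing side drains it on another thread. The block size and queue depth are arbitrary placeholders.

    #include <pthread.h>
    #include <string.h>

    #define BLOCK 256   /* fixed-size blocks: no reallocation in the loop */
    #define DEPTH 64    /* ring buffer depth */

    static double ring[DEPTH][BLOCK];
    static int head = 0, tail = 0, count = 0;
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

    /* Acquisition loop calls this: copy the block in and return. */
    static void push_block(const double *samples)
    {
        pthread_mutex_lock(&mtx);
        while (count == DEPTH)
            pthread_cond_wait(&not_full, &mtx);
        memcpy(ring[head], samples, sizeof ring[head]);
        head = (head + 1) % DEPTH;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mtx);
    }

    /* Logging/processing loop: runs on another core, off the critical path. */
    static void *consumer(void *arg)
    {
        double block[BLOCK];
        for (;;) {
            pthread_mutex_lock(&mtx);
            while (count == 0)
                pthread_cond_wait(&not_empty, &mtx);
            memcpy(block, ring[tail], sizeof block);
            tail = (tail + 1) % DEPTH;
            count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&mtx);
            /* ...write the block to disk or run the processing here... */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        double block[BLOCK] = {0};
        pthread_create(&t, NULL, consumer, NULL);
        for (int i = 0; i < 1000; i++)
            push_block(block);          /* stand-in for the DAQ read loop */
        return 0;
    }

The acquisition side only ever pays for a memcpy and a brief mutex hold, so its iteration time stays nearly constant no matter how long the logging takes.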
Whether you can increase the loop rate by 5x is questionable and will depend on the algorithm you plan to use, but I think it would be worth trying to optimize this code before jumping straight into FPGA programming. Definitely try benchmarking different portions of your code using the basic timing functions and/or frames within your timed loop to determine where you should focus your efforts. If you're interested, the following link also has a good deal of information on benchmarking and optimization on RT systems:
Advanced LabVIEW Real-Time Development Resources and Benchmarks
http://www.ni.com/white-paper/5686/en
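As a textual illustration of that framed-benchmarking idea, timestamps around each section of the iteration look something like this (the section bodies are placeholders):

    #include <stdio.h>
    #include <time.h>

    static double us_between(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }

    int main(void)
    {
        struct timespec t0, t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* section 1: read + parse samples (placeholder) */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        /* section 2: write output + log (placeholder) */
        clock_gettime(CLOCK_MONOTONIC, &t2);
        printf("section 1: %.1f us, section 2: %.1f us\n",
               us_between(t0, t1), us_between(t1, t2));
        return 0;
    }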
Regards,
08-01-2012 06:00 AM
Hi Tom,
Thanks for your reply!
It seems like the general thought process behind RT applications is to parallelize whenever possible to avoid missing deadlines. My naïveté guided me to the simplest methodology I could think of (i.e. sequential operations) which, more and more, appears to be going the way of the dinosaur.
I will endeavor to pursue the suggestions you have put forth. For clarity I will reiterate:
-Move data logging to a separate loop (inevitably utilizing RT FIFO queues, redundant as that term is, between loops)
-Move data processing to a separate loop (using inputs to determine outputs, again utilizing RT FIFO queues)
Ultimately it sounds like 3 separate loops running concurrently on separate cores should yield the greatest boost in performance, for this simple example at least.
My thought process behind taking all available samples from the input was to avoid any jitter associated with the timed loop. If I had to take one or two extra samples here, or one or two fewer samples there, I figured the "all available samples" parameter would adjust the input size (as well as the output size) accordingly.
Thank you for the link you provided. I have been filling my favorites with all of the NI white papers I can find to expand my collection of references.
Thanks again,
Philip
08-02-2012 01:06 PM
Hello Philip,
You're correct about utilizing all available samples; however, in this case you'll also want to consider the jitter that you may induce by performing array resize operations on every read. It may only be a difference of a few elements, but dynamically allocating arrays of different sizes may slow things down by a few ticks.
As for the suggestions you reiterated, I think the first thing to try would be offloading the logging, and as long as the logging itself doesn't need to be deterministic, standard queues may suffice and would involve less overhead. You may want to try benchmarking iteration time and jitter when using RT FIFOs and pre-allocated queues to determine which would serve best in your application. Another thing to note is that when passing elements into either RT FIFOs or queues, you'll want to enqueue identically-sized arrays to prevent memory resize and allocation operations.
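To make the identically-sized-arrays point concrete, a small staging buffer like the one below repackages variable-length reads into fixed-size queue elements. CHUNK and enqueue_fixed() are placeholders for your element size and the actual RT FIFO/queue write:

    #include <stddef.h>

    #define CHUNK 256                    /* fixed queue element size (placeholder) */

    void enqueue_fixed(const double chunk[CHUNK]);   /* RT FIFO / queue write */

    static double stage[CHUNK];
    static size_t fill = 0;

    /* n varies on every DAQ read; only full, identically-sized chunks
     * are ever enqueued, so the queue never resizes its elements. */
    void stage_samples(const double *in, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            stage[fill++] = in[i];
            if (fill == CHUNK) {
                enqueue_fixed(stage);
                fill = 0;
            }
        }
    }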
Also, I would recommend benchmarking the processing you are going to implement later without any hardware I/O involved (static arrays or something already in memory) to determine if a 0.2 ms turnaround is feasible, as this may be the deciding factor in going to FPGA or not.
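As a rough example of that sort of offline benchmark, the sketch below times an LMS-style dot product and tap update over static arrays and reports the average per-iteration cost against the 200 µs budget that 5 kHz implies. The tap count and constants are arbitrary placeholders; this is not a full FxLMS implementation:

    #include <stdio.h>
    #include <time.h>

    #define TAPS  128
    #define ITERS 100000L

    int main(void)
    {
        static double w[TAPS], x[TAPS];
        double mu = 1e-3, e = 0.1;
        struct timespec t0, t1;

        for (int k = 0; k < TAPS; k++)
            x[k] = 0.001 * k;                   /* dummy reference signal */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; i++) {
            double y = 0.0;
            for (int k = 0; k < TAPS; k++)      /* filter output */
                y += w[k] * x[k];
            for (int k = 0; k < TAPS; k++)      /* LMS weight update */
                w[k] += mu * e * x[k];
            e = 0.1 - y;                        /* dummy error feedback */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double total = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("avg per iteration: %.3f us (budget: 200 us at 5 kHz), w[0]=%g\n",
               total / ITERS * 1e6, w[0]);
        return 0;
    }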
Overall, it definitely seems like you're headed in the right direction, and asking the right questions - I hope my comments have been helpful, and best of luck!
08-02-2012 03:42 PM
Philip,
I found the cRIO Developers Guide very useful for best practices on implementing deterministic and non-deterministic processes, and for the appropriate communication methods between the different process types and locations.
Andy