High-Speed Digitizers


Help with Scope Acq/processing speed

Hello,

 

I'm beginning to realize that I might have a big problem with a data acquisition application that I'm currently developing... I'm using several PXI-5922 digitizers to acquire different RF bands. The fastest configuration I'm using is 3 channels running at 7.5 MSps (real). My application requires that the data be frequency translated, filtered, and decimated after acquisition for each channel. After the processing section, the data is blasted out over 10GbE via UDP multicast to consumers of the data.
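
For context, the last stage of the chain (the UDP multicast output) looks roughly like the following sketch in Python/numpy terms. The multicast group, port, payload size, and function name are placeholders rather than values from my actual code; each processed block has to be split into datagrams that fit under the UDP size limit.

import socket
import numpy as np

MCAST_GROUP = "239.1.1.1"   # hypothetical multicast group
MCAST_PORT = 5000           # hypothetical port
PAYLOAD = 8192              # bytes per datagram; must stay under the UDP/IP limit

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)

def send_block(iq_block: np.ndarray) -> None:
    """Send one processed block of complex samples as a series of datagrams."""
    raw = iq_block.astype(np.complex64).tobytes()
    for offset in range(0, len(raw), PAYLOAD):
        sock.sendto(raw[offset:offset + PAYLOAD], (MCAST_GROUP, MCAST_PORT))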

 

Unfortunately, I'm running into huge data-block-loss issues when running just a single channel, let alone all three. I really need some advice and options for speeding up the processing so that it can keep up with reading data from the hardware. This application is time critical and requires high data reliability. Please see the attached figures, snippets, and details below; I'll try to provide as much information as I'm allowed to so I can paint a clear picture of my hardware/software setup.

 

Setup:

  • Dell R5400 workstation, dual Intel E5405 quad-core processors, 4GB memory, Win7 32-bit, LV2010
  • PXIe-1065 chassis with MXIe link (PXIe-8370 to PCIe-8371) 
  • Intel X520-DA2 10GbE Network Card

[Attached figure]

 

Here is the complex modulator (frequency translation to create complex data). I know this code works correctly, since it has already been used successfully in other applications; I'm providing it as a reference so that anyone can suggest how to make this section faster than the current sample-by-sample implementation. This piece of code, along with the DFD M-Rate continuous processing VI, gets called at a very high rate at the 7.5 MSps sample rate I'm running.

 

[Attached snippet: complex_modulator block diagram]
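
For anyone thinking about it in textual terms, the operation is roughly the following Python/numpy sketch of whole-block mixing rather than a sample-by-sample loop; the function name, shift frequency, and sign convention are illustrative, not taken from the VI.

import numpy as np

def mix_block(real_block: np.ndarray, f_shift: float, fs: float) -> np.ndarray:
    """Translate a real block by f_shift Hz in one vectorized pass."""
    n = np.arange(real_block.size, dtype=np.float64)
    lo = np.exp(-2j * np.pi * f_shift * n / fs)    # local oscillator for this block
    return (real_block * lo).astype(np.complex64)  # complex baseband output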

 

As a note, I've also gone through several different procedures to turn off just about anything in Windows 7 that I don't need, to help speed up performance and free up resources. All subVIs in the processing code have to be reentrant to process all RF channels simultaneously, so I don't believe I can set them to subroutine priority or use inlining. However, I have set all processing subVIs to time-critical priority and have turned off automatic error handling as well as debugging.

 

The main things I have room to work with and optimize are the NI-Scope fetch block size, the processing block size, the frequency-translation execution speed, and the number of coefficients in the DFD halfband filter. I've already relaxed the halfband filter to less stopband attenuation and a larger roll-off factor than I would prefer, so I'm reluctant to reduce the coefficient count any further.

 

I apologize for the very long post, but I'm getting desperate trying to figure out how to reliably acquire my data at these high sample rates. Our hardware/time budget most likely won't stretch to any FPGA offload development, but I realize that if that turns out to be the only option, it may be the way we have to go for the high-sample-rate channels.

 

Thank you for any help in advance,


Tim Sileo



Tim Sileo
RF Applications Engineer
National Instruments



You don’t stop running because you get old. You get old because you stop running. -Jack Kirk, From "Born to Run" by Christopher McDougall.
Message 1 of 10

Given your hardware setup, there are lots of opportunities for optimization.

 

 

  1. You have eight processor cores; do you know whether you are actually using them all (use Task Manager to find out)?  You can parallelize this operation in a number of ways.  You can split your analysis into one loop per data stream from the scopes (use .VITs and launch multiple instances using VI server - see this tutorial or this nugget for examples).  You can split the analysis of single chunks into multiple loops.  You can split acquisition of each scope into separate loops, although bus contention may cause you issues even though your data rates are relatively low by PCIe standards.
  2. It appears you can rewrite your analysis without the FOR loop. Your functions will all take array inputs.  Simply remove the FOR loop and the data will probably adapt correctly.
  3. Do you need to bundle your real and imaginary parts into a complex double?  You should be able to pick up some speed by keeping these as separate arrays.  You separate them and recombine them in your analysis.
  4. You do not need the WHILE loop.  Use a feedback node instead of the shift register.  It will give you a small performance boost.  If you feel uncomfortable with this, use a single cycle FOR loop instead.  It is faster than the single-cycle WHILE loop.
  5. You should definitely run some benchmarks with different block sizes (a rough harness for that kind of sweep is sketched after this list).  Ages ago, when I was playing with this type of thing, the best block size for fetch speed was about 300,000 bytes.  However, that was with 5911s and 5112s on a PCI bus.  You will need to play with this for best transfer speed.  Note that the best transfer size for one stage is not necessarily the best for another; you will probably need to buffer each stage and use the optimal size for each.
  6. Acquire your data as integers instead of DBL or complex.  This will halve the bandwidth requirements of your system.  It appears you can do much of your analysis with integers, as well (multiplication, addition, and subtraction are exact operations on integers), unless you have overflow issues.  You can keep scaling factors as separate variables and only use them when you multicast the results (or simply cast the integers and the scaling factors).
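
As a rough illustration of point 5, a harness like the following (Python/numpy, with a stand-in for the real fetch/translate/filter/decimate chain; the block sizes and sample rate are just examples) will show where the per-block cost drops below the per-block time budget:

import time
import numpy as np

def process_block(block: np.ndarray) -> np.ndarray:
    # Stand-in for the real translate/filter/decimate chain
    return np.fft.ifft(np.fft.fft(block)[: block.size // 2])

fs = 7.5e6
for block_size in (8192, 16384, 32768, 65536, 131072, 262144):
    data = np.random.randn(block_size).astype(np.float32)
    reps = 50
    t0 = time.perf_counter()
    for _ in range(reps):
        process_block(data)
    per_block = (time.perf_counter() - t0) / reps
    budget = block_size / fs   # time that one block of samples represents
    print(f"{block_size:>7} samples: {per_block * 1e3:6.2f} ms per block "
          f"(budget {budget * 1e3:6.2f} ms)")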

Good luck.  If you have any questions, let us know.

 

Message 2 of 10

Hi DFGray,

 

I apologize for the long delay in responding. I've been trying one thing after another to get this working: a mix of the things you've suggested as well as other processing enhancements. Unfortunately, I'm still not keeping up with the data rate, and that's while focusing on just one scope channel. Eventually I will need to scale up to 3 channels at the 7.5 MSps real data rate.

 

In answer to your suggestions:

 

1. The acquisition, frequency translation, and filtering/decimation are all running in separate threads, and Task Manager does indeed show this. I haven't gone further and manually assigned processor affinity or anything like that, but I can definitely tell from CPU usage which cores are running the processing code. As a note, I have changed my filtering and decimation step to use FFT -> Array Subset of the bins of interest multiplied by FFT(FIR coefficients) -> IFFT back to the time domain. This implementation was much faster than the Digital Filter Design toolkit's M-Rate filtering VI, which uses convolution in the time domain. This VI is also pipelined to increase the speed. I've attached the polymorphic VI I created for reference; I'm currently using FreqDomainFilterAndDecimate(CSG).vi in my code. In order to open the VI you will also need the LabVIEW High Performance Analysis Library 2.0 found here: http://decibel.ni.com/content/docs/DOC-12086. I found the HPAL while searching for a way to speed up the FFT/IFFT process, and it definitely helped.
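
In rough Python/numpy terms (just to illustrate the structure; the decimation factor, filter length, and cutoff are placeholders, and the block-edge handling such as overlap-save that the pipelined VI deals with is omitted here), the frequency-domain filter-and-decimate step looks like this:

import numpy as np

def fir_fft(num_taps: int, cutoff: float, n_fft: int) -> np.ndarray:
    """Windowed-sinc lowpass taps, transformed once per block size."""
    taps = np.sinc(2 * cutoff * (np.arange(num_taps) - (num_taps - 1) / 2))
    taps *= np.hamming(num_taps)
    taps /= taps.sum()
    return np.fft.fft(taps, n_fft)

def filter_and_decimate(block: np.ndarray, h_fft: np.ndarray, decim: int) -> np.ndarray:
    spec = np.fft.fft(block) * h_fft                           # filtering = multiply in frequency
    keep = block.size // decim                                 # keep only the bins of interest
    sub = np.concatenate((spec[: keep // 2], spec[-(keep // 2):])) / decim
    return np.fft.ifft(sub)                                    # shorter IFFT = decimated output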

 

2 & 4. I kept the FOR loop in the complex_modulator VI because of the sample-by-sample generation of the phase multiplier (the frequency-mixer phase value for each input data sample). If you open the attached complex_modulator(SGL).vi you will see some enhancements I made with memory allocation, as well as replacing the WHILE loop with a feedback node per your suggestion.
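
In textual terms, the feedback-node idea is just a phase value carried between calls so consecutive blocks stay phase-continuous. A stateful variant of the earlier mixer sketch (class and variable names are illustrative, not from the VI):

import numpy as np

class BlockMixer:
    def __init__(self, f_shift: float, fs: float):
        self.phase_inc = 2.0 * np.pi * f_shift / fs   # radians per sample
        self.phase = 0.0                              # carried state, like the feedback node

    def __call__(self, real_block: np.ndarray) -> np.ndarray:
        n = np.arange(real_block.size)
        lo = np.exp(-1j * (self.phase + self.phase_inc * n))
        self.phase = (self.phase + self.phase_inc * real_block.size) % (2.0 * np.pi)
        return (real_block * lo).astype(np.complex64)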

 

3 & 6. The complex modulator is what does the real-to-complex conversion required in my design. In the new attached version of complex_modulator(SGL).vi I convert the data to CSG before doing the PtByPt phase multiplication. Since I'm doing (I+jQ)*e^(jTheta), the Q value is just 0 for the real data input to this VI. If I don't convert to CSG, though, I get a coercion dot in the FOR loop that actually made complex_modulator(SGL).vi take longer to execute. I did reduce the acquisition to the SGL datatype instead of DBL, and I also changed my code so that it stays CSG after the real-to-complex conversion. This did help speed things up a little. My A/D needs 24 bits of resolution, so I32 would have to be the integer precision used. It would also be difficult to keep integer data throughout the processing code, since the FFT functions don't have integer selections, only SGL, DBL, CSG, and CDB.

 

5. I actually did find a sweet spot around an acquisition block size of 131072 real samples at 7.5 MSps and a processing block size of 16384 real samples. The strange part is that, with these block sizes, I can keep up with one 7.5 MSps channel but not with two, and I eventually need to do three! Also, maybe this is documented somewhere, but keeping up with even that one channel only happens in the LV2010 development environment. If I build my top-level application into an executable and run it, the processing falls behind... This was strange because, in my experience, executables have typically run faster than the development environment. As an initial guess I thought this might be due to the new compiler optimizations that went into LV2010 but don't necessarily apply to executables. Just a wild guess, though.

 

I've also continued to turn off additional Win7 services and background processes as I find them on the internet. I should have kept a list of the things I've turned off for reference but, unfortunately, I forgot to do that initially, so I can't tell you exactly what I've tweaked in Windows. The more recent tweak I've been trying is reducing the Windows scheduler tick period (higher resolution) using the timeBeginPeriod function in winmm.dll, so that my processing code gets a CPU slice sooner than the default 15.6 ms tick. Unfortunately, that didn't seem to change anything, but I'm not entirely sure I'm using it correctly.
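
For reference, this is the call I mean, shown here from Python via ctypes rather than from LabVIEW (Windows only; the surrounding code is a placeholder):

import ctypes

winmm = ctypes.WinDLL("winmm")
if winmm.timeBeginPeriod(1) == 0:       # request a 1 ms tick; 0 means TIMERR_NOERROR
    try:
        pass                            # ... run the time-critical acquisition/processing here ...
    finally:
        winmm.timeEndPeriod(1)          # always pair with timeEndPeriod when done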

 

At this point it is looking like a hardware redesign might be needed but I'm still trying to avoid it if at all possible.

 

Thank you for the great advice,

 

Tim Sileo



Message 3 of 10

Also, here is some additional timing information that was provided in an email outside of this post. I just wanted to put it here as additional reference.

 

After first speeding up the lower-level code significantly (complex_modulator and FreqDomainFilterAndDecimate), I used the PM (Performance and Memory) utility to get an idea of the total time it takes to process an entire acquisition block. The example I originally used was an acquisition block size of 32768 samples and a processing block size of 8192 samples. At 7.5 MSps, 32768 samples gives my processing code ~4.369 ms to execute. Dividing (Total Time)/(# Runs) in the PM utility for my top-level processing VI gives an average execution time around ~3.0 ms (sometimes under that). So, on the surface it appears that I'm processing fast enough. However, the longest-duration column indicates 15.6 ms, and there is no insight into how often that occurs; I believe this is thread-execution jitter from running on Windows. I have a 500-element queue between the digitization code and the processing code to absorb some of the worst-case processing blocks. Unfortunately, it appears that even though my average throughput is faster than my requirement, I'm not catching up after a "longest duration" execution occurs.
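
A quick back-of-envelope on those numbers (assuming the measured average and worst case above really hold):

fs, block = 7.5e6, 32768
budget = block / fs                  # 4.369 ms of data per acquisition block
avg, worst = 3.0e-3, 15.6e-3         # measured average and worst-case processing time

backlog = worst - budget             # deficit created by one worst-case block (~11.2 ms)
headroom = budget - avg              # time recovered on each subsequent block (~1.37 ms)
blocks_to_recover = backlog / headroom
print(f"recover in ~{blocks_to_recover:.1f} blocks (~{blocks_to_recover * budget * 1e3:.0f} ms)")

That works out to roughly 8 blocks (~36 ms) to absorb a single 15.6 ms stall, so if the queue keeps growing, either the stalls are happening more often than about every 36 ms or the ~3.0 ms average is optimistic under load.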

 

I haven't redone this timing yet with the 131072/16384 block sizes.



Tim Sileo
Message 4 of 10

If you are willing to sacrifice a little bit of resolution, you can set the binary sample width attribute to 16 and fetch binary 16 data.  This will reduce the amount of data you are processing.
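
To put a rough number on that, here is an illustration in Python/numpy terms (the array size and scaling coefficients are placeholders); scaling back to volts can be deferred or folded into later math:

import numpy as np

raw16 = np.zeros(131072, dtype=np.int16)       # stand-in for a binary-16 fetch
gain, offset = 2.0 / 32768, 0.0                # hypothetical vertical-scaling coefficients
volts = raw16.astype(np.float32) * gain + offset
print(raw16.nbytes, "bytes as I16 vs", raw16.astype(np.float64).nbytes, "bytes as DBL")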

Message 5 of 10

It would not surprise me if your optimum fetch chunk size changes with multiple channels.  It will depend on whether you are fetching the channels serially or in parallel from the same device.  I would recommend fetching serially from the same device and in parallel across the two devices, since PCIe does not share bus bandwidth between devices (PCI does, so there the fetches would effectively be serial anyway).
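
In outline, that pattern is one acquisition loop per digitizer, each fetching its channels serially and feeding its own queue. A Python sketch with a stand-in fetch function and hypothetical device names:

import queue
import threading
import time

def fetch_all_channels(device_name: str):
    time.sleep(0.01)                              # placeholder for a multi-channel fetch
    return b"..."

def acquisition_loop(device_name: str, out_q: queue.Queue, stop: threading.Event) -> None:
    while not stop.is_set():
        out_q.put(fetch_all_channels(device_name))   # channels fetched serially within a device

stop = threading.Event()
queues = {dev: queue.Queue(maxsize=500) for dev in ("PXI1Slot2", "PXI1Slot3")}  # hypothetical names
threads = [threading.Thread(target=acquisition_loop, args=(dev, q, stop))
           for dev, q in queues.items()]
for t in threads:
    t.start()
# ... consumers drain the queues in parallel; set stop and join the threads to shut down ...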

 

In your complex_modulator code, you can use autoindexing of the array on the input and output of the loop and eliminate the InPlace element in the loop.  In theory, this should compile to virtually the same code, but you may get some speed-up.

 

I would not recommend messing with the slice time on your OS.  Making it smaller creates greater overhead in the OS and can actually slow things down overall (have you tried making it bigger?).  You do need to make sure your buffers and acquisition can handle about 30ms of delay.  Catching up is another matter.

 

Good luck.  Let us know how you do.

 

Message 6 of 10

DFGray,

 

Ultimately, if I can get the code fast enough, I will be using two 5922 digitizers to get data from three separate antennas. So, I'll be using two channels on one digitizer and one on the other. In order to have code that supports both single and multiple channels I am using the NI-Scope Fetch VI with "1D WDT" selected. If there is only one channel being used then I only index the first WDT element. Otherwise I index two elements from the digitized array and process both in parallel. Hopefully the figure below provides more detail and doesn't add any confusion:

 

[Attached figure: CodeFlowDetailed.PNG]

 

I did change the complex_modulator as you suggested. I re-benchmarked the execution speed for the block size I was using, and it's just about the same as with the In Place Element. So it's good to know that autoindexing uses the same memory space, which is what I was trying to enforce.

 

I haven't tried making the slice time larger, but I think I'll stop messing with it. There are other processes requesting 10 ms instead of the 15.6 ms anyway, so it never was at the default like I thought. LabVIEW, as it turns out, requests a value of 1 ms when it first runs. You can test this in Win7 by running the following command in cmd: "\Windows\system32\powercfg.exe -energy -duration 10". This creates an energy-report HTML file in the system32 directory. Within the HTML there is a section on what the current system timer is set to, in 100 ns increments, and it also lists all the applications that are requesting system timer values. Just open LabVIEW to the Getting Started window while the powercfg command is running and it will capture that LabVIEW.exe actually makes a call into winmm.dll (which I assume is calling timeBeginPeriod to set the system timer).

 

At this point I may have to skip doing processing on this Win7 box and move it off somewhere else. However, I still need to make sure that I can do Digitization and UDP transmissions fast enough since the Win7 PC will still be used to accomplish that.

 

Regards,

 

Tim Sileo

 



Message 7 of 10

HSD,

 

Unfortunately, my requirements dictate that I stay with at least 18 bits of resolution. I suppose I could use the 2D I32 selection on the NI-Scope Fetch VI, but my downstream processing only works on SGL, CSG, DBL, or CDB, so I would take a hit at some point converting from I32 to float.
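
Per your earlier point 6, I could at least fold the I32-to-float scaling into the mixer multiply rather than doing a separate conversion pass; roughly, in numpy terms (the names and the gain handling are placeholders):

import numpy as np

def mix_scaled(raw_i32: np.ndarray, scaled_lo: np.ndarray) -> np.ndarray:
    """scaled_lo = gain * exp(-1j*phase), precomputed as complex64 for the block size."""
    return raw_i32.astype(np.float32) * scaled_lo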

 

Regards,

 

Tim Sileo



Message 8 of 10

I just remembered a further OS optimization that may help.  Windows is optimized to give the GUI preference for time slices.  LabVIEW, being highly multi-threaded, works better when all threads are given equal preference, so changing the optimization to favor background services can improve LabVIEW performance (although it probably will not be enough).  To change it: right-click My Computer and select Properties, click "Change settings" next to the computer name, domain, and workgroup settings, go to the Advanced tab of the System Properties dialog, click the Settings button under Performance, then on the Advanced tab of the Performance Options dialog set "Adjust for best performance of:" to "Background services".

Message 9 of 10

I actually did play around with that a little as well. Another engineer here pointed me to this article: http://blogs.msdn.com/b/embedded/archive/2006/03/04/543141.aspx. It gives you the registry key to change, which exposes more options than just the foreground/background selection on the Performance Options dialog. The Performance Options dialog changes the same registry value, but I don't remember what hex value it uses.

 

Thanks for thinking of even more things for me to try.

 

-Tim

 



Message 10 of 10