Multifunction DAQ


DAQmx read 10x overhead dependent on number of samples chosen

Hi all,

 

I'm trying to make a VI where I read data from an NI PXI-6115 DAQ device by triggering it 150 times, building an array, and later operating on that array.

So I have a for loop nested within another. The nested for loop does the read 150 times before handing the data over to the outer loop (please see the attached file). In the screenshot I've removed everything that processes the data, as well as another DAQ task that does analog output on a second device, but it is still slow. The triggering is also disconnected, and it's still slow.

I'm trying to read 1000 samples at 10 MHz, 150 times, with an external 10 kHz TTL square wave trigger. Without any overhead I'd expect roughly 0.2 ms per read (0.1 ms of acquisition at 10 MHz plus up to 0.1 ms waiting for the next 10 kHz trigger edge), so about 0.2 ms * 150 => 30 ms per frame. Instead I'm getting frame times of about 300 ms on an NI PXIe-8135 when acquiring 1000 samples and setting DAQmx Read to 1000 samples per channel.
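
In case a text version of the structure helps, here's roughly the shape of the acquisition, sketched with the nidaqmx Python API rather than G code. The device and PFI names are placeholders, and I've written the start/stop out per record just so the sketch stands on its own; it's not a literal translation of the VI.

```python
# Rough sketch of the acquisition structure (nidaqmx Python API standing in for
# the LabVIEW VI). "PXI1Slot2" and "PFI0" are placeholder names.
import time
import nidaqmx
from nidaqmx.constants import AcquisitionType

SAMPLES = 1000   # samples per triggered record
RATE = 10e6      # 10 MHz sample clock
READS = 150      # inner-loop reads per frame

with nidaqmx.Task() as ai:
    ai.ai_channels.add_ai_voltage_chan("PXI1Slot2/ai0")
    ai.timing.cfg_samp_clk_timing(RATE,
                                  sample_mode=AcquisitionType.FINITE,
                                  samps_per_chan=SAMPLES)
    # External 10 kHz TTL square wave as the start trigger
    ai.triggers.start_trigger.cfg_dig_edge_start_trig("/PXI1Slot2/PFI0")

    t0 = time.perf_counter()
    frame = []
    for _ in range(READS):                 # inner loop: 150 triggered records
        ai.start()
        frame.append(ai.read(number_of_samples_per_channel=SAMPLES))
        ai.stop()
    # the outer loop would take "frame" here, process it, and go around again
    print("frame time: %.1f ms" % ((time.perf_counter() - t0) * 1e3))
```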

 

However! Everything changes when I change the number of samples in DAQmx Sample Clock and samples per channel in DAQmx Read.

| DAQmx Timing samps/chan | DAQmx Read samps/chan | Frame time (ms) |
|                    1000 |                  1000 |             300 |
|                    1024 |                  1000 |              22 |
|                    1024 |                  1024 |              40 |  (Why twice as slow here?)
|                     100 |                   100 |             165 |
|                      10 |                    10 |             302 |  (What! Why??)
|                    2048 |                  2000 |              30 |
|                    2048 |                  2048 |              55 |
|                    2047 |                  2000 |             115 |
|                    2046 |                  2000 |              95 |
|                    8192 |                  8100 |             165 |

I thought my hardware might be dying, so I tried the same thing on a PXI-8820 with a different PXI-6115 connected, with similar results.

Do I have something seriously set up wrong that's causing this? Placing DAQmx Start Task and Stop Task within the loop doesn't change much... I really can't believe it would be so dependent on these numbers. I set up a similar loop a few years ago and don't remember having all this trouble.

Message 1 of 15

Your code is a little unconventional so I'd rather focus on *eliminating* the quirks you're seeing than try to explain them.

 

1. You should add an explicit DAQmx Start and DAQmx Stop.  The right place to put them depends on exactly what you're looking to characterize.  

2. You need to more carefully define the sequencing for when the msec Tick count is queried relative to the DAQmx functions.  

3. Your table seems to list only the final "Frame Time".  You'll get better data by looking at the distribution of the times you collect at the outer loop boundary.

4. Benchmarking while building an unbounded array on a While loop is asking for quirky-looking timing results.   Set the outer loop to be a For Loop with a constant # iterations.  You can still add the conditional terminal to allow early termination, but at least the compiler will know how big an array to allocate for timing data before code executes.  (One way to lay points 1-4 out is sketched just after this list.)

5. Future tip: be careful of pure equality termination conditions as seen in your inner loop.  You're fine for now when using integers, but there are classic pitfalls when folks try to terminate by checking equality on a floating point value.
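
If it helps to see points 1-4 in one place, here's roughly how I'd picture the benchmark laid out, sketched in the nidaqmx Python API as a stand-in for the G code. Device name, rate, and sizes are placeholders, not a recommendation.

```python
# Benchmark sketch: explicit Start/Stop, timestamps taken immediately around the
# DAQmx calls, a fixed number of outer iterations, and a preallocated list of
# frame times so the stats come from the whole distribution, not one value.
import time
import statistics
import nidaqmx
from nidaqmx.constants import AcquisitionType

SAMPLES, RATE, READS, TRIALS = 1000, 10e6, 150, 50
frame_ms = [0.0] * TRIALS                 # fixed-size "array" of frame times

with nidaqmx.Task() as ai:
    ai.ai_channels.add_ai_voltage_chan("PXI1Slot2/ai0")
    ai.timing.cfg_samp_clk_timing(RATE, sample_mode=AcquisitionType.FINITE,
                                  samps_per_chan=SAMPLES)
    for trial in range(TRIALS):           # outer For Loop, constant # iterations
        t0 = time.perf_counter()          # "tick count" right before the DAQmx calls
        for _ in range(READS):            # inner loop: 150 reads
            ai.start()                    # explicit start...
            ai.read(number_of_samples_per_channel=SAMPLES)
            ai.stop()                     # ...and explicit stop
        frame_ms[trial] = (time.perf_counter() - t0) * 1e3

print("mean %.2f ms, std %.2f ms"
      % (statistics.mean(frame_ms), statistics.stdev(frame_ms)))
```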

 

 

-Kevin P

ALERT! LabVIEW's subscription-only policy came to an end (finally!). Unfortunately, pricing favors the captured and committed over new adopters -- so tread carefully.
Message 2 of 15

Okay, I've tried to implement your suggestions and here are the results. Some inconsistencies are fixed but the main one remains.

Over 49 trials in the For Loop:

| DAQmx Timing samps/chan | DAQmx Read samps/chan | Mean time (ms) | Std dev (ms) |
|                    1000 |                  1000 |         299.27 |         4.39 |
|                    1024 |                  1000 |          38.92 |         1.17 |  This mean time matches my expectation
|                    1024 |                  1024 |          39.35 |         0.95 |  Changing DAQmx Read now has no effect
|                     100 |                   100 |         165.90 |         2.11 |
|                      10 |                    10 |         299.37 |         7.01 |  Much slower than expected
|                      16 |                    16 |          26.43 |         0.58 |  Much more reasonable
|                    2048 |                  2000 |          54.84 |         0.80 |
|                    2048 |                  2048 |          56.43 |         1.44 |
|                    2047 |                  2000 |         193.33 |         0.75 |
|                    2046 |                  2000 |         195.02 |         0.92 |
|                    8192 |                  8100 |         156.76 |         7.87 |
|                   32000 |                 32000 |         519.22 |         1.31 |
|                   32001 |                 32001 |         741.37 |        12.48 |  Ouch, that one extra sample. Consistent over multiple runs
|                   31999 |                 31999 |         752.53 |         2.99 |  Don't lose a sample either...

 

My CPU load during this is about 3%, and there aren't any spikes in the data from Windows doing some task scheduling or what have you.

From the hardware point of view, all of the power-of-two entries being faster makes some kind of sense, I suppose, but I'm not sure where it happens (on the ADC card itself, Windows behavior for memory reservation, etc.). The 32000 is a bit strange, but maybe it helps that 32000 = (2^5) * 1000?

If it is a memory allocation issue, why doesn't the driver (or whatever is responsible) just request the next biggest block instead of what is happening now?

If there is some place in the manual or datasheet that explicitly points out this behavior and I missed it, fine. But if somebody can explain this I'll mark an answer.
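
If the buffer-allocation theory is worth poking at, this is the kind of experiment I have in mind, sketched with the nidaqmx Python API (the device name is a placeholder, and I'm only assuming the input buffer can be overridden through the in_stream property; I haven't verified that this changes anything):

```python
# Hypothetical experiment: keep an "awkward" sample count but force the DAQmx
# input buffer to a nearby multiple-of-16 size, to see whether the slowdown
# follows the buffer allocation or the requested sample count.
import nidaqmx
from nidaqmx.constants import AcquisitionType

with nidaqmx.Task() as ai:
    ai.ai_channels.add_ai_voltage_chan("PXI1Slot2/ai0")
    ai.timing.cfg_samp_clk_timing(10e6,
                                  sample_mode=AcquisitionType.FINITE,
                                  samps_per_chan=2047)       # awkward count
    ai.in_stream.input_buf_size = 2048                       # forced "nice" buffer
    ai.start()
    data = ai.read(number_of_samples_per_channel=2000)
    ai.stop()
```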

 

I've put this back into my main program (choosing the 'correct' # of samples); the speedup is still there even with 10 kHz triggering, and everything is behaving much better. I might try to loop over sample choices for small numbers (<2000) and see what the trend looks like.

 

Message 3 of 15

BTW, if you want the while loop to run 150 times, then compare i = 149, not i = 150. The loop counter starts at zero, not one.
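
In text form, the counting looks like this (plain Python just to show the off-by-one):

```python
# The LabVIEW iteration terminal i starts at 0, so stopping when i == 149
# gives 150 passes; stopping at i == 150 gives 151.
count = 0
i = 0
while True:
    count += 1        # one read per pass
    if i == 149:      # compare against 149, not 150
        break
    i += 1
print(count)          # -> 150
```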

 

Regards, Jens

Kudos are welcome...
Message 4 of 15

Some of your numbers are at least curious, but they don't concern me unless they really mean something useful.  What exactly are you trying to understand about task overhead that leads you to re-run a task 7500 times and collect timing stats on it?   I want to be sure the benchmarking test mirrors real life in a practical enough way that the data is going to be actionable rather than just a "huh, whuddya know?"

 

I'd have expected the start and stop to occur inside the outer loop but outside the inner loop.  Otherwise, why have two loops?

 

If you were to do that, you'd be measuring the time to read 150*N samples (actually 151*N, but you can fix that) plus the overhead of start & stop.  You'd then have stats based on doing that measurement 50 times.  Note that you need an initial value for Tick Count.  I'd just drop another Tick Count primitive down outside the sequence structure and feed it through to the 2nd frame where you find the difference.  Then you can get rid of the shift register too.
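
In sketch form (again with the nidaqmx Python API standing in for the G code, placeholder device name), that arrangement would look roughly like this. Note the task has to be configured for the full 150*N samples if it stays a Finite task and is only started once per trial.

```python
# Sketch: Start/Stop outside the inner loop, so each timed trial covers one
# start, 150 reads of N samples, and one stop. The initial timestamp is taken
# before the whole sequence, so no shift register equivalent is needed.
import time
import statistics
import nidaqmx
from nidaqmx.constants import AcquisitionType

SAMPLES, RATE, READS, TRIALS = 1000, 10e6, 150, 50
trial_ms = [0.0] * TRIALS

with nidaqmx.Task() as ai:
    ai.ai_channels.add_ai_voltage_chan("PXI1Slot2/ai0")
    ai.timing.cfg_samp_clk_timing(RATE, sample_mode=AcquisitionType.FINITE,
                                  samps_per_chan=READS * SAMPLES)  # 150*N total
    for trial in range(TRIALS):
        t0 = time.perf_counter()           # initial tick count, before the sequence
        ai.start()                         # start once per trial
        for _ in range(READS):             # pull the data out in 150 chunks of N
            ai.read(number_of_samples_per_channel=SAMPLES)
        ai.stop()                          # stop once per trial
        trial_ms[trial] = (time.perf_counter() - t0) * 1e3

print(statistics.mean(trial_ms), statistics.stdev(trial_ms))
```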

 

 

-Kevin P

Message 5 of 15

Kevin - In my task I want it to run as fast as possible. I spent a whole day struggling to figure out why everything was running so slowly when I was requesting 1000 samples, and I tried every suggestion from my forum search for things like 'DAQmx read slow'. On a complete whim at the end of the day I changed my sample count from 1000 to 1024, and my frame time dropped from 300 ms to 40 ms. I dove into the help and support files, such as the page on determining buffer sizes (http://zone.ni.com/reference/en-XX/help/370466Y-01/mxcncpts/buffersize/), to find out whether there was some recommendation for sample sizes when the quickest acquisition is desired, but there was nothing I could find.

 

So, my original question is: Is the dependency on sample size an artifact of my code using the DAQmx library incorrectly, or is this performance inherent to the DAQmx library?

 

If my code can be fixed to have predictable run times regardless of small differences in sample sizes, I would like to know. If my code is OK and there are recommendations, I would also like to know where to find that information.

 

I have two loops because this mimics what is happening in the actual task I'm using. In the 'real' task I'm controlling a second PXI board that outputs a single-sample AO control value on a 64-line bus before each of the 150 measurements, to control exterior equipment. Then, once all 150 measurements are collected from the 150 different configurations of the exterior equipment, the result is displayed on screen and written to file. Then the outer loop makes it all happen again, because the exterior equipment is monitoring something. Sorry to be vague, but: confidentiality reasons.

All of this AO control, display to screen, etc, adds nothing to the loop time compared to the ADC read (at least down to a millisecond).

 

So finally, I checked all the sample numbers because, if there is nothing in the manual or data sheet about this, I want to know in the future what number of samples is best to use. It turns out any multiple of 16 is the best, at least for this code anyway.

 

Using a sample size that is a multiple of 16 in my real task, controlling real exterior equipment, speeds up the loop by nearly the same amount as it does in this example code. So at least that is useful to me. If there is a way to remove the dependence of the speed on sample size by changing something in my code, I'd be more than happy to hear suggestions.

Message 6 of 15

This is the result of checking sample sizes from 2 to 2000. I ran it a few times and the results are all very similar.

I forgot a legend, so: the blue line is the time it took for 151 iterations, and the green line is the variance.

 

Every multiple of 16 gives the fastest acquisition time by far. For whatever reason, from 550 samples to 800 samples the timing for the 'bad' sample counts increases from ~160 ms to ~300 ms, and then it modulates every 128 samples.
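
For reference, the sweep itself is nothing fancy; roughly this, if it were written against the nidaqmx Python API instead of G (device name is a placeholder):

```python
# Sketch of the sample-count sweep behind the plot: for each count from 2 to
# 2000, time 150 start/read/stop cycles and record the total frame time.
import time
import nidaqmx
from nidaqmx.constants import AcquisitionType

RATE, READS = 10e6, 150
frame_ms = {}                                  # sample count -> frame time (ms)

for n in range(2, 2001):
    with nidaqmx.Task() as ai:
        ai.ai_channels.add_ai_voltage_chan("PXI1Slot2/ai0")
        ai.timing.cfg_samp_clk_timing(RATE, sample_mode=AcquisitionType.FINITE,
                                      samps_per_chan=n)
        t0 = time.perf_counter()
        for _ in range(READS):
            ai.start()
            ai.read(number_of_samples_per_channel=n)
            ai.stop()
        frame_ms[n] = (time.perf_counter() - t0) * 1e3
```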

Message 7 of 15

First a little fair disclosure.  My background includes quite a lot of DAQmx work and none of the stuff you're describing has ever gotten on my radar.  So I started this discussion from a stance of skepticism, needing to be convinced.

 

I still have some questions about task timing, but the data you presented sure does show a convincing preference for sample quantities that are a multiple of 16.   The behavior for other quantities is a little inexplicable, though.  At the low end it kinda makes sense that the curve is essentially flat -- there, the time it takes to advance the code from Start to Read is long enough that all the samples are already there when Read executes.  Then there's a region where time increases pretty linearly with sample count.  This also makes sense, b/c now you add time waiting for the additional samples to be taken: more samples makes for a longer time.  All good so far.   But then the high end reverts to being essentially flat again.  I find that surprising too, though I agree that the multiple-of-16 effect is more striking.

 

My next theory is to suspect that this behavior may possibly be unique to Finite Sampling tasks.  In my own work, I've usually used Continuous tasks.  When I've used Finite tasks, I haven't needed to stop and restart them in such a rapid-fire way where overhead would become both important and visible.  (BTW, the "commit" you're already doing is the best thing I know of to lower the overhead for stopping and restarting a Finite task.)
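
For anyone following along in text, the commit step amounts to roughly this (nidaqmx Python API as a stand-in, placeholder device name):

```python
# Committing the task once up front does the reservation/programming work ahead
# of time, so each start() in the loop only has to make the committed->running
# transition, and stop() drops back to the committed state instead of tearing
# everything down.
import nidaqmx
from nidaqmx.constants import AcquisitionType, TaskMode

with nidaqmx.Task() as ai:
    ai.ai_channels.add_ai_voltage_chan("PXI1Slot2/ai0")
    ai.timing.cfg_samp_clk_timing(10e6, sample_mode=AcquisitionType.FINITE,
                                  samps_per_chan=1000)
    ai.control(TaskMode.TASK_COMMIT)      # commit once, before the loop
    for _ in range(150):
        ai.start()
        ai.read(number_of_samples_per_channel=1000)
        ai.stop()
```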

 

Can you try another couple experiments, both using Continuous Sampling?  And just to close the loop on things, can you post the code used for the benchmarking?

1. Leave everything else the same with Start-Read-Stop inside the inner loop.

2. Move the Start and Stop outside the inner loop.   Note: you may get buffer overflow errors at high sample rates and low # samples read per loop.

 

All in all, I think the best path forward is to look for a way not to *need* to stop and restart your task via software.  You say that another board generates an AO sample before each of the 150 inner loop measurements.  Well, maybe you can use that other board's AO sample clock as a trigger for this AI task, *and* make this AI task retriggerable.  (Your board may not support this directly but there are workarounds).  Another option might be to add another channel to your AI task and wire it to the AO signal.  Then you could do continuous acquisition and post-process to do your correlation.  Kinda ugly and indirect, but it could be made to work.
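
Sketched with the nidaqmx Python API (terminal and device names are placeholders, and as noted your board may not support retriggerable AI directly), the retriggerable idea would look something like this:

```python
# Retriggerable finite AI: the task is started once, re-arms itself after each
# 1000-sample record, and each record is kicked off by the other board's AO
# sample clock, so no software stop/restart is needed.
import nidaqmx
from nidaqmx.constants import AcquisitionType

with nidaqmx.Task() as ai:
    ai.ai_channels.add_ai_voltage_chan("PXI1Slot2/ai0")
    ai.timing.cfg_samp_clk_timing(10e6, sample_mode=AcquisitionType.FINITE,
                                  samps_per_chan=1000)
    # Trigger each record off the AO board's sample clock terminal...
    ai.triggers.start_trigger.cfg_dig_edge_start_trig("/PXI1Slot3/ao/SampleClock")
    # ...and re-arm automatically after every record.
    ai.triggers.start_trigger.retriggerable = True

    ai.start()                                  # started once, up front
    records = []
    for _ in range(150):
        records.append(ai.read(number_of_samples_per_channel=1000))
    ai.stop()
```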

 

There may be some other techniques to sync your AI data to the other stuff going on.  PXI supports a lot of flexible routing of trigger and clock signals amongst boards.  A more complete description of the timing relationships among these things could also help.  (How often does the other board generate new AO samples?  Is it hardware timed?)  What other boards are available for your PXI chassis?   If you have another AI board, try your timing experiments on the other board.  

 

 

-Kevin P

Message 8 of 15

Kevin,

Thank you for all your help.

I put the code here for everyone's perusal. I made the DAQmx configuration into a subVi to make things a bit cleaner.

 

Unfortunately I've run into another weird thing that makes me feel like my code is not set up properly. Adding the external trigger keeps the speedup ONLY if the AO output is in the same sequence structure as the AI DAQmx Start VI. Even if the AO output channel is completely deleted, everything becomes slow again. I attached a picture to show what I mean.

 

Message 9 of 15

Still theorizing here.  I briefly dabbled with a version of your example.  Had to change the sample rate to something my MIO board supported though.

 

I put in probes and stuff to get a quick look.  This isn't ideal for true benchmarking, but I've only got time for a quick check.  I saw pretty erratic and inconsistent timing from the "Start-Read-Stop" frame.

 

I changed the config to do a simple "verify" instead of a "commit".  As expected, the times got significantly longer.  They also remained erratic.  I reverted back to "commit".

 

Next I changed the sampling mode to "Continuous" instead of "Finite".  AHA!   This made an absolutely HUGE difference in timing consistency.  I was getting 21 msec +/- about 3 msec over more than 100 measurements.   A further switch to "verify" bumped the timing up to more like 200 msec, but it remained very consistent.

 

So, if you *MUST* stop and restart your task, it appears you'll get much better timing consistency by configuring the task as though it will be continuous.  You'll still be free to read a finite # samples from it and then stop it prematurely.
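
In rough text form (nidaqmx Python API as a stand-in, placeholder device name), the "configure as Continuous, read a finite chunk, stop early" pattern is:

```python
# Continuous-mode task used for finite-sized reads: samps_per_chan is only a
# buffer-size hint here, the read pulls a fixed 1000 samples, and stop() ends
# the acquisition early. Commit is kept to cheapen the restarts.
import nidaqmx
from nidaqmx.constants import AcquisitionType, TaskMode

SAMPLES, RATE = 1000, 10e6

with nidaqmx.Task() as ai:
    ai.ai_channels.add_ai_voltage_chan("PXI1Slot2/ai0")
    ai.timing.cfg_samp_clk_timing(RATE, sample_mode=AcquisitionType.CONTINUOUS,
                                  samps_per_chan=10 * SAMPLES)   # buffer hint only
    ai.control(TaskMode.TASK_COMMIT)
    for _ in range(150):
        ai.start()
        data = ai.read(number_of_samples_per_channel=SAMPLES)   # finite read...
        ai.stop()                                                # ...stop early
```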

 

I'd still recommend a harder look at, and more consideration of, an approach that doesn't require all these software-timed stops and restarts.

 

 

-Kevin P

Message 10 of 15