03-22-2018 07:08 PM
Dear members,
I have designed and tested a block that computes the cross-correlation between two received signals on the USRP's FPGA. I am trying to transfer the resulting data to the host (PC), but the transfer is much slower than expected. I am using two DMA FIFOs for this task, and want to transfer 2700 samples of FXP <64,64> every 150 microseconds. That yields about 21.6 KB per 150 µs for each FIFO, which is 144 MB/s; the total summed throughput would therefore be 288 MB/s.
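As a sanity check, the throughput figures quoted above can be verified with a few lines of arithmetic (all input values come from the post itself):

```python
# Verify the per-FIFO and total throughput claimed above.
SAMPLE_BYTES = 8            # one FXP <64,64> sample is 64 bits = 8 bytes
SAMPLES_PER_BURST = 2700    # samples pushed into each FIFO per burst
BURST_PERIOD_S = 150e-6     # one burst every 150 microseconds
NUM_FIFOS = 2

bytes_per_burst = SAMPLES_PER_BURST * SAMPLE_BYTES       # 21,600 B = 21.6 KB
per_fifo_mb_s = bytes_per_burst / BURST_PERIOD_S / 1e6   # 144.0 MB/s per FIFO
total_mb_s = per_fifo_mb_s * NUM_FIFOS                   # 288.0 MB/s total
```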
I am able to transfer this data for a few seconds; then the FPGA-side FIFOs signal that they are no longer ready for input (I have tied the "ready for input" node to an AUX I/O pin, which I monitor with an oscilloscope).
Is this the maximum limitation, or am I doing something wrong?
Best regards,
Mihai
03-26-2018 09:16 AM
Let's make this issue even simpler.
In a slow SCTL (5 MHz) on the FPGA, I am using a counter to generate numbers between 0 and 2689. The numbers are in U16 format and are delivered to a target-to-host DMA FIFO. Its input is always valid, because I want to stream these numbers to the host continuously.
The issue is that the host FIFO gets full after a few moments of running. I expected to be able to stream these numbers continuously, since I send 16 bits every 0.2 microseconds. That is 80 megabits per second, i.e. only 10 MB/s, which I think is well within what the PCIe x4 interface supports.
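The data-rate claim in this simplified test checks out, as a quick calculation shows (one U16 sample per tick of the 5 MHz SCTL):

```python
# Data rate of one U16 sample per tick of a 5 MHz single-cycle timed loop.
CLOCK_HZ = 5e6           # SCTL rate: one sample every 0.2 microseconds
BITS_PER_SAMPLE = 16     # U16

mbit_s = CLOCK_HZ * BITS_PER_SAMPLE / 1e6   # 80 Mbit/s
mb_s = mbit_s / 8                           # 10 MB/s
```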
The VI on the FPGA side:
And on the host side:
Am I doing something obviously wrong?
03-26-2018 09:28 AM
Although this KB article doesn't apply directly to your issue, it has items (under Solutions) that improve throughput.
https://knowledge.ni.com/KnowledgeArticleDetails?id=kA00Z000000P9zTSAS
Can you try the suggestions under Solutions (starting from the top) except the 2nd bullet?
03-26-2018 10:59 AM
How are you determining that the FIFO is full? From the Ready for Input signal you are routing to Aux I/O 1?
First thought is: you are pumping data into the FIFO from the FPGA as soon as the FPGA VI starts running. This will be as soon as the bitfile is downloaded, which will be well before you start reading data on the Host. So, depending on how large you have configured the FIFO on the FPGA side, you could be filling it before you even have a chance to read it out on the Host. I recommend gating Input Valid with a signal from the Host, so you don't start filling the FIFO before you are ready to read it out on the Host.
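The gating idea can be modeled outside LabVIEW as a producer that refuses to write until the consumer raises a "ready" flag. This is only an illustrative host-language sketch (the names `fpga_loop`, `host_ready`, etc. are invented for the example, not LabVIEW API):

```python
# Model of gating Input Valid with a host-driven signal: the producer
# (standing in for the FPGA loop) does not write until the consumer
# (standing in for the host) signals that it is ready to read.
import threading
import queue

fifo = queue.Queue(maxsize=1024)   # stands in for the DMA FIFO depth
host_ready = threading.Event()     # stands in for the host-driven gate

def fpga_loop(n_samples):
    host_ready.wait()              # gate: no writes until the host is ready
    for i in range(n_samples):
        fifo.put(i)                # Input Valid asserted only past the gate

def host_loop(n_samples):
    host_ready.set()               # open the gate once reads are configured
    return [fifo.get() for _ in range(n_samples)]

t = threading.Thread(target=fpga_loop, args=(10,))
t.start()
data = host_loop(10)
t.join()
```

Without the `host_ready` gate, the producer would begin filling the FIFO immediately and could saturate it before the first read, which is exactly the failure mode described above.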
Additional thoughts:
04-01-2018 04:28 PM
DylanDylanDylan,
The solution in the link was useful. I tried it, and it increased my throughput somewhat.
I have placed the above methods in my while loop. Now, at each iteration, the number of elements read is obviously not constant; the approach simply empties the host-side FIFO of all its elements. However, it is much faster than the previous method.
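The "empty the FIFO of all its elements" pattern described above can be sketched with a plain deque (in LabVIEW this corresponds to reading 0 elements to learn Elements Remaining, then issuing a second read of that count; the Python names are illustrative only):

```python
# Variable-size "read whatever is available" pattern, modeled on a deque.
from collections import deque

host_fifo = deque(range(100))   # stand-in for the host-side DMA FIFO

def read_all_available(fifo):
    n = len(fifo)                          # "Elements Remaining"
    return [fifo.popleft() for _ in range(n)]

chunk = read_all_available(host_fifo)      # size varies per loop iteration
```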
psisterhen,
I was looking at the signal on AUX I/O 1 to know when the FIFO is full.
You are right. I have modified the FPGA side and gated the input valid with a signal from the Host.
The FIFO on the host side is as large as LabVIEW allows (121 million samples in my case).
I am sure that I am starting the host side of the FIFO (I use the Stop, Configure, and Start methods).
My application issues a fixed chunk of 5378 samples every 200 microseconds. That is 26.89 million samples per second, at 64 bits each, i.e. about 215 MB/s. It seems I am demanding too much of the host PC to transfer and plot that much data every second. Is there a more efficient way of doing this?
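For reference, the arithmetic behind the 215 MB/s figure above:

```python
# Throughput of a fixed 5378-sample chunk every 200 microseconds.
SAMPLES_PER_CHUNK = 5378
CHUNK_PERIOD_S = 200e-6
SAMPLE_BYTES = 8    # 64-bit samples

ms_per_s = SAMPLES_PER_CHUNK / CHUNK_PERIOD_S / 1e6                 # 26.89 MS/s
mb_s = SAMPLES_PER_CHUNK * SAMPLE_BYTES / CHUNK_PERIOD_S / 1e6      # 215.12 MB/s
```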
Thank you,
Mihai
04-02-2018 05:01 PM
Hi Mihai,
An NI Systems Engineer recently created examples which show how to stream to/from a USRP-RIO extremely efficiently. Please reference the project here (this is new, and we are working on integrating it so it's easier to find!):
https://github.com/NISystemsEngineering/USRP-RIO-Streaming
Please check it out (read the readme) and give us any feedback/questions you may have!
Keep in mind these are very minimal projects that only implement streaming (DMA); they don't even utilize RF, so further integration will be needed for your application.
04-04-2018 10:40 AM
If you are connected to the host via a x4 PCIe link (i.e. to a MXIe card in a desktop or PXIe chassis), you will be able to stream 215 MB/s easily. If you are using a x1 PCIe link, say with a cabled PCIe card on a laptop, you will not be able to stream more than ~200 MB/s.
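A rough link-budget calculation explains those limits. Assuming a PCIe Gen 1 link (2.5 GT/s per lane with 8b/10b encoding, as on the PCIe-8371-class hardware discussed here) and a commonly assumed ~80% protocol efficiency:

```python
# Rough PCIe Gen 1 bandwidth arithmetic: 2.5 GT/s per lane with 8b/10b
# line coding gives 250 MB/s raw per lane; the 0.8 efficiency factor for
# packet/protocol overhead is an assumption, not a measured value.
GT_PER_LANE = 2.5e9    # transfers per second, one Gen 1 lane
ENCODING = 8 / 10      # 8b/10b line coding
EFFICIENCY = 0.8       # assumed overhead factor

def usable_mb_s(lanes):
    return GT_PER_LANE * ENCODING / 8 * lanes * EFFICIENCY / 1e6

x1 = usable_mb_s(1)    # ~200 MB/s: marginal for a 215 MB/s stream
x4 = usable_mb_s(4)    # ~800 MB/s: comfortable headroom
```

This is consistent with the guidance above: a x1 link tops out near 200 MB/s, while a x4 link handles 215 MB/s easily.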
Now, even if you can successfully stream data to memory, you may not be able to process and display it that quickly. Some tips:
04-19-2018 05:14 PM - edited 04-19-2018 05:16 PM
@DylanDylanDylan
Thanks for the example project. I was quite happy to find that someone is trying to address this issue. I have added some of the elements from the Dummy DMA FPGA VI to my project, then carefully used the loops found in USRP Dummy DMA to Memory (Host).vi on my host interface.
While it seems to be working a little faster, it is not fast enough. I have the following observations:
1. In the FPGA VI, you used the "timeout" type of interface to the FIFO, as if it were constantly delivering samples. I want to use the "handshaking" type of interface, since I am constrained by some upstream blocks. Is that possible?
2. I also want to use two FIFOs instead of just one, with simultaneous transfer. How will that work on the host? Does it mean I have to duplicate the fetch loop (the green one) and the data loop (the yellow one)? (I already tried that.)
3. I see that the host VI uses a queue to store the elements received from the FIFO. It also uses a "Lossy Enqueue Element" VI. With that one the transfer is fast, but it clearly suffers from sample loss. If I replace it with the lossless "Enqueue Element" VI, the host-side FIFO gets full very quickly, which is the same issue as before.
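The lossy-versus-lossless trade-off in point 3 can be demonstrated in plain Python: a bounded deque with `maxlen` silently drops the oldest element (like Lossy Enqueue Element), while a lossless container keeps everything, so the backlog grows without bound when the consumer is slower than the producer:

```python
# Bounded lossy buffer vs. unbounded lossless buffer.
from collections import deque

lossy = deque(maxlen=5)   # like Lossy Enqueue Element: drops oldest when full
lossless = []             # like Enqueue Element: keeps everything

for i in range(8):
    lossy.append(i)       # once full, samples 0..2 are silently discarded
    lossless.append(i)    # backlog grows without bound
```

After the loop, `lossy` holds only the 5 newest samples, while `lossless` holds all 8; with a bounded blocking queue instead of a list, the producer would stall, which is exactly the "FIFO gets full" symptom described above.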
Also, I am displaying the received samples in a plot, to see them in real time. That probably consumes host resources; everything works somewhat faster after I delete the plot.
I am using a PCIe card in a desktop, with PCIe x4 cable connection, exactly this model: http://www.ni.com/ro-ro/support/model.pcie-8371.html.
Q:if possible, collect all the data you need first and process after the acquisition and fetching have completed
A: That is not possible. I am trying to build a real-time application, with live acquisition and processing.
Q:Separate the processing loops and fetching loops so that the processing can run on a separate thread and, potentially, on a separate processor core.
A: I do not know how to do that in LabVIEW yet. Some say that LabVIEW does this intrinsically.
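The fetch/process split asked about above is the classic producer/consumer pattern: one loop only fetches, one loop only processes, and a queue decouples them. A minimal Python threading sketch of the idea (in LabVIEW the equivalent is two independent while loops sharing a queue reference; the names here are illustrative):

```python
# Producer/consumer split: fetching and processing run on separate threads,
# decoupled by a queue, so slow processing never stalls the fetch loop.
import threading
import queue

q = queue.Queue()
results = []

def fetch_loop():                    # would call the DMA FIFO Read in LabVIEW
    for chunk in range(5):
        q.put(list(range(chunk, chunk + 3)))
    q.put(None)                      # sentinel: acquisition finished

def process_loop():                  # all heavy processing lives here
    while (chunk := q.get()) is not None:
        results.append(sum(chunk))

t1 = threading.Thread(target=fetch_loop)
t2 = threading.Thread(target=process_loop)
t1.start(); t2.start()
t1.join(); t2.join()
```

With this structure the operating system (or, in LabVIEW, the execution system) is free to schedule the two loops on separate cores.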
Best regards,
Mihai
04-20-2018 10:24 AM
Hi @Mihai_Liviu
You are more than welcome to contribute to the GitHub repo. My recommendation is to create an FPGA example like the dummy one in the GitHub repo: remove all the noise and concentrate on moving a lot of data. This will help you reproduce some of your questions and test them in isolation from the rest of the code.
1. Yes, it is possible to use handshaking. You can look at some of the FlexRIO examples and leverage them. Search the LabVIEW examples for "High Throughput" (LabVIEW -> Help -> Find Examples).
2. If we assume one DMA channel runs a bit faster than the other (say one has 10 samples available while the other has 11), you need to control each FIFO on the host independently. That does not necessarily mean two loops, but two loops are the easiest way; you could also handle it in one loop with some logic and case structures.
3. Those displays are not meant to capture all the data, only to show a snapshot of what is happening, so you are right: some data will be lost. I could be wrong, but I doubt you really need to display all the data; you could process the stream and display a summary instead. Just thinking out loud.
Let's forget about LabVIEW for a sec 🙂
Would you be so kind as to re-calculate how you get 288 MB/s, but based on the IQ rate?
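For what it's worth, re-expressing the 288 MB/s figure as an IQ rate is straightforward given the 64-bit samples and two FIFOs from the first post:

```python
# 288 MB/s of 64-bit samples across two FIFOs, expressed as a sample rate.
SAMPLE_BYTES = 8                              # 64-bit (FXP <64,64>) samples
total_bytes_s = 288e6                         # aggregate of both FIFOs

total_ms_s = total_bytes_s / SAMPLE_BYTES / 1e6   # 36 MS/s aggregate
per_channel_ms_s = total_ms_s / 2                 # 18 MS/s per FIFO/channel
```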
02-25-2019 11:05 PM
Hello All,
Was there a solution for Mihai's issue?
I am having the same problem and not able to sustain 200 MS/s using the USRP RIO drivers on NI USRP-2944R (160 MHz BW) through NI PCIe-8371 using any of the FPGA example projects for simultaneous RX/TX.
Any help is appreciated... thanks,
HoSsEiN