06-23-2017 03:01 AM
Dear all:
I noticed that when I click Properties for the FPGA target on the 9068, I get a new page about DRAM, and at the end of the General page I found this:
Memory outside the FPGA:
DRAM:
Banks: 1
Bank Size: 4 MB
Data Width: 64 bits
I'm more than happy to see there is a block of memory that we can access from the FPGA. I compiled the example VI provided, and it works.
Now I have two questions:
1. What kind of performance can I expect from this memory?
2. Will there be a cost on the RT CPU, or does it just use low-level data exchange?
Thanks!
06-23-2017 03:42 PM
Hi jiangjiang,
We wrote this White Paper to describe the new feature: http://www.ni.com/white-paper/53881/en/
Please read through it for a thorough explanation of how it works. There is a section at the bottom comparing performance against Read/Write Controls (registers) and DMA FIFOs.
If you are looking for more explicit numbers, we have included an example with CompactRIO 17.0 to allow you to easily benchmark and provide a reference architecture. Please open the Closed Control Loop (Host Memory Buffer) example and run it on your target. The example is created so that it should be easy for you to insert your RT and FPGA algorithms and benchmark for your application.
06-24-2017 09:27 PM
Thank you very much for your answer.
My plan is to use this as a picture buffer; my application needs me to save frames in memory where the FPGA can have quick sequential access to them.
I used to use the BRAM provided in the FPGA, but the size is not large enough, and to avoid data overflow / losing detail, I saved each intermediate pixel as FXP18, so I could only operate on a 256*256 picture. With HMB, I could operate on 512*512*U32*4 frames, which would be great.
The question I have now is:
Since there is no return indicating whether a memory write succeeded or not, would it overrun if I continuously write to sequential HMB addresses? And would it be possible to read sequential addresses at the same time, at a speed of 40MHz, or maybe even 10MHz?
Thanks!
06-26-2017 09:04 AM - edited 06-26-2017 09:05 AM
Since there is no return indicating whether a memory write succeeded or not, would it overrun if I continuously write to sequential HMB addresses
The write method has a "Ready for Input" terminal. If you assert "Input Valid" before "Ready for Input" is asserted, it will be ignored. So there shouldn't be any buffer overrun if you use the handshake terminals correctly.
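If it helps to see the idea outside of the block diagram, here is a minimal sketch in Python pseudocode (not LabVIEW and not the actual implementation; the class and names are made up for illustration) of how the handshake avoids overruns: a write is only accepted on a cycle where both Input Valid and Ready for Input are true, and anything offered while the interface is not ready is simply ignored rather than overwriting queued data.

```python
# Conceptual model only (invented names, not NI code): the Write method's
# "Ready for Input" / "Input Valid" handshake in front of a request FIFO.

class HmbWriteModel:
    def __init__(self, request_fifo_depth=512):
        self.request_fifo_depth = request_fifo_depth
        self.pending = []        # write requests waiting to go out over the bus
        self.accepted = 0
        self.ignored = 0

    @property
    def ready_for_input(self):
        # Ready only while the request FIFO has room.
        return len(self.pending) < self.request_fifo_depth

    def clock(self, input_valid, address=None, data=None):
        """One FPGA clock cycle of the write interface."""
        if input_valid and self.ready_for_input:
            self.pending.append((address, data))   # write accepted
            self.accepted += 1
        elif input_valid:
            self.ignored += 1   # not ready: the write is ignored, nothing overruns
        # (A real interface would also be draining self.pending over the bus.)

hmb = HmbWriteModel()
for addr in range(5):
    hmb.clock(input_valid=True, address=addr, data=addr)
print(hmb.accepted, hmb.ignored)   # 5 0 -> all accepted while there is room
```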
would it be possible to read sequential addresses at the same time, at a speed of 40MHz, or maybe even 10MHz?
You can read at the same time as writing, but they are not completely independent, so you may get higher throughput doing one at a time.
You may be able to issue short bursts of requests at 10MHz or even 40MHz because the requests will be collected in the request FIFO, but I'm not confident the DMA interface can keep up with that rate. From some simple testing I have done, I've seen around 8MHz on average, assuming there isn't anything else on your system using DMA.
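To put that in bandwidth terms, here's the back-of-the-envelope math (a rough sketch only, assuming one 8-byte element per request since the HMB data width is 64 bits):

```python
# Rough back-of-the-envelope math, assuming one 8-byte (64-bit) element per request.
requests_per_second = 8e6            # ~8 MHz observed on average
bytes_per_request = 8                # 64-bit HMB data width
effective_bandwidth = requests_per_second * bytes_per_request
print(effective_bandwidth / 1e6)     # ~64 MB/s of HMB traffic
```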
06-27-2017 09:06 AM
Thank you very much, nturley!
In my mind I thought HMB would cost two DMA channels, one for uplink and one for downlink, and that both operations could take place at the same time while introducing a large amount of latency. I can live with that kind of latency if the overall throughput is OK, but from your words it seems only one direction is active at any given moment. Is that right?
Also, how many read requests should I issue into the queue? I set the maximum outstanding data to 512; does that mean I can issue 512 requests before I get any results, and after I have retrieved the data I can then issue another 512 requests? Sorry if that's a silly question, but I have never used a DRAM interface before.
06-27-2017 11:13 AM - edited 06-27-2017 11:15 AM
Yeah, don't worry about it, it's a little confusing.
A DMA Channel is a collection of registers and buffers that manage bulk data transfer between the host and the FPGA. The general abstraction is a stream, so the registers and buffers are oriented around how full the buffer is, where the buffers are located in memory, how much data there is left to transfer, and so on. DMA FIFOs and the Scan Engine both use DMA Channels to transfer data. So you may notice that UDVs (which use the Scan Engine) consume two DMA Channels, one for each direction, just like you were envisioning.
Another important thing to understand is that DMA Channels aren't completely independent. They still compete for shared resources. If a shared resource is unavailable, then a DMA channel can have reduced performance.
An important shared resource is the system bus. On Zynq, all DMA goes through a single bus interface that is 64 bits wide and is run at 40MHz. So the maximum possible bandwidth is around 320MB/s. I believe it's possible to saturate that with a single FIFO, so if you have 4 FIFOs running at full speed, they will each have roughly 1/4 the bandwidth that they would have in isolation.
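If you want to sanity-check that number, it's just the bus width times the clock (a rough sketch that ignores protocol overhead):

```python
# Rough arithmetic for the shared Zynq DMA bus interface, ignoring protocol overhead.
bus_width_bytes = 64 // 8          # 64-bit wide bus
bus_clock_hz = 40e6                # runs at 40MHz
peak_bandwidth = bus_width_bytes * bus_clock_hz
print(peak_bandwidth / 1e6)        # 320.0 MB/s total, shared by everything doing DMA

# Four FIFOs running at full speed each get roughly a quarter of that.
print(peak_bandwidth / 4 / 1e6)    # ~80 MB/s per FIFO
```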
Another important shared resource is CPU. On Zynq, I believe that a single core can barely keep up with a full-speed FIFO. Even if you had twice the DMA bandwidth for DMA Channels, the CPU wouldn't be able to process the data fast enough to keep up.
We decided not to leverage DMA Channels for Host Memory Buffer, but we are still on the same bus interface. Even though HMB isn't a DMA Channel, it still competes for those shared resources: DMA bandwidth and CPU. All DMA Channels and HMB compete for time on the same bus, so that 320MB/s is shared between HMBs, FIFOs, and the Scan Engine. And HMB tends to use that bandwidth less efficiently than the other two.
One of the pieces of infrastructure that was leveraged was the onboard DRAM interface. The onboard DRAM interface has a request FIFO, and its depth is configurable (through the "Maximum outstanding data" setting). So if you set that setting to 512 and for some reason HMB can't get any requests through the bus interface, then 512 requests will be queued up before it stops accepting additional requests. The addresses for reads and writes go into the same request FIFO, so if a write request blocks, then any read requests behind it will also block.
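If a picture helps, here is a very rough conceptual model in Python (invented names, not the real driver or FPGA code) of that single shared request FIFO and the head-of-line blocking it implies:

```python
# Conceptual sketch: a single request FIFO shared by read and write requests.
# If the request at the head of the FIFO cannot be issued to the bus, everything
# behind it waits, even reads that could otherwise proceed.

from collections import deque

MAX_OUTSTANDING = 512   # "Maximum outstanding data" setting

request_fifo = deque()  # holds ("read", addr) and ("write", addr) in order

def enqueue(kind, addr):
    """Accept a request only while the FIFO has room."""
    if len(request_fifo) < MAX_OUTSTANDING:
        request_fifo.append((kind, addr))
        return True
    return False            # FPGA-side interface deasserts Ready

def service(bus_can_accept_writes, bus_can_accept_reads):
    """Try to issue the request at the head of the FIFO to the bus."""
    if not request_fifo:
        return None
    kind, addr = request_fifo[0]
    can_issue = bus_can_accept_writes if kind == "write" else bus_can_accept_reads
    if can_issue:
        return request_fifo.popleft()
    return None             # head-of-line blocking: reads behind a stuck write also wait

enqueue("write", 0x0000)
enqueue("read", 0x1000)
# If the bus temporarily cannot accept writes, the read behind the write is stuck too:
print(service(bus_can_accept_writes=False, bus_can_accept_reads=True))  # None
```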
06-28-2017 10:17 AM
Thank you for your detailed explanation, it is really helpful!
1. For the first part, I was thinking HMB is two packed DMA interfaces, so a larger outstanding data request should help me get better performance. But I read in the HMB guidelines that HMB will write a small amount of data to the host at a higher rate; does that mean a larger outstanding data request not only does not increase performance but actually decreases it? Should I set the max outstanding data value to the minimum value LabVIEW permits for better performance?
2. I have tested HMB with FPGA operations only. If I set the clock to 6 MHz or lower, I can make sure that on each clock cycle at least one operation is processed, a data write or a data read; if the clock is set higher, like 10 MHz, the number of write and read operations added together is less than the total number of cycles. But if I only write to HMB and always wire False to the Read node, I get about a 60% write success rate, so it seems single-direction operation does bring a noticeable performance increase, which fits your explanation.
3. I know I can issue requests before I get the valid memory read back, but if I do this in a streaming fashion, how can I make sure the data I just got from Retrieve Data is the right data for the memory address? Does that mean I should track Output Valid from the Retrieve Data node and map it to the addresses I issued to Request Data?
Thank you!
06-28-2017 12:03 PM
HMB will write a small amount of data to the host at a higher rate
That's not quite right. HMB will take less time to transfer a small amount of data to the host than DMA FIFOs would. Latency is the amount of time it takes to transfer the first element. Throughput is the number of elements that can be transferred in some amount of time.
Transfer time = latency + number of elements / throughput
Your payloads are quite large, so I think the latency will be negligible and the only metric that matters for you is throughput. If the highest rate you can set your clock to is 6MHz and the data width for HMB is 8 bytes, then you are measuring a throughput of 48MB/s. As I said before, the maximum DMA bandwidth is 320MB/s, so you are measuring between 1/6 and 1/7 the theoretical bandwidth of the bus, which is close to what I remember measuring.
Your payload is 512*512*U32*4, which can be repacked into 512*256*U64*4. U64 = 8 bytes, so your total payload is 4MB. You measured 48MB/s and your payload is 4MB, so you can expect a transfer time of 83ms or so. If you are simultaneously reading and writing, we can double that and get 166ms, which means that if you are doing this continuously you can expect to do a full read and write about 6 times per second.
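Spelling that arithmetic out (a quick sketch using the numbers above; latency is ignored since it is negligible for a payload this large):

```python
# Quick check of the numbers above, ignoring latency.
frame_bytes = 512 * 512 * 4        # one 512*512 frame of U32 pixels
payload_bytes = frame_bytes * 4    # four frames, about 4MB total
throughput = 48e6                  # measured: 6MHz * 8 bytes per element

transfer_time = payload_bytes / throughput
print(transfer_time)               # ~0.087 s, i.e. the ~83ms ballpark per direction
print(1 / (2 * transfer_time))     # ~5.7, roughly the 6 full read+write passes per second
```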
If you are reading and writing sequentially and you've got enough CPU then you can use DMA FIFOs instead and do a full read and write about 36 times per second.