Using FPGA to analyze data from disk

nkmath · 08-31-2015

Hi All,

I am attempting to use an FPGA to analyze stored data from disk, with the idea that it will help decrease computation time, which is a major bottleneck. I am using an NI 5761 digitizer paired with a PXIe-7965R FPGA to digitize two channels at 125 MHz with 8 bits of resolution per channel. You may wonder why I am not just performing the computation upon digitization. That is because we will need to use separate digitizers at separate locations and then analyze the data between the different streams. Thus, we need to first store the data, then analyze it offline.

My first question is whether it is even reasonable to expect that this will help reduce computation time. The calculation requires us to perform a cross-correlation between each of the data streams over hour-long timescales. Currently I am just unpacking the data from the TDMS file and using the cross-correlation VI in an iterative fashion.

The next issue is how to implement the FPGA approach. I found the "High Throughput Streaming - From Disk" VI example, which describes how to stream data from disk. I see that in the FPGA example (7965R) there is an output FIFO which retrieves the data, and I imagine I can add my analysis code to that data within the FPGA block diagram. I am unsure of what exactly is extracted. If I've stored the data as 1 element U32, my impression is that I should tell the output FIFO to retrieve the same data (1 element of U32), and my guess is that I am retrieving the data in the same way it is stored. If this is the case, then I think I can figure out the rest.

I can attach code if needed but I think I would first like to understand conceptually how the FPGA accesses TDMS data and if that can actually improve computation time.

Jacobson-ni · 08-31-2015

I think you are on the right track for reading your file from disk. As you know, you cannot read and write files directly from your FPGA, so you will need to read them from TDMS on your host and use a host-to-target DMA FIFO to send them down to the FPGA. On your FPGA you do whatever analysis you need and send the result back to your host in a target-to-host FIFO. In terms of data types, you just have to make sure you won't have overflows, unless you want to bit pack.

Not sure if this will speed anything up but just wondering, how much data do you need to run through this analysis and how long is it currently taking?

Matt J | National Instruments | CLA

nkmath · 09-01-2015

Hi Matt,

Thanks for the response. Yes, I think I understand what you're saying about the transfer process to retrieve the data through a FIFO. My confusion now comes when the FPGA initially pulls the data from the FIFO. The way I have described it, I think it will pull 1 element on each clock cycle (125 MHz), but if I change the settings on the FIFO, could I allow for more elements to be transferred at once (like an array of values)? I will of course then need to adjust the clock so that I don't exceed any throughput limitations.

About the timing: my naive assumption is that if I could run the same analysis upon digitization within the FPGA, where the analysis process is essentially real-time, then analyzing from disk could also be done in real time, assuming the read/write limitations are equivalent.

Currently, we digitize data at 500 MB/s and do this for at least a couple of minutes. Eventually we will need to scale this up to hour-long timescales, so any given run has something like 30 GB (1 min.) to 2 TB (~1 hour). The analysis is already taking several hours to process just a couple of minutes of digitized data. I imagine that I could continue to optimize my code that performs the cross-correlation, but I would need an improvement by a factor of 100 to make it usable. My code is certainly not the most efficient, but I'm not entirely sure how to gain a 100x improvement. Maybe I can address this issue in another thread.
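As a rough check of the figures in this post, here is a minimal Python sketch of the arithmetic. It assumes decimal units (1 MB = 10^6 bytes). The 3-hour processing time in the usage lines is a hypothetical stand-in for "several hours", not a measured value.

```python
# Back-of-the-envelope data volumes for the rate quoted in the post.
RATE_BYTES_PER_S = 500e6  # 500 MB/s, decimal megabytes assumed


def run_size_bytes(duration_s, rate=RATE_BYTES_PER_S):
    """Bytes produced by one acquisition of the given duration."""
    return duration_s * rate


def slowdown(processing_s, acquired_s):
    """How many times slower than real time the analysis runs."""
    return processing_s / acquired_s


print(f"1 min:  {run_size_bytes(60) / 1e9:.0f} GB")     # 30 GB
print(f"1 hour: {run_size_bytes(3600) / 1e12:.1f} TB")  # 1.8 TB

# Hypothetical example: 3 hours of processing for 2 minutes of data.
print(f"slowdown: {slowdown(3 * 3600, 2 * 60):.0f}x")   # 90x
```

With those illustrative numbers the analysis runs about 90 times slower than acquisition, which is in line with the factor of 100 mentioned above.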

Nolan M.

nathand · 09-01-2015

nkmath wrote:

The way I have described it, I think it will pull 1 element on each clock cycle (125 MHz), but if I change the settings on the FIFO, could I allow for more elements to be transferred at once (like an array of values)?

On the FPGA side, you cannot pull more than one element per cycle from a FIFO.

nkmath wrote:

Currently, we digitize data at 500 MB/s and do this for at least a couple of minutes. Eventually we will need to scale this up to hour-long timescales, so any given run has something like 30 GB (1 min.) to 2 TB (~1 hour). The analysis is already taking several hours to process just a couple of minutes of digitized data. I imagine that I could continue to optimize my code that performs the cross-correlation, but I would need an improvement by a factor of 100 to make it usable. My code is certainly not the most efficient, but I'm not entirely sure how to gain a 100x improvement. Maybe I can address this issue in another thread.

Posting your code in a new thread sounds like a good plan. It's amazing what the experts on this forum are able to suggest to improve performance.

tcap · 09-02-2015

Hello Nolan,

With the 7965 FlexRIO, you are capable of sending 64 bits of data per cycle. This is configured in the Interfaces and Data Type sections of the DMA FIFO Project item. So you should be able to read and write an 8 element array of 8 bit numbers per cycle. Additionally, you could potentially set up multiple DMA FIFOs in each direction, increasing your throughput per cycle.

According to the 7965 Product page, you should be able to achieve more than 800 MB/s of throughput, so I don't know if you will be able to stream the data down to the FPGA at 500 MB/s and simultaneously up to the host at 500 MB/s. Regardless, this could still offer you an improvement.

This FPGA implementation will be much more complicated to build, so I recommend first exploring optimization of your host-side program to see if you can get a large improvement, as that will be much simpler. You might want to benchmark your current processing to determine your data throughput and compare it to a theoretical FPGA throughput to determine whether the FPGA approach is worthwhile.
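One way to do the benchmark suggested above is sketched below in Python rather than LabVIEW. The function names are hypothetical, the `sum` workload is only a placeholder for the real cross-correlation routine, and the 500 MB/s target is the acquisition rate quoted earlier in the thread.

```python
import time


def measure_throughput(process, data, repeats=3):
    """Return the best observed throughput of process(data) in bytes/second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(data)
        best = min(best, time.perf_counter() - start)
    return len(data) / best if best > 0 else float("inf")


def speedup_needed(measured_bytes_per_s, target_bytes_per_s=500e6):
    """Factor by which processing must speed up to keep pace with the target rate."""
    return target_bytes_per_s / measured_bytes_per_s


# Placeholder workload: replace `sum` with the real analysis routine.
block = bytes(1_000_000)
rate = measure_throughput(sum, block)
print(f"measured: {rate / 1e6:.1f} MB/s, speedup needed: {speedup_needed(rate):.1f}x")
```

Comparing the measured rate against the rate the FPGA link could sustain gives a first estimate of whether the extra development effort is likely to pay off.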

As nathand said, I recommend making a new thread for the host side optimization.

Regards,

Thomas C.
FlexRIO Product Support Engineer
National Instruments

nkmath · 09-02-2015

Hi Thomas,

I appreciate the response. Just to clarify: when you mention that we can send up to 64 bits per cycle, which could correspond to an 8-element array of 8-bit numbers, I assume that those numbers have to be packed into a single 64-bit element (or word) and unpacked into an array after retrieval. As far as I understand, pulling out an array of 8-bit numbers as is would not be supported through the FIFO.
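The packing described here can be sketched as follows, in Python for illustration only. The byte order (first sample in the lowest byte) and the function names are assumptions, not something the FIFO prescribes. Note that a later reply in this thread says the FIFO can be configured so that manual packing is not needed.

```python
def pack_u8x8(samples):
    """Pack eight 8-bit values into one 64-bit word.

    Assumed layout: samples[0] goes in the lowest byte.
    """
    if len(samples) != 8:
        raise ValueError("expected exactly 8 samples")
    word = 0
    for i, s in enumerate(samples):
        if not 0 <= s <= 0xFF:
            raise ValueError("sample out of U8 range")
        word |= s << (8 * i)
    return word


def unpack_u8x8(word):
    """Recover the eight 8-bit values from a 64-bit word (inverse of pack_u8x8)."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


samples = [1, 2, 3, 4, 5, 6, 7, 8]
word = pack_u8x8(samples)
print(hex(word))                      # 0x807060504030201
print(unpack_u8x8(word) == samples)   # True
```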

tcap wrote:

According to the 7965 Product page, you should be able to achieve more than 800 MB/s of throughput, so I don't know if you will be able to stream the data down to the FPGA at 500 MB/s and simultaneously up to the host at 500 MB/s. Regardless, this could still offer you an improvement.

I am only trying to stream data from disk to the FPGA and analyze it there at 500MB/s. The full analysis would be performed within the FPGA with the desired final result of an array of about 100 elements. That would be the only thing I would potentially store back to disk, so no need to pipe data back to the host at the same rate unless I am misunderstanding something (very possible!).
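The difference between the two cases can be checked with a few lines of Python. This sketch assumes the 800 MB/s figure is a combined budget for both directions, which the thread does not confirm, and the upstream rate for the ~100-element result (100 values of 8 bytes each, once per second) is a hypothetical placeholder.

```python
LINK_BUDGET = 800e6  # bytes/s, the quoted figure treated as a combined budget (assumption)


def fits_budget(down_bytes_per_s, up_bytes_per_s, budget=LINK_BUDGET):
    """True if the combined host-to-FPGA and FPGA-to-host traffic fits the budget."""
    return down_bytes_per_s + up_bytes_per_s <= budget


# Streaming the full data both ways: 500 MB/s down plus 500 MB/s up.
print(fits_budget(500e6, 500e6))   # False

# Streaming down only, returning ~100 values (hypothetically 8 bytes each, once per second).
print(fits_budget(500e6, 100 * 8)) # True
```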

tcap wrote:

As nathand said, I recommend making a new thread for the host side optimization.

I have done so; you can find it here: http://forums.ni.com/t5/LabVIEW/Help-in-optimizing-cross-correlation-routine/td-p/3186204. I will continue working on and thinking about the FPGA implementation, as it's an interesting problem, but in the meantime I'll take your advice and try to optimize the initial code.

Nolan M.

tcap · 09-03-2015

Hey Nolan,

If you open a new project, add a 7965, add a DMA FIFO, set the data type to I8 or U8, and select Interfaces, you can set the number of elements per read or write to 8. This allows you to write or read an 8-element array of 8-bit numbers without bit packing. I recommend giving it a try! If you need more elements, you can always run multiple DMA FIFOs in parallel.

With regard to your processing, I didn't realize you were outputting only 100 elements; this could improve your streaming to the FPGA. The problem that you might run into is that the FPGA has a limited amount of storage. For example, the 7965 only has 512 MB of DRAM, which is about 1 second of data. If your algorithm requires storing all of the data on the FPGA, you will run into a memory limitation. I recommend determining how much data you will need at any given time during your algorithm.
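To illustrate how the memory requirement can be bounded, here is a minimal Python sketch of a streaming cross-correlation that keeps only a short history of one stream. It assumes the roughly 100-element result corresponds to a fixed set of lags and computes the unnormalized sum r[k] = sum over n of a[n-k] * b[n], with no mean subtraction. This is an assumption about the analysis, not the original poster's actual routine, and the class name is hypothetical.

```python
from collections import deque


class LagCorrelator:
    """Accumulate r[k] = sum_n a[n-k] * b[n] for k = 0..num_lags-1, chunk by chunk."""

    def __init__(self, num_lags):
        self.sums = [0] * num_lags
        # Most recent samples of stream a, newest first; older ones are dropped.
        self.history = deque(maxlen=num_lags)

    def feed(self, chunk_a, chunk_b):
        if len(chunk_a) != len(chunk_b):
            raise ValueError("chunks must have equal length")
        for a, b in zip(chunk_a, chunk_b):
            self.history.appendleft(a)  # history[k] is now a[n-k]
            for k, a_delayed in enumerate(self.history):
                self.sums[k] += a_delayed * b


corr = LagCorrelator(num_lags=3)
corr.feed([1, 2], [1, 0])   # first chunk of each stream
corr.feed([3, 4], [2, 1])   # second chunk, continuing where the first ended
print(corr.sums)            # [11, 7, 4]
```

The state is one running sum per lag plus a history as long as the number of lags, so memory use does not grow with the length of the recording.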

I still recommend benchmarking your host-side application to determine whether the development time is worthwhile.

Regards,

Thomas C.
FlexRIO Product Support Engineer
National Instruments
