LabVIEW


Multiplexed multiply to conserve resources on FPGA?

Solved!

Hi,

I have a lot of high-throughput multiplies in my FPGA VI, and I'm quickly running out of space (DSP48Es in particular). I have multiple subVIs, each with parallel and/or sequential multiplies (and additions). To save space, I thought that creating a multiplexed multiply would help. I know non-reentrant VIs save space by reusing hardware, but I would like to save space within a subVI (also, most of my subVIs are only called once, so making them non-reentrant won't help). A multiplexed multiply sacrifices parallelism but could save a lot of space. I want to use it in a single-cycle Timed Loop.

 

Does anything like this exist?

I'm currently struggling through the handshaking logic to build it myself. I'll post it when finished, but wanted to check whether it already exists.

Message 1 of 3

Rewriting individual processing blocks to share a multiplier by looping through a state machine or similar design pattern is usually the best way to go. You can use a non-reentrant subVI if the calls are located somewhere near the top level, but once you put a blocking call like that into a reentrant subVI hierarchy, it is very hard to control how the resources are used.
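As a rough illustration of the state-machine pattern described above, here is a behavioral sketch in Python (not LabVIEW FPGA code). It assumes a hypothetical processing block that needs a sum of products and performs one multiply per clock tick instead of all of them in parallel. The state names and the function name are made up for this example.

```python
def sum_of_products_shared(pairs):
    """Compute sum(a * b) over `pairs` using one multiply per clock tick.

    Each pass through the while loop models one tick of a clocked loop.
    Returns (result, ticks_used).
    """
    state = "LOAD"
    index = 0
    acc = 0
    ticks = 0
    while state != "DONE":
        ticks += 1
        if state == "LOAD":
            # Reset the accumulator and operand index before starting.
            index, acc = 0, 0
            state = "MAC" if pairs else "DONE"
        elif state == "MAC":
            a, b = pairs[index]
            acc += a * b  # the only multiplier in the design
            index += 1
            if index == len(pairs):
                state = "DONE"
    return acc, ticks


if __name__ == "__main__":
    print(sum_of_products_shared([(1, 2), (3, 4), (5, 6)]))
```

In this model, three products cost one load tick plus three multiply ticks, where a fully parallel version would use three multipliers in a single tick. That is the parallelism-for-area trade being discussed.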

 

I have once or twice in the past implemented a separate little processing block that let callers post work items into a queue; I then manually replicated the processing block and used a round-robin scheduler to feed each packet through the shared resources. It's doable, but it only pays off for larger processing blocks because of the queuing overhead.
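To make the queue plus round-robin idea concrete, here is a small software sketch in Python. It only models the scheduling order, not FPGA timing or FIFO depth. The request tags, the `process` callback, and the function name are assumptions for illustration and do not come from an actual implementation.

```python
from collections import deque


def round_robin_dispatch(requests, num_blocks, process):
    """Feed queued (tag, payload) requests through replicated blocks in turn.

    `num_blocks` models how many copies of the processing block exist.
    Returns a list of (tag, block_index, result) in completion order.
    """
    queue = deque(requests)
    results = []
    block = 0
    while queue:
        tag, payload = queue.popleft()
        results.append((tag, block, process(payload)))
        # Move on to the next replicated block, wrapping around.
        block = (block + 1) % num_blocks
    return results


if __name__ == "__main__":
    jobs = [("a", 2), ("b", 3), ("c", 4)]
    for row in round_robin_dispatch(jobs, num_blocks=2, process=lambda x: x * x):
        print(row)
```

The tag carried with each request is what lets the caller match a result back to whoever posted it. That bookkeeping is part of the queuing overhead mentioned above.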

Message 2 of 3
Solution
Accepted by topic author Andrew_P

I worked out the logic for the handshaking signals. The output valid signal goes high for one cycle (I haven't checked whether that's how the rest of the FPGA VIs do handshaking). Here's an example of 4 multiplexed inputs/outputs using a single high-throughput multiply block. It is designed to run in a single-cycle Timed Loop and takes a total of six clock ticks (4+2). The code can easily be modified to change the number of multiplexed inputs/outputs and the FXP configuration. Someone could probably streamline the state-machine logic and modify it to handle pipelining/registers in the multiply block. This block could also be made non-reentrant to save even more resources.
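For readers without the attached VI, the handshaking described above can be approximated by the following behavioral model in Python. This is a sketch under assumptions, not a transcription of the VI: the split of the six ticks into one tick to latch the inputs, four ticks to multiply, and one tick to assert output valid is a guess at what "4+2" means, and the class, method, and state names are hypothetical.

```python
class MultiplexedMultiply:
    """Behavioral model of N operand pairs sharing one multiplier."""

    def __init__(self, n=4):
        self.n = n
        self.state = "IDLE"
        self.index = 0
        self.xs = [0] * n
        self.ys = [0] * n
        self.products = [0] * n

    def tick(self, input_valid=False, xs=None, ys=None):
        """Advance one clock tick and return (output_valid, products)."""
        if self.state == "IDLE":
            if input_valid:
                # Tick 1 (assumed): latch all operand pairs.
                self.xs, self.ys = list(xs), list(ys)
                self.index = 0
                self.state = "MULTIPLY"
            return False, None
        if self.state == "MULTIPLY":
            # Ticks 2..N+1: one pair per tick through the single multiplier.
            i = self.index
            self.products[i] = self.xs[i] * self.ys[i]
            self.index += 1
            if self.index == self.n:
                self.state = "DONE"
            return False, None
        # DONE: output valid is high for exactly this one tick.
        self.state = "IDLE"
        return True, list(self.products)


def run_once(xs, ys):
    """Drive one transaction; return (products, ticks_until_valid)."""
    mm = MultiplexedMultiply(len(xs))
    valid, out = mm.tick(True, xs, ys)
    ticks = 1
    while not valid:
        valid, out = mm.tick()
        ticks += 1
    return out, ticks


if __name__ == "__main__":
    print(run_once([1, 2, 3, 4], [5, 6, 7, 8]))
```

With four pairs, this model raises output valid on the sixth tick and drops it again on the next one, which matches the one-cycle valid pulse and the six-tick total described in the post.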

In one project, I used this multiplexed multiply twice (replacing two sets of four parallel multiplies, each with signed 20-bit, 5-bit word length fixed-point inputs and output). The resource utilization comparison is shown below. The number of DSP48Es dropped by 6, while other logic (slices, flip-flops, LUTs) increased slightly.

w/o multiplexed multiplies:
Total Slices: 46.4% (13367 out of 28800)
Flip Flops: 25.3% (7280 out of 28800)
Total LUTs: 41.5% (11959 out of 28800)
DSP48Es: 91.7% (44 out of 48)
Block RAMs: 0.0% (0 out of 48)

w/ multiplexed multiplies:
Total Slices: 47.2% (13581 out of 28800)
Flip Flops: 26.0% (7481 out of 28800)
Total LUTs: 42.7% (12296 out of 28800)
DSP48Es: 79.2% (38 out of 48)
Block RAMs: 0.0% (0 out of 48)
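The comparison can be re-derived from the raw counts in the two reports. The short Python sketch below recomputes the percentages and the per-resource change. It also checks the DSP48E saving against a simple estimate, assuming each multiply of this width maps to exactly one DSP48E (an assumption, though it is consistent with the reported numbers).

```python
def utilization(used, total):
    """Percent utilization rounded to one decimal place."""
    return round(100.0 * used / total, 1)


# (used, total) pairs copied from the two compile reports above.
before = {"slices": (13367, 28800), "flip_flops": (7280, 28800),
          "luts": (11959, 28800), "dsp48e": (44, 48)}
after = {"slices": (13581, 28800), "flip_flops": (7481, 28800),
         "luts": (12296, 28800), "dsp48e": (38, 48)}

# Change in used count per resource (negative means saved).
delta = {name: after[name][0] - before[name][0] for name in before}

# Two groups of four parallel multiplies, each replaced by one multiplier.
# Assumes one DSP48E per multiply.
groups, multiplies_per_group = 2, 4
expected_dsp_saved = groups * (multiplies_per_group - 1)

if __name__ == "__main__":
    for name in before:
        print(name, utilization(*before[name]), "->",
              utilization(*after[name]), "delta", delta[name])
    print("expected DSP48Es saved:", expected_dsp_saved)
```

Under that assumption, the estimate of 2 x (4 - 1) = 6 multipliers saved agrees with the drop from 44 to 38 DSP48Es, at the cost of a few hundred extra slices, flip-flops, and LUTs for the multiplexing and state logic.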

Message 3 of 3