I don't think the way DMA buffering works is your problem so much as the sequencing of the reads and writes in your program. In your FPGA VI, the reads and writes occur in parallel. The way the code is written, the output won't update until a point has been read from the AO DMA FIFO, and the sampled input won't be written until space is available in the AI DMA FIFO. On the host VI, however, the reads and writes are serialized, and the read precedes the write inside the loop.
This means that the first time the loop runs, the host VI tries to read some number of points from the FPGA VI before its write occurs. On the FPGA side, the first time the loop runs, the AI data is sampled and written to the AI FIFO, but the read from the AO FIFO will eventually time out because the host VI hasn't written anything yet. The host VI hasn't written anything because it is still waiting for N points from the FPGA VI. In essence, you've created a temporary deadlock. Eventually, the read timeout on the AO DMA FIFO (FPGA code) or the read timeout on the AI DMA FIFO (host code) will expire and things will start rolling again.
I haven't worked everything out in my head, but I would expect the amount of buffered data available to the input and output to slide by one sample on each iteration of the loop until it eventually hits zero for one of them. At that point, you will lose another point of data and the cycle will repeat.
To fix this, you either need to reason through your sequencing to ensure this can't happen (it might be as simple as putting the write before the read in your host VI), or you should run the reads and writes in parallel in both VIs. I would also recommend adding some logic around the timeout indicators on the FPGA and host VIs as debugging aids (or just for peace of mind) if you never expect a timeout to occur.
Either that, or you will need to write your program so that it can tolerate timeouts without losing data, if that matters for your application.
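
Since LabVIEW diagrams can't be pasted as text, here is a rough Python model of the read-before-write ordering described above. It is only a sketch under simplifying assumptions, not a model of the real DMA engine: time advances in whole sample periods, the host is treated as infinitely fast, FIFO depth is unlimited, and a "timeout" is counted as any sample period in which the FPGA finds the AO FIFO empty. The names `simulate_startup`, `host_order`, and `n_points` are made up for illustration. The underflow counter plays the role of the timeout-indicator logic suggested above.

```python
from collections import deque


def simulate_startup(host_order, n_points=10, ticks=100):
    """Count sample periods in which the FPGA finds the AO FIFO empty.

    host_order is the sequence of operations in one host loop iteration,
    e.g. ("read", "write") or ("write", "read"). Hypothetical model only.
    """
    ao_fifo = deque()  # host -> FPGA: samples to generate
    ai_fifo = deque()  # FPGA -> host: samples acquired
    step = 0           # index of the next host operation to run
    underflows = 0     # stands in for a latched/counted timeout indicator

    for tick in range(ticks):
        # Host side: run operations strictly in order until one blocks.
        while True:
            op = host_order[step % len(host_order)]
            if op == "read":
                if len(ai_fifo) < n_points:
                    break  # blocked: waiting for N acquired points
                for _ in range(n_points):
                    ai_fifo.popleft()
            else:  # "write": assume the AO FIFO always has room
                ao_fifo.extend([0.0] * n_points)
            step += 1

        # FPGA side: input write and output read both happen every period.
        ai_fifo.append(float(tick))
        if ao_fifo:
            ao_fifo.popleft()
        else:
            underflows += 1  # nothing to output this period

    return underflows


read_first = simulate_startup(("read", "write"))
write_first = simulate_startup(("write", "read"))
print("read before write:", read_first, "starved periods")
print("write before read:", write_first, "starved periods")
```

In this simplified model, the read-first ordering starves the output for the first N sample periods, while the write-first ordering never does. Real hardware adds finite FIFO depths and actual timeout values, so treat the numbers as qualitative only.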
Finally, with this sort of stimulus/response or wrap-back application, you should be aware of the input delay of the 9239. Because this module performs digital filtering internally as part of the conversion process, there is an inherent delay between when a change occurs in the external signal and when it shows up in the digitized data. This is referred to as the input delay in the module's specifications, and it varies as a function of the sample rate. This means the signal acquired by the 9239 will lag the signal generated by the 9263 by more than a single sample. However, the lag should be constant throughout the acquisition and should not "slide by a sample".
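
One way to check that the lag really is constant is to estimate it at the start and at the end of a logged record and compare the two. Below is a minimal Python sketch of that idea using a brute-force cross-correlation. The function name `estimate_lag`, the test signal, and the 40-sample delay are all made up for illustration; the 40 is not the 9239's actual input delay, which you should take from the module's specifications for your sample rate.

```python
import random


def estimate_lag(stimulus, response, max_lag):
    """Return the shift (in samples) at which response best matches stimulus."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Overlap length for this candidate lag
        n = min(len(stimulus), len(response) - lag)
        # Mean product of stimulus[i] and response[i + lag] over the overlap
        score = sum(stimulus[i] * response[i + lag] for i in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic data: a noise-like stimulus and a copy delayed by 40 samples
# (an arbitrary illustrative value, not a hardware specification).
rng = random.Random(0)
stimulus = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_delay = 40
response = [0.0] * true_delay + stimulus[:-true_delay]

max_lag = 64
early = estimate_lag(stimulus[:500], response[:500 + max_lag], max_lag)
late = estimate_lag(stimulus[1000:1500], response[1000:1500 + max_lag], max_lag)
print("lag near start:", early, "lag near end:", late)
```

If the two estimates differ on real data, samples are being dropped or repeated somewhere in the loop, which points back at the FIFO timeouts rather than at the module's filter delay. A broadband stimulus works best for this check; a pure sine wave gives ambiguous results because it correlates equally well at every multiple of its period.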