First, thank you Josh and DFGray for your interesting answers. To clarify, I use Visual C and I have a PCI system.
I played around with the chunk size and found optimal performance between about 256k and 512k for a 2*16.6 MHz sample rate. I also changed to a non-zero timeout value. However, when using 2*20 MHz, 100% of one CPU (P4, 3GHz, HT disabled) was always occupied. If I use a 'bad' chunk size there, the old data gets overwritten before it has been transferred to main memory, and the routine stops.
It is very interesting that for 2*10 MHz only 10% of one CPU (P4, 3GHz, HT disabled) was used, and for 2*2 MHz the CPU was virtually idle. There seems to be a strongly non-linear increase in CPU load as the data-transfer rate approaches 40 MSamples/s. By the way, this was the maximum transfer rate I could sustain continuously.
A major improvement came when I activated Hyper-Threading Technology (HT) on the Pentium IV. After that, 2*20 MHz used only 50% of one CPU, i.e. the DMA transfer blocks one of the two virtual CPUs of one P4. Is there an explanation for that?
Regarding the direct DMA transfer: NI-SCOPE is now very easy to use, and I highly appreciate that I do not have to care about the details you mentioned. However, for our application (GPS signal processing) we actually need as much processing power as we can get (basically to increase the number of channels). So if there is some more or less easy way to reduce the processing load used for the DMA transfer, and also to increase the continuous transfer rate to 2*33.3 MHz, I would be very interested in that.
thanks again, thomas