Hi Dave, 
PCMCIA does not provide standard DMA support.  It's a limitation of the bus. 
The main problem is that every register IO goes through a DeviceIoControl call.  Every time a DeviceIoControl call is made, there's a context switch from your application to the driver and back.  This is inefficient.  Typically, drivers implement the IO on the driver side and provide an API the application can use to program the device.  For example, you can have a configure function called by the application that performs all the device configuration.  Some of the reasons why MHDDK is implemented this way are:
- In order to keep the examples as OS independent as possible they are implemented in user mode.  The kernel driver only needs to provide access to the hardware, which usually means a simple OS-specific driver.  Keeping the kernel driver simple makes it easier to port it, or to create one from scratch, when porting the MHDDK examples.  Also, the examples can focus on illustrating the device programming.
- PCI devices are memory mapped.  If you look at the resources of a PCI device in the Windows Device Manager you'll see "Memory Range" resources.  These address spaces can be mapped to the user application.  Once the memory is mapped the application can communicate directly with the hardware using memory IO (pointer accesses), which is very fast and doesn't cause a user-kernel transition.
- NI PCMCIA devices use IO space (IO Range in the Device Manager).  On x86 processors this means that access to the device is done using a separate set of assembly instructions (port IO).  PCMCIA IO space cannot be mapped.  To keep the same example code used for PCI, the driver provides functions to do the IO on behalf of the application.
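To make the difference between the two access models concrete, here is a small portable simulation in C.  It is only a sketch, not MHDDK or driver code: fake_device_regs, mmio_read16, driver_port_read16 and the transitions counter are all hypothetical names, and the counter simply stands in for the DeviceIoControl round trip.

```c
#include <stdint.h>

/* Hypothetical stand-in for the device registers.  On real hardware this
   would be a mapped PCI memory range or a PCMCIA IO range. */
static uint16_t fake_device_regs[8];

/* Counts simulated user-kernel transitions (assumption: one per driver call). */
static unsigned long transitions;

/* Memory-mapped style (PCI): a plain pointer access from the application.
   No driver call is involved, so the counter is not touched. */
static uint16_t mmio_read16(volatile uint16_t *base, unsigned index)
{
    return base[index];
}

/* Port-IO style (PCMCIA): the driver performs the access on behalf of the
   application, so every register read costs one round trip. */
static uint16_t driver_port_read16(unsigned index)
{
    transitions++;                  /* stands in for a DeviceIoControl call */
    return fake_device_regs[index];
}
```

Reading the same register both ways returns the same value, but only the driver-mediated read increments the transition counter.  That per-access cost is what adds up when every register IO goes through the driver.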
The best way to improve performance is to move part of the example code to the driver.  This is essentially option 3 on your list.  You would be writing your own driver, or at least some support functions.  For example, a new IO control code (IOCTL) would be used to read from the FIFO.  The code in the example used to read the FIFO would be moved to the driver side and the application would call DeviceIoControl once to request a number of samples.  The loop that queries the FIFO and puts the data in the buffer then runs on the driver side.
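Here is a rough sketch of that idea, again as a portable simulation rather than real driver code.  The FIFO model and every function name (sim_fifo_push, hw_fifo_read, ioctl_read_fifo_block, and so on) are made up for illustration, and the transitions counter stands in for the cost of one DeviceIoControl call.  A real driver would also have to decide how long to wait for samples; this sketch just returns what is available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 16

/* Simulated device FIFO (hypothetical model, not the real register map). */
static uint16_t fifo_data[FIFO_CAPACITY];
static size_t fifo_head, fifo_count;
static unsigned long transitions;   /* simulated user-kernel round trips */

/* Test helper: the "hardware" produces a sample. */
static void sim_fifo_push(uint16_t v)
{
    if (fifo_count < FIFO_CAPACITY) {
        fifo_data[(fifo_head + fifo_count) % FIFO_CAPACITY] = v;
        fifo_count++;
    }
}

/* Raw register accesses, as the driver would do them with port IO. */
static int hw_fifo_not_empty(void) { return fifo_count > 0; }

static uint16_t hw_fifo_read(void)
{
    uint16_t v = fifo_data[fifo_head];
    fifo_head = (fifo_head + 1) % FIFO_CAPACITY;
    fifo_count--;
    return v;
}

/* Current approach: each register access is its own driver call. */
static int ioctl_fifo_not_empty(void) { transitions++; return hw_fifo_not_empty(); }
static uint16_t ioctl_fifo_read(void) { transitions++; return hw_fifo_read(); }

/* Application-side loop: two round trips per sample. */
static size_t app_read_per_register(uint16_t *buf, size_t n)
{
    size_t got = 0;
    while (got < n && ioctl_fifo_not_empty())
        buf[got++] = ioctl_fifo_read();
    return got;
}

/* Proposed approach: one driver call, and the loop runs on the driver side. */
static size_t ioctl_read_fifo_block(uint16_t *buf, size_t n)
{
    size_t got = 0;
    assert(buf != NULL);
    transitions++;                  /* the single round trip */
    while (got < n && hw_fifo_not_empty())
        buf[got++] = hw_fifo_read();
    return got;
}
```

In this model, reading 4 samples costs 8 round trips with the per-register loop and 1 with the block read.  The gap grows linearly with the number of samples requested.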
Hope this provides a starting point.  Let me know if you have any questions. 
DiegoF