Tim,
In order to "fetch more than the amount of onboard memory" you have to use the streaming application. If you don't stream the data out of the onboard memory, you will run out of room whenever your acquisition is larger than the device's maximum memory.
The fetch from memory by your application and the storage of data into that memory by the hardware are independent processes. That's why the FIFO analogy works so well. On the input of the "FIFO" you have a hardware-timed operation that is totally system independent: it's controlled purely by the clocks, events, and triggers you provide to the hardware. On the output of the FIFO you have your software fetch, which is a system-dependent process controlled by your LabVIEW code and the HSDIO driver calls you make. The memory decouples the system from the actual acquisition, so your hardware-timed operation will not be affected by any system limitations you may have.
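As a loose software analogy (this is not HSDIO driver code; the queue size, sample counts, and function names are all invented for illustration), the decoupling looks like a producer/consumer pair sharing a bounded FIFO:

```python
import queue
import threading

# Stand-in for onboard memory depth (hypothetical, small on purpose).
fifo = queue.Queue(maxsize=8)

def hardware_side(n_samples):
    # Hardware-timed input: paced by clocks/triggers, not by the host.
    for sample in range(n_samples):
        fifo.put(sample)  # blocks only if "memory" is full (overflow risk)

def software_side(n_samples, out):
    # System-dependent fetch: your LabVIEW loop / driver calls.
    for _ in range(n_samples):
        out.append(fifo.get())

out = []
producer = threading.Thread(target=hardware_side, args=(32,))
consumer = threading.Thread(target=software_side, args=(32, out))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(out == list(range(32)))  # data comes out continuous and in order
```

The two sides run at independent paces; the queue in the middle is what keeps a slow consumer from disturbing the producer's timing, exactly as the onboard memory does for the hardware acquisition.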
That being said, in a streaming application, if you don't fetch data out of memory fast enough, the memory will eventually fill up. Fetching from memory extends the duration of the acquisition, but if you cannot fetch faster than the hardware stores data, you will eventually run out of room. The rate at which you can fetch is totally dependent on your system, i.e., how fast your hard drive is, whether other PXI devices are using the bus, etc.
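A back-of-the-envelope check makes the tradeoff concrete (the memory size and rates below are hypothetical, not specs for any particular device): memory buys you time, but only a fetch rate at or above the acquisition rate is sustainable indefinitely.

```python
def seconds_until_overflow(memory_bytes, acq_rate, fetch_rate):
    """Time until onboard memory fills, given byte rates in and out.
    Returns None if fetching keeps pace (no overflow ever)."""
    net_fill = acq_rate - fetch_rate
    if net_fill <= 0:
        return None  # sustainable: fetch rate >= acquisition rate
    return memory_bytes / net_fill

# Hypothetical example: 64 MB of onboard memory, acquiring at 200 MB/s,
# fetching at 100 MB/s over the bus -> memory fills in 0.64 s.
print(seconds_until_overflow(64e6, 200e6, 100e6))  # 0.64
print(seconds_until_overflow(64e6, 100e6, 200e6))  # None (sustainable)
```

So the onboard memory extends how long a too-slow system can survive, but it cannot rescue a fetch rate that is permanently below the acquisition rate.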
There will be interruptions in fetching data from memory, but you will not lose any data. Whether you fetch based on the backlog or follow the streaming example, you are fetching chunks of data at a time. The size of each consecutive fetch may be different, but the data is continuous. That is, the first sample of your second fetch will have been captured one clock cycle after the last sample of your first fetch.
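A rough sketch of that chunked-fetch pattern (the `read_backlog` and `fetch` functions here are invented stand-ins simulating the driver's behavior, not the actual HSDIO API):

```python
import random

random.seed(0)

# Simulated onboard memory holding one continuous acquisition.
acquired = list(range(100))
cursor = 0

def read_backlog():
    # Pretend the backlog has grown by some system-dependent amount.
    return min(len(acquired) - cursor, random.randint(1, 17))

def fetch(n):
    # Each fetch returns the next n samples, picking up exactly where
    # the previous fetch left off -- no gaps, no duplicates.
    global cursor
    chunk = acquired[cursor:cursor + n]
    cursor += n
    return chunk

waveform = []
while cursor < len(acquired):
    waveform.extend(fetch(read_backlog()))  # chunk sizes vary...

print(waveform == acquired)  # ...but the assembled data is continuous
```

The chunk boundaries land wherever the fetch loop happens to run, yet concatenating the chunks reproduces the acquisition sample for sample.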
Using an external device, you could fire a signal when the serial stream outputs some pattern, and route that as a trigger to your generation to send the next command. You could certainly use it as a Pause trigger as well, but it seems the trigger would deassert on the next sample as the pattern shifts through, so you're really only saving a single data point. Using some PLD, I'm sure you could make the trigger smarter than just a shift register and pattern matcher to get better triggering, but that would be totally dependent on your application.
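To see why a plain shift-register matcher asserts for only a single cycle, here is a toy simulation in Python (purely illustrative, not tied to any specific PLD design):

```python
def match_cycles(stream, pattern):
    """Return the clock indices at which a shift register holding the
    last len(pattern) bits exactly equals the pattern."""
    shift = []
    hits = []
    for i, bit in enumerate(stream):
        shift.append(bit)
        if len(shift) > len(pattern):
            shift.pop(0)  # oldest bit falls off the end of the register
        if shift == pattern:
            hits.append(i)
    return hits

# The pattern 1,0,1,1 completes at clock 4; one cycle later the next
# sample has shifted in and the comparator output drops again.
stream = [0, 1, 0, 1, 1, 0, 0, 1]
print(match_cycles(stream, [1, 0, 1, 1]))  # [4]
```

The match is true only on the clock where the last pattern bit arrives, which is why pausing on it saves just one data point unless the PLD adds logic to stretch or latch the trigger.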