FIRST Robotics Competition Discussions

cancel
Showing results for 
Search instead for 
Did you mean: 

AnalogModule->GetAverageValue() gives wrong results if polled too fast

I have found that if I try to read from Analog Module 1 while the battery reading jumper is on it, I occasionally get a value of 1628, even if the signal line is grounded.

This only happens if the battery jumper is in place, and if a Wait(.005) is omitted from the read loop.

I have produced a minimal Wind River project to demonstrate this. Here is the main routine, but the zipped project file is attached:

class AnalogModuleBugTest : public SimpleRobot
{
    Timer timer;
    AnalogModule *aMod;

public:
    AnalogModuleBugTest(void):
        timer()
    {   
        aMod = AnalogModule::GetInstance(1);
        GetWatchdog().SetExpiration(0.1);
        timer.Start();
    }

void Autonomous(void){
        double deltaT = 0;
        float sampleRate = 0;
        INT32 avgValue;
        bool error = false;
        timer.Reset();
        sampleRate = aMod->GetSampleRate();
        while (IsAutonomous())
        {
            GetWatchdog().Feed();

            deltaT = timer.Get();
            if (deltaT > .02){
                avgValue = aMod->GetAverageValue(6);
                if (avgValue > 500){
                     error = true; //<- put breakpoint here <<<-should never get here!
                }
                timer.Reset();
            }   
        //Wait(.005);  // (.005 seems OK) (.0005 has error) (.001 has error)
        }
    }

... balance of code not relevant.

Here is the complete description, which is also included in the zip file ( and the line

numbers will make more sense in the IDE):

This applies to Analog Module 1 only.

Although channel 6 is used in this example I have duplicated the
results for all other channels, except 8 which is battery.

Setup:
(1) Place a jumper on Channel 6 of AnalogModule 1 (slot 1) between
ground and signal, so the voltage and avgValue should be 0, + noise.
Typical value is -4.

(2) Battery jumper is IN place on the breakout board.

(3) Breakpoint is set on line 33 "error = true".

(4) Line 37 "Wait(.005)" is commented out.

(5) Program is run as Debug Kernal Task - with option to attach spawned
tasks checked.

Result:
Breakpoint is reached within 3-4 minutes, often sooner, with
avgValue near 1630.  If the program is resumed from the breakpoint,
the breakpoint is reached again within a similar timeframe.


Comments:
(1) With Wait(.005) at line 37 active the breakpoint was not reached
in a 2 hour span.

(2) With Wait(.0005) the breakpoint was reached as usual.

(3) With battery jumper OUT the breakpoint was not reached in several
hours. (Program was restarted after removing jumper).

(4) With battery jumper OUT - and the breakpoint not having been
reached for 20 minutes - the jumper was put IN while the program was
running.  The breakpoint was reached with 2 minutes.

(3) Same results if using a gyro or accelerometer on channel 6 (or other
channel, including accumulator channels), whether using a live signal
or jumper to ground.

(4) SampleRate is 50000 as is, slightly higher ~51000 if a gyro is used due
to the oversampling/sample rate conversion.

(5) Module 2 (slot 2) seems not to be affected, possibly because placing
the battery jumper on that module has no effect.

Based on these tests it appears that there is a bug when the battery
jumper is in place on module 1.

Regards,
David

0 Kudos
Message 1 of 13
(9,266 Views)

This seems like a similar problem to the one here: http://www.chiefdelphi.com/forums/showthread.php?t=73640. However, I don't see a resolution in that thread.

0 Kudos
Message 2 of 13
(3,637 Views)

It could be the same problem - but their code, while not complicated,  seemed to discourage support from investigating further.

I think I've narrowed the problem down to a very specific set of conditions that should help support troubleshoot this issue. And it

could be a serious issue if you're relying on sensor input for control.

Thanks for pointing out that thread.

David

0 Kudos
Message 3 of 13
(3,637 Views)

This is what I think is going on, I can check into it more when I get back to work on Monday.

The GetAverageValues function grabs values from the AI averaging engine on the cRIO FPGA. Since this is an average value, it takes time to make. The FPGA needs X amount of time to get enough data from the module to average and store into its averaging buffer.

If, on the Real-Time side, you read from this buffer faster than the FPGA is filling it, you will see behavior like this. The reason it is only happening when you jumper in the battery voltage is because the module has an aggregate sampling rate. Meaning the more things you are monitoring, the slower the module samples each.

However, a 5ms wait (or more) should be in your loops on the RT side anyway. Is there a reason you want to loop faster than that? I don't remember the numbers off the top of my head, but I think the motors only update every 20ms anyway. So grabbing average values faster than 20ms to do say, motor calculations, has no benefit.

Additionally, looping incredibly fast in one loop will starve the RT Controller from doing other tasks like network IO. Any other parallel tasks you have, like vision, will not have time to execute.

Stephen B
0 Kudos
Message 4 of 13
(3,637 Views)

Stephen,

I just ran the same code using aMod->GetValue(6), instead of of GetAverageValue(6), and I got the same result - so it's not the averaging that's doing it.

My Robot code did have a Wait(.005) in the two Continuous mode loops; and I agree that polling faster than, say, the motor update time is of no benefit if I am trying to control the motors.

The problem arose when I was experimenting with different Wait() values.

Maybe I don't understand how the Real-Time system works.  I thought that each request I make for data, such as aMod->GetValue(6), eventually calls code at the FPGA level, and that the code there must execute completely before returning a value to my calling function.  Further, I thought that the FPGA code included checking the status of the data lines for valid data - and if necessary waiting for a valid status signal before returning a value to the user code.

The net effect of the above being that my code can never execute faster than valid data can be returned.  I thought that the worst that could happen was that I might get the same, unchanging data several times in succession if my request rate was faster than the data refresh rate.  If this isn't correct - i.e, if there is an absolute minimum wait time required in the continuous loops - can you please point me to the documentation so I may understand it better.

Thanks for your response,

David

I

0 Kudos
Message 5 of 13
(3,637 Views)

Stephen,

First let me say that I cross-posted this message to ChiefDelphi and I got a response from a team that had a similar problem in January.  They said their problem went away when they updated to the lasted code versions.  I am already running the latest software/firmware.  Maybe they also changed their code or increased their Wait() times etc.

I ran more tests and the "spurious" reading I get is directly related to the voltage on the battery jumper pin.  If the DS shows my voltage as 14.15 Volts, the reading I get is 1635, if my voltqage dips to 13.90 the reading dips as well to 1606 ( = 1635 x (13.90/14.15)).   I applied a test voltage directly to the jumper pin, and when the reading on the DS was 8.55 Volts, the spurious reading I got dropped to 987 - which is the value to be expected based on the ratio above.

The reason the problem didn't show up with the jumper pin off,  is that I was testing for a "high" spurious value, when the value on pin 8 was near zero with the jumper off.

So I put 5 volts on pin 6, and took the battery jumper off - and trapped on (aMod->GetValue(6) < 100).  Sure enough it broke with a value of -4.

I ran some additional tests, the results of which indicate that when I get spurious values, they do not come from signals on any other channels - only channel 8.

That's as much troubleshooting as I can do from here.  Hopefully you can reproduce this problem on your equipment.

Regards,

David

0 Kudos
Message 6 of 13
(3,637 Views)

Nice work narrowing down the problem.  I'd say try DMAing the data to check for glitches there, but that's not supported in the C++ version of WPILib right now.  This would get the data by a completely independent interface.

The most likely source of the problem is the fact that the way that WPILib accesses the analog data from the FPGA is through a windowed interface.  That means that the library must write a register to indicate which channel to read from, strobe a bit to load that data into the output register, then read the data from another register.  There are two places that this could break down.  If the synchronization in the library is not working, then one task could change the channel being read before another task can read the data it requested.  I don't believe this to be the case.  The other possibility is that there is a small race condition in the interface to the FPGA (a few PCI clocks worth worst case, 0 clocks typically) in the window interface from the time when the read bit is strobed and when the data is available in the register.  This is the most likely source of the problem.

The reason you would see what you are seeing is that the framework is periodically sampling the battery voltage.  By sampling the other channel very quickly, you increase the chances that you hit the condition where the battery was just sampled and the value that you requested has not been loaded yet.  I believe you would see the same problem if you read some other channel periodically as well.  I don't think it is anything specific to channel 8.

You could test this theory by modifying WPILib AnalogModule.cpp GetValue() and GetAverageValue() to add a 100ns or so delay between strobeLatchOutput() and readOutput().

This is just Gedanken debugging, but I hope this helps.

-Joe

0 Kudos
Message 7 of 13
(3,637 Views)

Joe,

Thanks for the thoughts and suggestions; but alas, no joy.

I pinned channels 3,4 and 5 to the 5 Volt supply and called a GetValue() on each of them right before getting the value on 6, which was grounded.  The only bad data I got came from channel 8.

Next I added the 100 ns Wait(), and when that didn't work I added a 1ms Wait(), (see below), and the problem persists.  An additional bit of information:  if I single-step throught this routine I get the value on channel 8 much more frequently than if I wait for the trap breakpoint to hit.

Aren't the semaphores supposed to protect the registers in question during data aquisition?

Meanwhile, Stephen Barrett from NI is supposed to look into this.  I met him at the kick-off event in Philadelphia and he seems pretty sharp;  hopefully he can get to the bottom of it.

Regards,

David

Modified routine below:

INT16 AnalogModule::GetValue(UINT32 channel)
{
    INT16 value;
    CheckAnalogChannel(channel);

    tAI::tReadSelect readSelect;
    readSelect.Channel = channel - 1;
    readSelect.Module = SlotToIndex(m_slot);
    readSelect.Averaged = false;

    semTake(semaphore, WAIT_FOREVER);
    {
        m_module->writeReadSelect(readSelect, &status);
        m_module->strobeLatchOutput(&status);
        Wait(.001);                                                              <-----  added 1 ms wait!
        value = (INT16) m_module->readOutput(&status);
    }
    semGive(semaphore);

    wpi_assertCleanStatus(status);
    return value;
}

0 Kudos
Message 8 of 13
(3,637 Views)

Hi David,

Joe is actually one of our developers here even though his forum identity doesn't reflect it. On a side note, neither does mine... I just changed my image to the NI logo. There is some bug in the community system that isn't labeling NI employees right. Oh well.

Anyway, I reproduced the problem today. Many thanks to your attached files and great explanation. As from your investigation inside the WPI source code revealed, accessing the registers on the FPGA is protected by a semaphore.

I went and made the exact same code as yours in LabVIEW to try to isolate the issue further and I could not reproduce it there.

I went digging inside the DriverStation.cpp and it calls the get GetAverageVoltage function, which calls the GetAverageValue function and scales it so that it knows battery voltage. Meaning that all reads going through the same semaphore. So the semaphore is protecting the registers from being accessed from multiple places and the wait between the register write and the register read should prevent any race condition.... yet this is still happening.

I've got a setup here to continue debugging the code to figure out what is going on and I will continue to update you with what I find. However, there is still no reason I can think of to need to loop this fast.

Stephen B
0 Kudos
Message 9 of 13
(3,637 Views)

David,

I think if you look at the semaphore, you'll see that it is NULL.  Unfortunatly, null semaphores don't protect anything. 

This thread details the issue: http://forums.usfirst.org/showthread.php?t=12020

I updated the library to add the SEM_Q_PRIORITY option to the semMCreate call and the problem went away.

Don't forget that the race in the hardware that I described earlier really does still exist.  It was just dwarfed by the complete lack of protection.

Cheers!

-Joe

0 Kudos
Message 10 of 13
(3,637 Views)