RoboRIO 1.0 Crashing, need ideas!

davethedirt · ‎01-25-2024

Hi. I'm Dave, programming mentor for team 281 in Greenville, South Carolina. We could really use some help: we are experiencing roboRIO 1.0 crashes and we're about out of ideas.

The Behavior:

After the robot is powered on and the driver station is connected, the robot will lose comms within about 30 seconds to 3 minutes, depending on what we have running. No activity is required at all to produce the failure (the robot is sitting still). The progression of the failure is always:

1. Comms are lost, and ssh sessions die

2. Robot code keeps running a little longer (30 seconds to 1 minute), then dies

A hard reset of the robot is required to recover from that point.

When watching memory usage using free -m or top while connected to the rio, we notice that free memory is gradually declining.
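
For anyone who wants to capture the same trend from a program instead of watching free -m by hand, here is a minimal sketch that reads /proc/meminfo, the file free and top get their numbers from. The class name MemInfo and its methods are hypothetical helpers made up for illustration; they are not part of WPILib. MemAvailable is only reported by newer kernels, so the parser returns -1 when a key is missing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper (not part of WPILib): reads memory figures from /proc/meminfo. */
final class MemInfo {
    private MemInfo() {}

    /** Returns the value in kB for a key such as "MemFree", or -1 if the key is absent or malformed. */
    static long parseKb(List<String> lines, String key) {
        String prefix = key + ":";
        for (String line : lines) {
            if (line.startsWith(prefix)) {
                // A typical line looks like "MemFree:   22528 kB"; the first token after the colon is the number.
                String[] parts = line.substring(prefix.length()).trim().split("\\s+");
                try {
                    return Long.parseLong(parts[0]);
                } catch (NumberFormatException e) {
                    return -1;
                }
            }
        }
        return -1;
    }

    /** Reads the live value from /proc/meminfo (Linux only). */
    static long readKb(String key) throws IOException {
        return parseKb(Files.readAllLines(Path.of("/proc/meminfo")), key);
    }

    /** Prints a single reading; call it in a loop or from a periodic method to see the trend. */
    public static void main(String[] args) {
        try {
            System.out.printf("MemFree=%d kB MemAvailable=%d kB%n",
                    readKb("MemFree"), readKb("MemAvailable"));
        } catch (IOException e) {
            System.out.println("Could not read /proc/meminfo: " + e.getMessage());
        }
    }
}
```

Logging one of these values every second or so to a file gives a record that survives even when the ssh session is the first thing to die.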

We are using PhotonVision with an Orange Pi co-processor. We have noticed that the problem presents much more quickly (2-3 minutes) with vision data and camera feeds on, and much more slowly when vision is off.

Configuration

* Java robot

* roboRIO 1, 6.0 firmware, latest WPILib 2024.1.1 on both driver stations and robot

* swerve drive, with analog absolute encoders (Thrifty absolute encoders) plugged into the roboRIO analog ports

* PhotonVision on an Orange Pi with two cameras

* navX 1.0

What we have tried

* follow the NI hard reset steps, completely flashed firmware on the roborio from two different computers

* swapped this roborio for a different one, wipe and re-image

* program wpilib code on the robot from 3 different computers

* swapped radio for a new radio, programmed in both bandwidth limited and no limit mode

* tried installing and removing a usb stick for advantage kit logging

* tried unplugging the analog encoder cables

* connect with wired network to get more data via ssh before we get dumped. No dice, even on wired network, ssh is one of the first processes to die

* replaced navx with another one and also with a navx2-- no change in behavior

* replaced network cables

* ran with vision turned off. This makes it take longer to crash, but doesn't change anything else

* ran without AdvantageKit. Still crashes, but takes longer

* tried running the robot in the foreground using runCommand.sh-- but for some reason this was not working

In one session, we connected with ssh, and watched for crashes using dmesg --follow. ssh crashes first, but we did manage to catch a few dumps in the act. It looks to us like there is memory pressure-- note that all of these traces are around memory:

https://drive.google.com/file/d/10fLHnZxfmo7i5CJIFabY2j4lnAOEreTK/view

At the time this dump was taken, we had a USB storage stick in one of the roboRIO USB ports. In this trace you can see a crash of usb-storage, then navxIOThread, then frc_net_comm. They all look memory related.

We have noticed that sending a lot of network table data makes the problem happen faster.
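
One way to test that observation in a controlled way is to cap how often telemetry is published and see whether the time to crash moves with the rate. The sketch below is a hypothetical helper (PublishThrottle is a made-up name, not a WPILib or NetworkTables class) that only decides whether enough time has passed; the actual publish call is left to the caller.

```java
/** Hypothetical helper: lets a caller publish at most once per fixed period. */
final class PublishThrottle {
    private final double periodSeconds;
    // Start in the past so the very first call is always allowed.
    private double nextAllowedSeconds = Double.NEGATIVE_INFINITY;

    PublishThrottle(double periodSeconds) {
        if (!(periodSeconds > 0.0)) {
            throw new IllegalArgumentException("period must be positive");
        }
        this.periodSeconds = periodSeconds;
    }

    /** Returns true when at least one period has elapsed since the last accepted call. */
    boolean shouldPublish(double nowSeconds) {
        if (nowSeconds < nextAllowedSeconds) {
            return false;
        }
        nextAllowedSeconds = nowSeconds + periodSeconds;
        return true;
    }
}
```

In a robot project the timestamp would typically come from Timer.getFPGATimestamp(), with the NetworkTables writes wrapped in an if (throttle.shouldPublish(now)) check. This only changes how fast the symptom appears; it does not by itself identify the cause.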

When watching free memory, we noted that with a USB memory stick inserted, free memory starts at about 10 MB and gradually declines until a crash at 5 MB or so.

Without a USB stick inserted, free memory starts at 22 MB and declines much more slowly, but the crash still happens.

We have done one test with the analog encoders unplugged, and it ran quite a bit longer without crashing: we didn't run long enough to prove it would _never_ crash, but it was improved.

Since turning off vision or AdvantageKit makes the problem take a LONG time to manifest, we're pretty sure that if we ran an empty robot project, we would not see crashes.

We could use ideas about how to pin this down, help is appreciated!

davethedirt · ‎01-25-2024

Oh, one other note. We think our problem looks very similar to this one from last year:

https://www.chiefdelphi.com/t/roborio-chronically-crashing/426499/12

We were unable to follow the steps to get a core dump because ssh itself dies, but the symptoms are similar, and we're using photon vision sending a lot of networkTables data too. And indeed, reducing data logging and NT does make it take longer to crash.

davethedirt · ‎01-25-2024

sorry also one correction: we're using wpilib 2024.2.1 not 2024.1.1

oscarfonseca · ‎01-27-2024

Hello davethedirt,

Thanks for using the NI forums. I read the roboRIO is crashing periodically at different intervals and you have tried using SSH to catch the symptom.

What happens if you use a default robot code template that ships with WPILib? Does the code still crash on the roboRIO after running for some minutes? If it does, can you reproduce the crash with the same empty default robot code after reimaging the device?

If it does reproduce, I think we would need to look at some corruption on a lower level (maybe search for badblocks in the memory).

If it doesn't, then I suspect the issue is in the team code you are running on the device. I recommend disabling different parts of your code and checking when it reproduces so you can isolate the problem to the specific library or algorithm causing the issue.

Regards,

Oscar
Principal Product Manager — Academic
NI

PaPaJones12 · ‎01-29-2024

Oscar,

I am one of the mentors for FRC team 3937 out of Searcy, AR. We are currently having the same issues being reported here regarding the RoboRIO 1. We, too, are searching for ideas on how to find/correct this issue. We have performed and checked the same items as mentioned by team 281. Our configuration is very similar. Let me know what additional information you may need from us.

RoboRandy

mshafer · ‎01-29-2024

Hi PaPaJones12,

This is Matthew with NI.

1: Please post a new thread for your team so it's easier for us to keep track of which teams have tried what.

2: As Oscar noted, a good debugging step at this point would be to deploy a blank template project to test if it still reproduces (that will help narrow down device / infrastructure vs team code).

davethedirt · ‎01-29-2024

Hi, Oscar (and also NI).

We worked on this most of Saturday. We eventually solved it (at least for us).

While we did confirm that memory usage correlates with how often the crashes occur, in the end this is a problem with the navX firmware. We noticed that a navxThread was running in each crash, and confirmed it by observing that we had no issues once we removed the navX. We had earlier ruled out the navX as a cause because we also had the problem with a navX2.

We think the navx is triggering the well-known roborio 1.0 issues with I2C.

We switched to a usb-based navx, and our problem went away.

We observed that the firmware listed as the latest needed by Kauai Labs is (a) not the version they are sending in their latest installer and (b) not available for download, as far as we can tell. We've put a question out on Chief Delphi to see if anyone else has found where to get the latest navX firmware.

_anyway_ -- we are solved
