cRIO-9047 crash and loss of network connection

emacleo1 · ‎01-19-2024

Hi all,

I am currently having an issue with a remotely deployed cRIO after many months of smooth operation, and I am stumped as to what may be causing it. We have a system that records continuous data at a bridge we are monitoring and uploads it to SharePoint. The system is what I would describe as “just works”, as we are structural engineers and a lot of the finer nuances of both LabVIEW and network communication are currently over our heads. It is a remote system that sits in a cabinet under the bridge. Our system’s hardware comprises two parts:

cRIO-9047 on-site with a mix of NI-9230 and NI-9237 cards, as well as a PoE Basler camera connected via Ethernet port 1
Dell desktop computer for remote access and file transfer

The way the data flow works is that there are two VIs, one to collect sensor data and one to collect images, that are packaged into a real-time startup app. The data is written to TDMS files on an external hard drive plugged into the USB port on the cRIO, and a new file is generated each hour. Every day a Python script on the computer executes and connects to the cRIO through SFTP, then transfers and zips the files to the computer, where they are passively synced via SharePoint through our fixed wireless internet connection on site. It sends an email with either a success notice and the list of files transferred, or a failure notice. This is all accessible to us at the office through TeamViewer, which typically enables us to debug our system remotely. We have this system running at two bridges, and it has run perfectly fine for months with no issues besides the odd external power loss.

However, we have recently begun to experience problems, and now the system will not run for more than a day. It all started when I received a file-transfer-failed email. When I logged into the machine remotely, I could no longer connect to the cRIO via FTP or see it in NI MAX. Initially, I thought this was more of a network issue (we have had problems with the stability of the Ethernet connection between the computer and cRIO in the past), and I figured I’d maybe need to reset the driver or just cycle the power. I figured it wasn’t a big problem, as the cRIO should be happily running headlessly, collecting data, and I could just transfer the backlogged data manually when I had the chance to get to the site. The power cycle was accomplished for me by a regional power loss due to a snowstorm; afterwards, the system seemed to come back with no problem and I could connect to the cRIO again.

The system only ran for a few days before sending another file transfer error. At this point I went down to the site to have a look, thinking maybe power was lost to the cRIO, but when I checked it was powered on and all the lights were green. However, the network adaptor card in the computer was flashing yellow and orange (figured this was not good?!). I cycled the cRIO power and it all booted back up and started collecting data again; however, this time it only ran for a day before the problem occurred again. When I checked the hard drive manually, I finally clued in and noticed that it had not kept collecting data after the network connection loss, as I had assumed: the files ended abruptly at the same time the connection was lost. So this problem was causing something to hang on the cRIO as well.
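To pin down exactly when logging stopped each time, the hourly file timestamps can be compared against each other. Below is a minimal Python sketch, assuming the files are hourly `.tdms` files whose last-modified time roughly marks the end of each hour of recording. The folder path `D:/bridge_data`, the ten-minute tolerance, and the function names are all hypothetical. It is best run against the USB drive itself, since a copy made over SFTP may not keep the original modification times.

```python
from datetime import datetime, timedelta
from pathlib import Path


def file_times(folder, pattern="*.tdms"):
    """Return the sorted last-modified times of files matching pattern."""
    return sorted(
        datetime.fromtimestamp(p.stat().st_mtime)
        for p in Path(folder).glob(pattern)
    )


def find_gaps(times, expected=timedelta(hours=1), tolerance=timedelta(minutes=10)):
    """Return (previous, next) pairs spaced further apart than expected + tolerance."""
    times = sorted(times)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > expected + tolerance]


if __name__ == "__main__":
    # Hypothetical location of the TDMS files; adjust to the real drive.
    for before, after in find_gaps(file_times("D:/bridge_data")):
        print(f"Logging stopped around {before} and resumed around {after}")
```

Each reported gap can then be lined up against the time of the failed-transfer email to see whether the app and the network connection really do stop together every time.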

Thinking it was the network connection that was somehow causing issues, I reset the network adaptors on my computer, which inadvertently uninstalled my TeamViewer connection, so I ended up needing to go back to the site anyway to re-establish the connections manually. It again ran for only a few hours, so I went back intending to remove the whole system so that I could troubleshoot it under more supervision back at the office, as it seemed like the problem was bigger than just a shoddy network connection.

I brought back the controller and the computer, plugged in a second PoE camera that we had, and set the code to run with some added probes in remote panels to see if I could catch the error that was crashing my code. The system then proceeded to run without a hitch for 4 days. The only outlier is that my camera keeps dropping after an hour or so, throwing the camera not found error. This happens with both my code and the LabVIEW examples, so maybe my Amazon PoE injector is patchy? This is a new problem in the lab and was not happening on-site. I am not sure what effect this is having on my overall debugging, as the camera VI still runs; it just zips empty folders instead of folders with images.

So in summary, our system worked fine for months, then began to experience some error that stops the code and leaves the cRIO in a state where we can no longer connect to it, and this now happens after shorter and shorter intervals of working.

So after that tirade, my questions would be:

What type of error could my system be experiencing that both hangs my real-time app and makes it so I cannot connect to the cRIO?
How would this have just started out of the blue in one of my two systems?
What are some steps I could take to try to identify it now that it seems to be running? I want to avoid putting it all back until I have it sorted (it is cold (-20 °C) to be debugging LabVIEW under a bridge, ha!)

I have attached my VIs below and can provide any other material that may be useful!

Thanks

Ethan

Apex_Waves · ‎01-19-2024

Hello!

The first thing to check would be the external hard drive. You can check the specifications for it, but I doubt that it is rated for temperatures that low. USB hard drives fail often, especially if they are constantly running in an extreme environment. A failing hard drive can cause all sorts of weird problems, but you can use CrystalDiskInfo to check its health. Even if the check shows as "good" in the lab, the drive may still have problems out in the field. You will want to replace that first and then move on to your PoE injectors, which also probably don’t like operating below 0 °C.

Check the camera specifications and see if the camera is rated for those temperatures. If you still have problems, you can format the cRIO from MAX and reinstall the software; you can also choose to keep the network settings before it formats. The cRIO is rated for extreme environments, so that is not likely the problem.

You can always ping the cRIO’s IP address or do a trace route to verify the network connection. To do this, open the Windows command prompt and type "ping" followed by the IP address of the cRIO (e.g. ‘ping 192.168.1.50’ or ‘tracert 192.168.1.50’). Look for a long hop time, probably anything more than 30 ms. Ask your IT department for assistance if you notice a long hop. A faulty switch in your network can cause intermittent connection issues.
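Since the problem is intermittent, a single ping may not catch it, so it can help to log pings over time. Here is a rough Python sketch of that idea, assuming a Windows machine with English-language ping output. The log file name `crio_ping.log` and the function names are made up for illustration, the IP address is just the example one from above, and the 30 ms threshold is the same rule of thumb.

```python
import re
import subprocess
import time
from datetime import datetime

# Matches "time=12ms" and "time<1ms" as printed by Windows ping (English locale).
_TIME_RE = re.compile(r"time[=<](\d+(?:\.\d+)?)\s*ms", re.IGNORECASE)


def parse_ping_times(output: str) -> list[float]:
    """Extract round-trip times in ms from ping output; empty list if no reply."""
    return [float(t) for t in _TIME_RE.findall(output)]


def classify(times: list[float], slow_ms: float = 30.0) -> str:
    """Turn the parsed times of one ping into a short status string."""
    if not times:
        return "NO REPLY"
    if times[0] > slow_ms:
        return f"SLOW {times[0]:.0f} ms"
    return f"ok {times[0]:.0f} ms"


def ping_once(host: str, timeout_ms: int = 1000) -> str:
    """Send one echo request using Windows ping flags (-n count, -w timeout in ms)."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), host],
        capture_output=True,
        text=True,
    )
    return result.stdout


def log_pings(host: str, logfile: str, interval_s: float = 60.0) -> None:
    """Append one timestamped status line per interval until interrupted."""
    while True:
        status = classify(parse_ping_times(ping_once(host)))
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(logfile, "a", encoding="utf-8") as f:
            f.write(f"{stamp} {host} {status}\n")
        time.sleep(interval_s)


if __name__ == "__main__":
    log_pings("192.168.1.50", "crio_ping.log")  # example IP and hypothetical log name
```

Left running on the desktop, the log shows when replies slow down or stop, which can then be compared with the time of the last data written by the cRIO.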

You might find these links useful:
cRIO-904x User Manual
Microsoft TRACERT

Hope this helps!

emacleo1 · ‎01-19-2024

Hi, thank you for the suggestions! I should have mentioned that the cabinet is heated specifically for the temperature-related reasons you mentioned, and it has performed well in Canadian winter temperatures previously.

As for the PoE, it’s only since using my second camera and PoE injector at the office that I have experienced problems; I have never received that error previously. In the lab, the camera dropping does not cause my whole system to crash, and in the field, I have images collected right up until the program crashes, so I am not sure if they are related?

I could see a faulty drive causing issues with the code trying to write files, but do you think that type of issue would also make the cRIO not show up in MAX and not be reachable via FTP, as I am experiencing when the issue occurs?

I can definitely reinstall all the software on the cRIO; that seems like a safe bet to try.

For the network test, I can give that a go too. The cRIO and the desktop at the site are connected locally; if there was a long delay in the ping, would this be a fault with the network card then? We access the whole thing through our fixed wireless internet connection (it’s a rural site) and it can be slow at times, but there is not much we can do there.

Thanks again for the help, much appreciated

Apex_Waves · ‎01-22-2024

A faulty drive could definitely cause issues with writing files, and it could also make the whole controller lock up and become unresponsive over the network. If you see a long response time with a ping in the lab, that could indicate a local networking problem. You will want to check the Ethernet patch cables in the lab, including the one connected to the camera. I suggest using quality Cat5e or Cat6 cables; avoid the cheap / no-brand ones from Amazon. Some good brands would be Tripp Lite or Cable Matters.

If the computer is relatively new, it probably has an integrated network card, so it can’t be replaced. More likely, the drivers or some installed software are causing the problem.

Do a speed test with the computer connected to the internet. If you have a slower speed than other computers on the same network, that could be the issue. A good internet connection should be over 100 Mbps. Try booting the computer in safe mode with networking, and then try the ping and speed test again. If the speed improves, that is a good indication that some software is causing the issue. Look through your installed programs for anything that is not required, and uninstall it.

Hope this is also helpful!
