10-18-2010 10:56 AM
All,
I'm currently trying to debug an issue where my cRIOs freeze. I have a fairly large VI that I've been running on one cRIO for a while, and I've just deployed that same code to 4 different cRIOs. For some reason, usually within about 20 minutes, 2 of them will hang, and the others will hang shortly after (within an hour or two). By hang I mean completely unresponsive via MAX, FTP, Distributed System Manager, etc.
I've tried running my code in development mode to see what locks up, but the LabVIEW environment simply disconnects from the real-time target.
There doesn't appear to be any useful information in the real-time error log anyway.
Any suggestions? I'm going through and disabling my loops one by one to find the problematic one.
Regards,
Ken
10-18-2010 11:07 AM
All,
This is the software stack I have on the system.
Regards,
Ken
10-18-2010 12:10 PM
Ken,
If you use MAX to view the 'Network Settings', is the box next to 'Halt system if TCP/IP fails' checked?
10-18-2010 03:21 PM - edited 10-18-2010 03:26 PM
Have you tried monitoring the CPU and memory usage?
Are you using NSVs (network-published shared variables)? Check the DSM (Distributed System Manager) to verify that all RT libraries are deployed.
Are your NSVs static or dynamic? Could an NSV URL have the wrong IP address? I like to wire up all my error clusters and then trap anything unusual and have it generate an RT fault that can be monitored by the DSM.
10-18-2010 06:13 PM
This sounds like the classic symptom of RT system 100% CPU utilization.
In both PharLap and VxWorks, if CPU use is pegged at 100%, the RTOS will begin a sort of "load shedding" and will suspend various threads and processes to try to maintain the determinism of the code. One of the first threads to get dumped is the TCP/IP thread. This essentially leaves your RT system deaf to the rest of the world until CPU usage drops back below 100%. When debugging your code, this may mean that you need to fully reset the system to regain communication.
As a debugging tool, try adding wait statements to some of your code to slow things down a bit and see whether the problem goes away. Often a single loop is the cause of this thread-starvation issue.
It seems counterintuitive that adding wait statements could speed your code up, but given the way RT schedules execution, this may be exactly what is required to give the other parts of your code, and the RTOS itself, time to execute.
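To illustrate the point above (in Python rather than LabVIEW, since block diagrams can't be posted as text): a free-running polling loop consumes nearly all of the CPU time it runs for, while the same loop with a short wait, analogous to dropping a Wait (ms) function into a LabVIEW while loop, consumes almost none. The function names and the 10 ms wait value are just illustrative choices.

```python
import time

def poll_busy(duration_s):
    """Free-running loop: burns CPU doing nothing but checking a condition."""
    start = time.process_time()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        pass  # e.g. repeatedly checking a flag or status value
    return time.process_time() - start  # CPU seconds actually consumed

def poll_with_wait(duration_s, wait_s=0.01):
    """Same loop with a short wait: yields the CPU between checks."""
    start = time.process_time()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        time.sleep(wait_s)  # analogous to a Wait (ms) in the loop
    return time.process_time() - start

busy_cpu = poll_busy(0.5)
waiting_cpu = poll_with_wait(0.5)
print("busy loop CPU:", busy_cpu, "waiting loop CPU:", waiting_cpu)
```

The busy loop uses roughly its full wall-clock time in CPU, while the waiting loop uses a small fraction of it, which is the headroom the RTOS and your other loops need.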
10-19-2010 05:37 AM
I have had, and am still having, the same problem on a 9012 controller.
For many reasons the communication is through direct TCP/IP (rather than network-published shared variables).
I'm replying to this thread with a description of the issues I'm having at the moment, as they seem very similar and may help your fault-finding along. I'm convinced my problems are an NI issue, because all I've done is a platform upgrade of very reliable software. Of course, you may not yet be in a position to say that, as it could be caused by various other things (I've been through all of them..).
I first had a problem 3 years ago, when I first started using 9012 controllers with LabVIEW 8.2 or 8.6 and RIO 3.0(?). Systems just locked up and stayed disconnected, despite a watchdog implemented in the FPGA that activates a hardware reset if nothing happens for a while. The solution that time was a VxWorks upgrade.
Everything has been extremely reliable since then on 8.6; however, I've decided it's time to upgrade to 2010, and very similar faults have returned. The system suddenly stops and the TCP/IP appears to hang.
Most things disconnect, although the TCP/IP connection controlled in the software doesn't drop immediately; it becomes unresponsive. The connections through MAX and FTP stop. It may or may not respond to a ping.
I've been using the NI Distributed System Manager to view the CPU and memory statistics, and this is more likely to remain connected. The CPU usage chart shows a sudden drop from 50-70% usage in normal running to 5%. The baseline with no rtstart app running is <2%.
Because I've been watching the usage chart the whole time, I know that the system didn't get anywhere near 100% usage; in fact, if I force it to run near 100% it can recover, unless the other issue disconnects it first.
I've also been running selective subsets of the program with the DMA transfer from the FPGA and the analogue processing disabled, and I still get the same issue. I've also converted all shared variables back to old-style globals; this improved performance, but the system still crashed out. In my mind the likely candidate is the OS.
Hope this helps the thread along - I'm desperate to get a solution quickly too.
Lucy
10-19-2010 08:01 AM
There is another issue that exists with TCP connections and RT systems if you are using dynamic port assignment to connect to the system.
If you are using dynamic port assignment on the RT side AND are performing many connect/disconnect operations, then there is the possibility that the system will run out of port numbers and fail to allow further connections to the system (CAR 242786).
In essence, what seems to be happening is that when a TCP socket is broken or closed, the port number is not returned to the list of available ports in the dynamic pool. After roughly 230 dynamic port assignments and disconnects, the RT system no longer has any port numbers left in the dynamic pool and fails to create a listener.
I am not sure whether this is related to your issue, but it presents itself much like the CPU thread-starvation "feature" in RT, except that the CPU may not be pegged at 100%.
The issue is present in LV2009, and I would suspect it is in LV2010 as well, since the CAR was filed after the 2010 release.
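For anyone unfamiliar with what "dynamic port assignment" means here: binding a socket to port 0 asks the OS to pick a port from its ephemeral pool. The sketch below (Python on a desktop OS, not LabVIEW RT) runs about 300 open/close cycles, more than the ~230 that reportedly exhausts the pool on the affected RT targets. On a healthy TCP stack the ports are recycled and all 300 cycles succeed; the CAR describes exactly this pattern failing on RT. The helper name is my own.

```python
import socket

def grab_ephemeral_port():
    """Bind to port 0 so the OS assigns a port from the dynamic pool."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen(1)
    port = s.getsockname()[1]
    s.close()  # on a healthy stack, the port eventually returns to the pool
    return port

# ~300 assign/close cycles; per CAR 242786, the affected RT systems
# reportedly stop creating listeners after roughly 230 of these.
ports = [grab_ephemeral_port() for _ in range(300)]
print(len(ports), "cycles completed; all ports ephemeral:",
      all(p > 1023 for p in ports))
```

If the RT bug is in play, the equivalent of `grab_ephemeral_port` would start returning errors partway through, even though total CPU load is low.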
10-19-2010 11:36 AM
Thanks for your answer. It still looks to me that our two problems are similar enough to continue in this thread.
I don't think the port allocation is the issue as in 8.6 it runs fine and the connections never drop once the initial connection is made.
In fact, at the moment I still have an open, two-way connection between the PC application and the RIO, but no connection via MAX or FTP. Eventually the other connection drops out too.
Even so, it would help if someone could explain in more detail the difference between dynamic port allocation and the alternative, and whether there is a diagrammatic solution to this.
Also, I was thinking that our problems may be due to a 'dirty' upgrade. Before installing 2010 through MAX I removed all programs, then reinstalled from a clean base. Is it worth going further and reformatting the RIO? If so, can someone point me to a help or knowledge-base article explaining this?
By the way, I had the same behaviour when I tried upgrading to LabVIEW 2009, but we went back to 8.6 as we needed to ship the RIO system, and it behaved itself.
Lucy
10-19-2010 01:23 PM
Dynamic port allocation is not totally dynamic; the basics of how it works are described below.
For the sake of this discussion, our Host will be the RT system and the Client will be your Windows system, although this approach really applies to any architecture.
1. The Host creates a listener on a defined port and waits. The port number is explicitly defined in the TCP Create Listener function; for the sake of discussion, let's say port 2020.
2. The Client opens a TCP connection to the Host on port 2020.
3. Upon detecting the new connection, the Host runs TCP Create Listener again with no port number wired in (this causes the port number to be dynamically assigned).
4. This new port number is then transmitted from the Host to the Client over the socket connection on the known port (2020).
5. Upon receipt of the dynamic port number, the Client closes the connection on port 2020 and opens a new socket connection to the port number transmitted in step 4.
6. When the Host sees the Client close the socket on port 2020, it also closes the connection and runs TCP Create Listener again on port 2020 (the Host is now ready for another dynamic connection).
7. All further communication between the Host and Client is carried out over the dynamically assigned socket connection.
The advantage of this approach is that the defined port connection (2020) is only open for a very short period while the dynamic port number is obtained and transmitted. After that, all communication is on the dynamically assigned socket connection. This precludes losing the ability to connect to the system if the socket connection is lost and goes into a TIME-WAIT state. Once in TIME-WAIT, you will be unable to reconnect to that port for twice the maximum segment lifetime (about 2 minutes).
This approach is especially important when dealing with RT systems, as VxWorks and PharLap will lock out the port as described above if the connection is broken, either by a loss of network communications or because the Host closes the socket first. You can take a look at RFC 793 for a detailed explanation of how connections are made and broken.
Windows implements a feature that allows a connection to transition directly from the TIME-WAIT state back to active, so dynamic port assignment on a Windows host is not really required.
Another benefit of dynamic port assignment is that your Host then has the capability to serve multiple clients simultaneously.
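The seven steps above can be sketched with plain sockets (Python here, since a LabVIEW block diagram can't be posted as text). The well-known port is bound to 0 in this demo purely so it never collides with anything on the test machine; in a real deployment it would be a fixed number like the 2020 used above. The message text and function names are illustrative.

```python
import socket
import threading

def host_side(shared):
    # Step 1: listener on the "well-known" port (fixed, e.g. 2020, in a
    # real deployment; port 0 here only so the demo never collides).
    known = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    known.bind(("127.0.0.1", 0))
    known.listen(1)
    shared["known_port"] = known.getsockname()[1]
    shared["ready"].set()

    conn, _ = known.accept()                 # step 2: Client connects
    dyn = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    dyn.bind(("127.0.0.1", 0))               # step 3: dynamic listener
    dyn.listen(1)
    conn.sendall(str(dyn.getsockname()[1]).encode())  # step 4: send port #
    conn.close()                             # step 6: well-known socket freed
    data_conn, _ = dyn.accept()              # step 7: real session begins
    data_conn.sendall(b"hello over dynamic port")
    data_conn.close()
    dyn.close()
    known.close()

shared = {"ready": threading.Event()}
t = threading.Thread(target=host_side, args=(shared,))
t.start()
shared["ready"].wait()

# Client side of the handoff:
c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.connect(("127.0.0.1", shared["known_port"]))       # step 2
buf = b""
while True:                                          # step 5: learn port #
    chunk = c.recv(16)
    if not chunk:
        break
    buf += chunk
c.close()
c2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c2.connect(("127.0.0.1", int(buf.decode())))         # step 5: reconnect
chunks = []
while True:
    chunk = c2.recv(64)
    if not chunk:
        break
    chunks.append(chunk)
msg = b"".join(chunks)
c2.close()
t.join()
print(msg.decode())
```

Note that the well-known listener is back in place (or closed cleanly here) as soon as the port number handoff completes, so a dropped data connection stuck in TIME-WAIT never blocks the next client from getting in.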
11-03-2010 07:34 PM
Has there been any additional information on this potential issue? I am also having intermittent lockup issues in a large cRIO application (cRIO-9022 with LV 2009 SP1) that seem eerily similar. However, I am not using dynamic port allocation for my TCP/IP connection.
Rob