Aborting frozen actor

nzamora · 07-31-2020

I have had an issue present itself in the last month or so that has been kicking my butt. I'm not 100% convinced it's directly AF related, but I think there is potential to solve it with AF.

Let's say I have three actors: Main, Processor, and Controller.

Main does nothing other than launch Processor and Controller.

Processor receives a message from Controller to do some image processing. After it is finished, it sends the results back to Controller via a message.

Controller sends a message to Processor and expects a response back, not as a reply message but as a completely separate message.

What has been happening is that, very rarely, Processor freezes after receiving a message. Like, completely and utterly non-responsive. In development, using highlight execution, it is just stuck in a VI that has no reason to take any longer than a few milliseconds. I've tried throwing logging between almost every little step it does. In one case, the last step of one VI is to log and the first step of the next VI is to log, and it froze between them. There is literally nothing there except a wire connecting the two VIs! I've confirmed with MGI actor tracker that Processor is indeed still running after an otherwise normal shutdown. In development, the only way to ever regain control is to end LabVIEW and reload the project. In an executable it's similar: the process never fully closes, so it has to be force-ended.

So I added a method to Main to send a message to Processor every 5 minutes to see if it is responsive. If the actor somehow just crashed or otherwise is no longer running, trying to send a message should error since the actor reference would no longer exist. No luck. Messages are happily received. So the Actor seems to still be alive, it's just frozen in the processing of a message.
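A minimal Python sketch of why a send-only check like this may not reveal the freeze (Python is used only because LabVIEW is graphical). It assumes that sending a message merely enqueues it, so the send succeeds as long as the queue exists, even when the receiving loop is stuck inside a handler. The `Actor` class, `send_only_check`, and `round_trip_check` are hypothetical names for illustration, not AF APIs.

```python
import queue
import threading


class Actor:
    """Toy actor: one thread draining one unbounded message queue."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.release = threading.Event()  # lets the demo un-stick the actor
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            kind, reply_to = self.inbox.get()
            if kind == "hang":
                self.release.wait()   # stands in for the frozen VI
            elif kind == "ping":
                reply_to.put("pong")  # only reached if the loop is alive


def send_only_check(actor):
    """Mirrors the periodic check: succeeds whenever the queue exists."""
    actor.inbox.put(("noop", None))
    return True


def round_trip_check(actor, timeout_s):
    """Responsive only if the actor actually handles the ping in time."""
    reply_to = queue.Queue()
    actor.inbox.put(("ping", reply_to))
    try:
        return reply_to.get(timeout=timeout_s) == "pong"
    except queue.Empty:
        return False


healthy = Actor()
stuck = Actor()
stuck.inbox.put(("hang", None))  # the stuck actor never finishes this message

healthy_ok = round_trip_check(healthy, timeout_s=2.0)
stuck_send_ok = send_only_check(stuck)
stuck_ok = round_trip_check(stuck, timeout_s=0.2)
stuck.release.set()  # clean up the demo thread
```

In this sketch the send-only check passes for the stuck actor, while the round-trip check with a timeout fails, which is the signal a monitor would need. Note that the reply travels as a separate message on its own queue, so the checker is the only party that ever blocks, and only for a bounded time.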

Now, as if that wasn't bad enough, we just pinned down a rash of application crashes in the field due to this freeze. It is very strange to us, though: although it's clear from the logs that the freeze had happened, the application as a whole was just sitting idle (intentionally) for anywhere from 10 minutes to 4 hours after the freeze before the crash happened.

My thought for possibly getting around this is to somehow forcibly abort just the Processor actor, since sending it a proper shutdown command cannot work. I've searched a bunch, but I could not find any confirmation of whether this is possible. Does anyone have any ideas? Did I miss any obvious troubleshooting steps? Was anything not clear?

LabVIEW 2016 f7

VDM 2016

OneOfTheDans · 08-01-2020

If I understand correctly, the freeze you're describing affects your actor and prevents it from handling any other messages. In that case, there's nothing AF can do for you, because even an "Emergency Stop" won't be seen until the freeze is resolved.

As a workaround to the freeze, I suppose you could ~~hack up~~ extend the shipping Actor Framework to store the launched Actor Core's VI ref, then send Stop/Terminate if you detect a freeze. I can only imagine that solution creating more problems than it solves.

The real solution is continuing to dig in and find why LabVIEW is hanging randomly. If you can't identify why, then even your workaround might freeze right before you send Stop/Terminate, and now you're doubly stuck.

When I've seen LabVIEW freeze like you describe, it's usually related to external IO (disk, network, GPIB, RS232, etc.). We've had issues with (unconfirmed) bugs in NI VISA where a VISA Read fails to return, fails to time out, and deadlocks LabVIEW.exe. What if your Log VI is causing the lock?

The other thing that "freezes" LabVIEW is if you accidentally call some kind of while(true) loop that never returns. This isn't really a freeze, because the code is still executing as expected, and if you fix your code it will work. Also, highlight execution (like you already tried) will eventually lead you to these.

If you've been stuck on this for a month, definitely seek help & a second set of eyes. If you can pare down your project to something shareable, the people on these forums or NI Support can look for any "code smell" that might be causing your freeze.

Dan

nzamora · 08-03-2020

@OneOfTheDans wrote:

... What if your Log VI is causing the lock?

The freeze at that particular location was happening before I added logging. I suspected the location based on higher-level logging, and confirmed it with highlight execution (the VI had the green run arrow on it for far longer than it should have) and by adding the logging. The logging VI was written by my predecessor and is used widely, so it is extremely unlikely to be the issue.

The other thing that "freezes" LabVIEW is if you accidentally call some kind of while(true) loop that never returns. This isn't really a freeze, because the code is still executing as expected, and if you fix your code it will work. Also, highlight execution (like you already tried) will eventually lead you to these.

Definitely not the case here. There were a few cases where it appears to have frozen in or around IMAQ Match Pattern 4.vi. However, this is not always the case, which is one reason it's so hard to pin down. If it appeared to be consistent, at least I would have somewhere to look at, but according to the logs, it has frozen in almost every step of this process.

paul.r · 08-07-2020

Sounds like you've got a deadlock somewhere in your code. I would focus on finding the root cause, rather than trying to force the actor to stop. (You could probably grab the VI reference to the Actor Core of the actor that is freezing and abort it, but I wouldn't spend any effort going down that path. Fix the root of the problem.)

Are you using DVRs or other shared resources between actors? Bounded queues? How about non-reentrant VIs? Do either of your actors launch other actors? Can you give a little more information about what the Processor actor does?

AristosQueue (NI) · 08-07-2020

@paul.r wrote:

Are you using DVRs or other shared resources between actors? Bounded queues? How about non-reentrant VIs? Do either of your actors launch other actors? Can you give a little more information about what the Processor actor does?

Reply Msg on both caller and nested actor?
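One way to read this question: if a caller and its nested actor each send the other a synchronous request (the role a Reply Msg plays) from inside a message handler, each one blocks waiting for a reply that the other cannot produce, because neither is servicing its own queue. The Python sketch below illustrates that pattern under stated assumptions; the `Actor` class and its `ask` and `handle_one` methods are hypothetical, and the timeout exists only so the demo terminates. Without it, both threads would wait forever.

```python
import queue
import threading


class Actor:
    """Toy actor whose "start" handler makes a blocking call to its peer."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()
        self.peer = None
        self.timed_out = False

    def ask(self, target, timeout_s):
        """Synchronous request: enqueue, then block until target replies."""
        reply_to = queue.Queue()
        target.inbox.put(("request", reply_to))
        return reply_to.get(timeout=timeout_s)

    def handle_one(self, timeout_s):
        """Handle a single message from the inbox."""
        kind, reply_to = self.inbox.get()
        if kind == "start":
            try:
                # Blocked here, this actor cannot answer the peer's request.
                self.ask(self.peer, timeout_s)
            except queue.Empty:
                self.timed_out = True


a, b = Actor("A"), Actor("B")
a.peer, b.peer = b, a
a.inbox.put(("start", None))
b.inbox.put(("start", None))

threads = [threading.Thread(target=x.handle_one, args=(0.3,)) for x in (a, b)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each actor still holds the other's unanswered request in its inbox.
pending = (a.inbox.qsize(), b.inbox.qsize())
```

Both actors time out, and each is left with one unhandled request. Replacing the blocking `ask` with a fire-and-forget send plus a separate result message, as the original post describes for Controller and Processor, avoids this particular cycle.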

Volpe_CERN · 03-24-2021

Dear all,

I'm also facing a similar issue. I'm upgrading the code of an experimental setup in which some parts must not hang, no matter what, and since we want to move to a more dynamic and asynchronous approach, I've started implementing the new system with the Actor Framework.
The problem is that some of the VIs managing these critical services do hang from time to time! The solution my colleague found while coding these VIs was to compile them as an .EXE, embed a watchdog system inside, and have a guardian VI that monitors the messages from the watchdog, terminates the process (Taskkill /F /PID pid_number) every time it gets stuck, and then relaunches it.
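The guardian pattern described above can be sketched roughly as follows in Python. This is an illustration under assumptions, not the colleague's implementation: the worker is a hypothetical child process that prints a few heartbeats and then hangs, heartbeats travel over stdout rather than whatever channel the real watchdog uses, and `Popen.kill()` stands in for `Taskkill /F`. The names `WORKER`, `launch`, and `supervise` are invented for the example.

```python
import queue
import subprocess
import sys
import threading

# Hypothetical worker: emits three heartbeats, then hangs without exiting.
WORKER = (
    "import time\n"
    "for _ in range(3):\n"
    "    print('beat', flush=True)\n"
    "    time.sleep(0.1)\n"
    "time.sleep(3600)\n"
)


def launch():
    """Start the worker and pump its heartbeat lines into a queue."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER], stdout=subprocess.PIPE, text=True
    )
    beats = queue.Queue()

    def pump():
        for line in proc.stdout:  # ends at EOF once the worker is killed
            beats.put(line.strip())

    threading.Thread(target=pump, daemon=True).start()
    return proc, beats


def supervise(max_restarts, heartbeat_timeout_s):
    """Kill and relaunch the worker whenever its heartbeat goes silent."""
    restarts = 0
    beats_seen = 0
    proc, beats = launch()
    while True:
        try:
            beats.get(timeout=heartbeat_timeout_s)
            beats_seen += 1
        except queue.Empty:
            proc.kill()  # forced termination, no cooperation required
            proc.wait()
            if restarts == max_restarts:
                return restarts, beats_seen
            restarts += 1
            proc, beats = launch()


restarts, beats_seen = supervise(max_restarts=1, heartbeat_timeout_s=1.0)
```

The key property is that the guardian never asks the worker to stop. It only observes silence and terminates the process from outside, so it still works when the worker cannot process any message at all.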

My idea was to transform these processes into independent actors, but Send (Emergency) Stop Message is useless if the actor is stuck while processing another message. While I will certainly dig into the problem that hangs the VIs in the first place (I guess it is related to some hardware I/O), I need to be able to kill an unresponsive actor; otherwise I guess I'll have to drop a good part of the architecture and go back to the .EXE method.

I tried storing the VI Server reference of the Actor Core instance, or even of Actor.vi, but calling Invoke Node -> Abort produces Error 1000, because it is not possible to programmatically abort a VI that was not launched with Invoke Node -> Run. So, while I don't like modifying the AF, I came up with the following solution: a slow polling loop inside Actor.vi that reads a Functional Global Variable and, if requested, calls the Stop function. It's ugly, but it does the job.

Is this method extremely dangerous? Do you see a better way of killing an unresponsive actor?

Thanks in advance,

Volpe

drjdpowell · 03-24-2021

I think you should go back to the separate EXE. That "Stop" function is not certain to stop a hung VI, as the VI might be hung in a DLL call (those can't be aborted). You need the big hammer of Taskkill.

I have implemented similar things (though with "Messenger Library" rather than the Actor Framework) for must-operate-unattended-for-many-months operation. I have a "Watchdog" which is a single Messenger-Library actor that is built as a separate EXE, which exchanges heartbeat messages with my main app (the two apps watch each other, and will Taskkill and restart if needed).
