TCP Connection Closed if Previous Error?

CoastalMaineBird · ‎09-25-2003

My app is divided into a MASTER part and a SLAVE part.

It will run on several computers in a cluster.

The user submits (via TCP) models for simulation to one of the MASTERS.

The MASTER divides up the model, dispatches various tasks to the various SLAVES, coordinates results returned by the SLAVES and reports files back to the client.

The SLAVE polls the MASTERS for tasks to do, obtains the model data from them, performs the model computations, and reports results back to the MASTER that requested the work.

The computations are huge; a 1000x1000 complex double-precision matrix inversion is only a piece of a single task, and each task must be run at 300 RPM, 305 RPM, 310 RPM, etc. A single computer might take 24 hours or more to run a job, hence the effort to spread the work out among several computers.

So there are a lot of TCP connections opened and closed. The SLAVEs poll the MASTERs on one port, fetch models on another, report results on another. Altogether there are 15-20 threads running.

The problem I have is that the system gradually degrades. At first everything runs fine, but eventually I get a TCP TimeOut error (56) on either the polling or the results transmission channel. (This is on a test with only one computer, so the SLAVE is talking to the MASTER on the same machine, but it doesn't know it.) Then a few minutes later, another. And more, and more, faster and faster. (It writes a log file which documents this.) This happens in a big job, or if I submit 10 small jobs, it will happen somewhere in the middle. If I quit the program and start again, all is fine for a while, then it starts to degrade again.

If the job finishes, the program is idle except for the SLAVE polling the MASTER every 10 seconds or so. But even THAT doesn't work after an error-prone job - it will fail to make a connection from one part of the program to another, even with no heavy computations competing for the CPU.

I normally use a timeout of 200 mSec; even 5000 mSec doesn't remove the problem.

I have the feeling that some resource is being consumed as the program progresses. I don't think it's RAM, because the Windows Task Manager shows no more than 150 Meg used (out of 512 Meg available).

In staring at the code, I came up with this question, and I wonder if anyone can answer:

I typically use TCP OPEN, TCP WRITE, TCP READ, and TCP CLOSE in a chain - the ERROR OUT is tied to the ERROR IN of the next operation. It occurs to me that if an error occurs in TCP WRITE, then the error will prevent TCP CLOSE from closing - is that right?

I know that the FILE CLOSE operation closes the file regardless of the ERROR IN status, but the TCP CLOSE has no such statement in the docs.

Should I remove the ERROR IN connection to TCP CLOSE, to ensure that the connection is closed regardless?
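For readers more used to text languages, the concern above (an error in the write step preventing the close step) can be sketched in Python. This is only an illustration of the "always close" pattern, not a statement about how LabVIEW's TCP CLOSE behaves; the function name send_request and its parameters are hypothetical.

```python
import socket


def send_request(host, port, payload, timeout=5.0):
    """Open, write, read, and always close, even if a step fails.

    Hypothetical helper: mirrors an OPEN -> WRITE -> READ -> CLOSE chain.
    """
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(payload)      # the "TCP WRITE" step
        return sock.recv(4096)     # the "TCP READ" step
    finally:
        # Runs whether or not sendall/recv raised, so an earlier error
        # in the chain can never leave the connection open.
        sock.close()
```

The close sits in a finally clause, so it does not depend on the error status of the steps before it.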

Steve Bird
Culverson Software - Elegant software that is a pleasure to use.
Culverson.com

Blog for (mostly LabVIEW) programmers: Tips And Tricks

Brian_Beal · ‎09-26-2003

I am not an expert in this area, but have done a couple of LabVIEW applications using TCP functions. One thing I've found is that the TCP Read function is very sensitive to the type of read you have selected, and depending on the type of read, the number of bytes. I initially had a problem similar to yours. I had the read function within a loop to constantly get data from a motion controller. If the connection were lost for a moment, it would not reconnect without another open function. I watched the error cluster, and if one occurred, I closed that connection and opened another one. You may want to have some logic like that. You could make sure that if your read or write didn't work the first time, you could try again on a different handle (port).
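The "close that connection and open another one" logic described above might look like the following in Python. It is a rough sketch under assumed names (request_with_retry, attempts, delay are all hypothetical), not the code used in either application.

```python
import socket
import time


def request_with_retry(host, port, payload, attempts=3, delay=0.5, timeout=5.0):
    """Try a request; on any socket error, drop the handle and open a new one."""
    last_error = None
    for _ in range(attempts):
        try:
            # The with-block closes the socket on success and on error,
            # so each attempt starts from a fresh connection.
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(payload)
                return sock.recv(4096)
        except OSError as exc:  # includes timeouts and refused connections
            last_error = exc
            time.sleep(delay)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

Each retry uses a new connection rather than reusing a handle that has already reported an error.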

You may want to view which ports are open on your system. If the number grows dramatically while your application is running, then that may be an issue.

You should also be concerned that the TCP Write function is producing errors. The TCP/IP protocol is fairly error proof, with regards to traffic on the network. The issues that I ran into were that the receiving device would eventually have no ports available to open up, or there was a physical problem with the hub or wiring.

CoastalMaineBird · ‎09-26-2003

"I watched the error cluster, and if one occurred, I closed that connection and opened another one. You may want to have some logic like that."

In effect, that's what I'm doing - the slave opens a connection, writes to it, and closes it, checking for an error. If an error occurs, I attempt the operation again, some time later.

"You may want to view which ports are open on your system."

How can I best do that? I have used TCPView from SysInternals - it shows a lot of connections (to the ports I am using: 1100-1106) in the CLOSE_WAIT state - what does that mean? A lot of other connections are in the FIN_WAIT2 state - what does that mean?

Those go away (from the TCPView list) when I quit my program. However, there are a LOT of connections left in the TIME_WAIT state after I quit - what does that mean?
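For reference, in standard TCP terms: CLOSE_WAIT means the peer has closed its end but the local application has not yet closed its own socket; FIN_WAIT2 (FIN_WAIT_2 in some tools) means the local end has closed and is waiting for the peer to close; TIME_WAIT is the holding period the operating system applies to the side that closed first before the endpoint is released. One way to watch whether these counts grow is to tally them from netstat output, as in the Python sketch below. The function names and the 1100-1106 default are assumptions taken from the ports mentioned above, and the parsing assumes the plain column layout of `netstat -an` (state in the last column).

```python
import subprocess
from collections import Counter


def count_states(netstat_text, port_range=(1100, 1106)):
    """Tally TCP states for connections touching the given port range."""
    lo, hi = port_range
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Keep only TCP rows: proto, ..., local addr, remote addr, state
        if len(fields) < 4 or not fields[0].lower().startswith("tcp"):
            continue
        state, addrs = fields[-1], (fields[-3], fields[-2])
        ports = [int(a.rsplit(":", 1)[-1]) for a in addrs
                 if a.rsplit(":", 1)[-1].isdigit()]
        if any(lo <= p <= hi for p in ports):
            counts[state] += 1
    return counts


def snapshot():
    """Run netstat once and return the tally (assumes netstat is on PATH)."""
    out = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
    return count_states(out)
```

Calling snapshot() every minute or so and logging the result would show whether one state keeps accumulating while the program runs.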

I just ran a job and aborted it - there were no errors in the log - everything worked fine. But the nature of the symptoms makes me think it's a resource being consumed.

I ran a full case last night - it completed OK, but the errors started after about four hours of running, and got more and more frequent as it progressed. I had changed all my CLOSE CONNECTION operations to NOT look at the ERROR IN cluster - it made no difference.

UPDATE: All the TIME_WAIT connections just turned red (in TCPView) and disappeared - about 5 minutes after my program stopped.

I am considering re-vamping the comm scheme to open a connection between a master and a slave and keep it open. That increases the amount of traffic for me, but reduces the number of open/closes.

Do you have any opinion on that?
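As a point of comparison, the keep-it-open scheme under consideration could be sketched in Python like this. The class name PersistentLink and the newline-terminated message framing are assumptions made for the example; the thread does not say how messages are framed.

```python
import socket


class PersistentLink:
    """One long-lived connection reused for many request/reply exchanges."""

    def __init__(self, host, port, timeout=5.0):
        self.sock = socket.create_connection((host, port), timeout=timeout)
        self.reader = self.sock.makefile("rb")  # buffered line reader

    def ask(self, message):
        # Assumed framing: each message and each reply ends with a newline.
        self.sock.sendall(message + b"\n")
        return self.reader.readline().rstrip(b"\n")

    def close(self):
        self.reader.close()
        self.sock.close()
```

With this arrangement a poll every 10 seconds reuses the same socket, so no new endpoint is created or torn down per poll; the cost is that each side must detect and recover from a dropped link itself.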

"The issues that I ran into were that the receiving device would eventually have no ports available to open up, or there was a physical problem with the hub or wiring."

Given that I'm sending from one process to another ON THE SAME MACHINE, I don't think a physical problem is it.

I actually don't understand the port usage, and why, when I ask for a connection on port 1101, the other end is at port 4357 or something.
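On the port numbers: the port asked for when opening a connection is the port the listener is bound to, while the connecting side is given a temporary (ephemeral) local port by the operating system. That is why a connection to 1101 shows something like 4357 at the other end. The small Python demonstration below makes both ends visible; show_endpoints is a hypothetical name, and the listener lets the OS pick its port so the example does not collide with anything already running.

```python
import socket


def show_endpoints():
    """Connect to a local listener and report the port seen at each end."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # 0 = OS picks a free port for the demo
    listener.listen(1)
    server_port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", server_port))
    accepted, peer = listener.accept()
    try:
        return {
            "listening_port": server_port,                  # the port requested
            "client_local_port": client.getsockname()[1],   # ephemeral, OS-chosen
            "client_remote_port": client.getpeername()[1],
            "server_sees_peer_port": peer[1],
        }
    finally:
        accepted.close()
        client.close()
        listener.close()
```

Every new connection gets a fresh ephemeral port, so frequent open/close cycles show up as a steadily advancing sequence of port numbers.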

CoastalMaineBird · ‎09-26-2003

Just ran another test with TCPView set to SHOW UNCONNECTED ENDPOINTS.
With my program running, but idle, the slave polls the master every 10 seconds. In effect, this is "do you have any tasks for me?", and the answer is "no".

In the TCPView window, I see a new entry every 10 seconds, from port 1101 (the one I specified) to port 4560, 4561, 4562, etc. - there are over a hundred in the list (only if I show Unconnected Endpoints), and ever-expanding.

Is THIS the resource I am consuming? If so, I don't understand how - I'm sure that I'm closing the connection every time. I'm not using the ABORT option on TCP Close, but the docs say it's ignored anyway.

