Bugzilla – Bug 2341
Client should be more conservative about closing sockets
Last modified: 2009-07-01 16:40:01 UTC
Danek and I discussed this a bit last week. This is my attempt to capture the relevant portions of the discussion, both for posterity and so that we remember to work on this.

We observed that many of the timeout errors that we have been seeing seem to occur during a connect(). The HTTP 1.1 specification allows us to keep our connections to the server open, but the Python libraries that we're using aren't so good about this.

Our network connections are frequently established by leaf routines. If they don't return a reference to the network connection, it gets closed when the object goes out of scope. Often the routines that receive the file object from the method that established the connection either let the object go out of scope and be garbage-collected, or close it explicitly. This is an attempt not to exceed the limit on open file descriptors on the machine, and an implicit confession that we're not presently doing a good job of reusing our open connections.

It would take a rather substantial restructuring to improve our connection reuse, but it seems like that would be worthwhile. There's a twofold performance improvement to be had here. First, we eliminate a bunch of unnecessary traffic to do TCP handshakes. Second, we reduce the possibility that one of these handshakes will fail and cause us to time out. We also potentially get the benefit of faster file transfers due to a larger congestion window. Short-lived TCP connections might not expand their window much beyond what's given as part of slow start.

If we can find a way to keep our network connections open and use them to handle multiple requests, we can potentially solve a bunch of these problems.
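To make the handshake cost concrete, here is a minimal sketch of the difference between opening one connection per request and reusing a single HTTP 1.1 connection. It is not the client's actual transport code: it uses only Python 3's standard http.client and http.server modules, and every name in it (CountingHandler, CountingServer, fetch_all, count_connections, the /fileN paths) is hypothetical and exists only for this illustration. The throwaway local server counts how many TCP connections it accepts.

```python
import http.client
import http.server
import threading


class CountingHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that speaks HTTP/1.1 so connections can persist."""
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        # Content-Length lets the client find the end of the response
        # without the server having to close the socket.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


class CountingServer(http.server.ThreadingHTTPServer):
    """Hypothetical server that counts accepted TCP connections."""
    daemon_threads = True

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accepted = 0

    def get_request(self):
        conn = super().get_request()  # one accept() per new client connection
        self.accepted += 1
        return conn


def fetch_all(host, port, paths, reuse):
    """Fetch each path, over one persistent connection or one per request."""
    conn = None
    bodies = []
    for path in paths:
        if conn is None:
            conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", path)
        resp = conn.getresponse()
        bodies.append(resp.read())  # drain fully so the socket can be reused
        if not reuse:
            conn.close()  # forces a fresh connect() for the next request
            conn = None
    if conn is not None:
        conn.close()
    return bodies


def count_connections(reuse, n=5):
    """Return how many connections the server accepted for n requests."""
    server = CountingServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        fetch_all(host, port, ["/file%d" % i for i in range(n)], reuse)
        return server.accepted
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("per-request connections:", count_connections(reuse=False))
    print("persistent connection:  ", count_connections(reuse=True))
```

The important detail is that the caller owns the connection object and passes it through the whole loop, rather than letting a leaf routine create it and drop it. Each response also has to be read to the end before the next request is sent on the same connection.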
Re-categorizing at Dan's request.
Johansen and I talked this over a bit today. A potential downside of this is that we may wind up with clients "hogging" server threads for a lot longer than they do today. Today, if the servers get overloaded with too many clients, the clients will tend to all slow down as they have to wait in line to make their filelist requests. So moving to a system where a client may wire down a server thread for a long period of time could induce new problems. Not to say that we shouldn't do it, but we should try to understand the risks.
This bug is being fixed as part of the transport re-design. A preliminary webrev is available from: http://cr.opensolaris.org/~johansen/webrev-xport-1/
Integrated 1Jul2009 as change set a48bee2a4b2e9c8345c29acea63116acf77dddb3