Bug 2341 - Client should be more conservative about closing sockets
Status: RESOLVED FIXINSOURCE
Product: pkg
Component: transport
Version: unspecified
Hardware: ANY/Generic All
Importance: P3 normal
Target Milestone: ---
Assigned To: johansen
QA Contact: pkg/transport watcher
 
Reported: 2008-06-26 13:53 UTC by johansen
Modified: 2009-07-01 16:40 UTC

Description johansen 2008-06-26 13:53:43 UTC
Danek and I discussed this a bit last week.  This is my attempt to capture the
relevant portions of the discussion, both for posterity and so that we remember
to work on this.

We observed that many of the timeout errors that we have been seeing seem to
occur during a connect().  The HTTP 1.1 specification allows us to keep our
connections to the server open, but the Python libraries that we're using
aren't so good about this.  Our network connections are frequently established
by leaf routines.  If they don't return a reference to the network connection,
it gets closed when the object goes out of scope.  Often the routines that
receive the file-object from the method that established the connection let the
object go out of scope and be garbage-collected, or close that object
explicitly.  This is an attempt not to exceed the limit on open file
descriptors on the machine, and an implicit confession that we're not presently
doing a good job of reusing our open connections.

It would take a rather substantial re-structuring to improve our connection
re-use, but it seems like that would be worthwhile.  There's a twofold
performance improvement to be had here.  First, we eliminate a bunch of
unnecessary traffic to do TCP handshakes.  Second, we reduce the possibility
that one of these handshakes will fail and cause us to time out.  We also
potentially get the benefit of faster file transfers due to a larger congestion
window.  Short-lived TCP connections might not expand their window much beyond
what's given as part of slow-start.

If we can find a way to keep our network connections open and use them to
handle multiple requests, we can potentially solve a bunch of these problems.
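
As a rough illustration of the idea above (not the code that was eventually
integrated), the sketch below keeps one HTTP/1.1 connection alive and reuses it
for several requests, so connect() runs once rather than once per request.  It
assumes Python 3's http.client; the class name PersistentClient and its
"connects" counter are hypothetical and exist only for this example.

```python
import http.client


class PersistentClient:
    """Hypothetical helper: owns one HTTP/1.1 connection and reuses it."""

    def __init__(self, host, port=80, timeout=30):
        self.host = host
        self.port = port
        self.timeout = timeout
        self.conn = None
        self.connects = 0  # number of times a TCP connect() was needed

    def _connection(self):
        # Open the socket lazily, and only if we do not already hold one.
        if self.conn is None:
            self.conn = http.client.HTTPConnection(
                self.host, self.port, timeout=self.timeout)
            self.conn.connect()
            self.connects += 1
        return self.conn

    def get(self, path):
        conn = self._connection()
        try:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be read completely before the same
            # connection can carry the next request.
            body = resp.read()
        except (http.client.HTTPException, OSError):
            # Drop the broken connection; the next get() reconnects.
            self.close()
            raise
        # Respect a server that declines keep-alive.
        if resp.version < 11 or resp.getheader("Connection", "").lower() == "close":
            self.close()
        return resp.status, body

    def close(self):
        if self.conn is not None:
            self.conn.close()
            self.conn = None
```

The point of the design is ownership: the object that opened the connection
keeps the reference, so the socket is not closed merely because a leaf routine
returned.  Callers hold one PersistentClient per server and call get()
repeatedly, then close() when they are done.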
Comment 1 johansen 2008-08-27 14:34:59 UTC
Re-categorizing at Dan's request.
Comment 2 Dan Price 2008-08-28 01:47:15 UTC
Johansen and I talked this over a bit today.  A potential downside of this is
that we may wind up with clients "hogging" server threads for a lot longer than
they do today.  Today, if the servers get overloaded with too many clients, the
clients will tend to all slow down as they have to wait in line to make their
filelist requests.

So moving to a system where a client may wire down a server thread for a long
period of time could induce new problems.  Not to say that we shouldn't do it,
but we should try to understand the risks.
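
One way to reason about this risk is to bound how long a client may hold a
connection.  The sketch below is only an illustration of such a policy, not
something proposed in this bug: the class ConnectionBudget, its limits, and
their default values are all hypothetical.

```python
import time


class ConnectionBudget:
    """Hypothetical policy: when should a kept-alive connection be given up?"""

    def __init__(self, max_idle=5.0, max_requests=100, clock=time.monotonic):
        self.max_idle = max_idle          # seconds allowed between requests
        self.max_requests = max_requests  # requests allowed per connection
        self.clock = clock                # injectable so it can be tested
        self.requests = 0
        self.last_used = clock()

    def record_request(self):
        # Call after each request completes on the connection.
        self.requests += 1
        self.last_used = self.clock()

    def should_close(self):
        # Give the server thread back once the connection has sat idle
        # too long or has served its share of requests.
        idle = self.clock() - self.last_used
        return idle >= self.max_idle or self.requests >= self.max_requests
```

A client would check should_close() before reusing a connection and reconnect
if it returns True, trading an occasional extra handshake for a limit on how
long any one client can occupy a server thread.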
Comment 3 johansen 2009-06-23 13:31:53 UTC
This bug is being fixed as part of the transport re-design.  A preliminary
webrev is available from:

http://cr.opensolaris.org/~johansen/webrev-xport-1/
Comment 4 johansen 2009-07-01 16:40:01 UTC
Integrated 1Jul2009 as change set a48bee2a4b2e9c8345c29acea63116acf77dddb3