Bug 10010 - reconfigure transport timeouts
Status: RESOLVED FIXINSOURCE
Product: pkg
Component: transport
Version: unspecified
Platform: ANY/Generic All
Importance: P2 major
Target Milestone: ---
Assigned To: johansen
QA Contact: pkg/transport watcher
Blocks: 9790
Reported: 2009-07-13 14:43 UTC by johansen
Modified: 2009-09-11 17:06 UTC (History)

Description johansen 2009-07-13 14:43:10 UTC
After switching to the new transport, some clients are getting timeouts more
frequently than one might expect.  This is because libcurl has several
different timeout-related settings:

(from the libcurl documentation)

TIMEOUT - [T]he maximum time in seconds that you allow the libcurl transfer
operation to take. Normally, name lookups can take a considerable time and
limiting operations to less than a few minutes risk aborting perfectly normal
operations.

CONNECTTIMEOUT - [T]he maximum time in seconds that you allow the connection to
the server to take. This only limits the connection phase, once it has
connected, this option is of no more use.

LOW_SPEED_LIMIT - [T]he transfer speed in bytes per second that the transfer
should be below during LOW_SPEED_TIME seconds for the library to consider it
too slow and abort.

Python's socket module uses its timeout to measure the maximum amount of time
that it will wait between packets, which is different from all of these values.
Essentially, though, we want to make sure that the client continues until it
can no longer make progress.  It seems like the best approach would be to use
CONNECTTIMEOUT, LOW_SPEED_LIMIT, and LOW_SPEED_TIME in combination.  That way,
clients connecting to an overloaded server give up after a reasonable amount of
time, and clients on an unacceptably slow connection give up and try
elsewhere.
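
To illustrate the difference between a total-transfer timeout and a low-speed
rule, here is a simplified Python model.  It is not libcurl's actual algorithm
(libcurl works from its own internal speed measurements); the function name,
the one-sample-per-second input, and the 1024 bytes/second and 30 second
values are assumptions chosen purely for illustration.

```python
def abort_second(bytes_per_second, limit, window):
    """Simplified model of a LOW_SPEED_LIMIT / LOW_SPEED_TIME style rule.

    bytes_per_second: one throughput sample per elapsed second.
    limit:  minimum acceptable rate, in bytes per second.
    window: number of consecutive slow seconds tolerated before giving up.

    Returns the second at which the transfer would be abandoned, or None
    if the transfer is never slow for `window` seconds in a row.
    """
    slow_run = 0
    for second, rate in enumerate(bytes_per_second, start=1):
        if rate < limit:
            slow_run += 1
            if slow_run >= window:
                return second
        else:
            # Any second at or above the limit resets the slow streak.
            slow_run = 0
    return None


# A slow but steady transfer that needs 85 seconds is never aborted,
# whereas a fixed 30-second total timeout would have cut it off.
steady = [200000] * 85
print(abort_second(steady, limit=1024, window=30))

# A transfer that stalls completely after 10 seconds is abandoned once
# it has made no progress for 30 consecutive seconds.
stalled = [200000] * 10 + [0] * 40
print(abort_second(stalled, limit=1024, window=30))
```

The point of the model is only that the abort decision depends on whether
progress is being made, not on how long the whole transfer takes.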
Comment 1 johansen 2009-07-13 15:40:36 UTC
Configure client to use CONNECTTIMEOUT, LOW_SPEED_LIMIT, and LOW_SPEED_TIME in
combination.  Remove use of TIMEOUT.
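
A minimal sketch of what such a configuration might look like with pycurl
follows.  This is not the integrated change; the function name and the
60 second, 1024 bytes/second, and 30 second values are assumptions for
illustration.  The module that supplies the option constants is passed in as
an argument so the function can be tried without pycurl or a network.

```python
def configure_timeouts(handle, curl, connect_timeout=60,
                       low_speed_limit=1024, low_speed_time=30):
    """Apply connection and low-speed limits to a curl easy handle.

    handle: an object with a setopt(option, value) method, such as
            pycurl.Curl().
    curl:   the module providing the option constants (normally pycurl).
    All numeric defaults are illustrative assumptions.
    """
    # Give up if the connection itself cannot be established in time.
    handle.setopt(curl.CONNECTTIMEOUT, connect_timeout)
    # Give up if throughput stays below low_speed_limit bytes/second
    # for low_speed_time seconds.
    handle.setopt(curl.LOW_SPEED_LIMIT, low_speed_limit)
    handle.setopt(curl.LOW_SPEED_TIME, low_speed_time)
    # TIMEOUT is intentionally not set, so a slow but progressing
    # transfer is not cut off by a fixed overall deadline.
    return handle


# Typical use, assuming pycurl is installed:
#   import pycurl
#   handle = configure_timeouts(pycurl.Curl(), pycurl)
```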
Comment 2 Alan Steinberg 2009-07-13 16:03:45 UTC
The error from pkg install may look something like this:

Errors were encountered while attempting to retrieve package or file data for
the requested operation.
Details follow:

1: Framework error: code: 28 reason: Operation timed out after 30000
milliseconds with 5954837 out of 16900743 bytes received
URL: 'http://ipkg.sfbay/dev'.
2: Framework error: code: 28 reason: Operation timed out after 30000
milliseconds with 5962793 out of 16900743 bytes received
URL: 'http://ipkg.sfbay/dev'. (happened 2 times)
3: Framework error: code: 28 reason: Operation timed out after 30000
milliseconds with 5946881 out of 16900743 bytes received
URL: 'http://ipkg.sfbay/dev'. 

The workaround is to set PKG_CLIENT_TIMEOUT to a higher number than the
30-second default:

export PKG_CLIENT_TIMEOUT=300
Comment 3 johansen 2009-07-13 16:19:43 UTC
(In reply to comment #2)
> The error from pkg install may look something like this:
<...>

This error message by itself isn't sufficient to determine that this problem
has occurred.  In the case that you ran into, the amount of time it would take
to transfer the data exceeded the timeout value.  We had to determine the size
of the file and the transfer rate before it was apparent that the timeout was
too low for this transfer to complete.

Other individuals who are encountering this problem should check their transfer
rate.  A PKG_CLIENT_TIMEOUT of 900 seconds is probably sufficient for most
broadband users who are at home.
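
The check described above can be worked through with the numbers from the
first error in comment #2 (5954837 of 16900743 bytes received in 30 seconds).
The helper name below is hypothetical, and the estimate assumes the transfer
rate stays constant for the rest of the download.

```python
def estimated_transfer_seconds(total_bytes, bytes_received, elapsed_seconds):
    """Estimate how long the whole transfer needs at the observed rate."""
    if bytes_received <= 0 or elapsed_seconds <= 0:
        raise ValueError("need a positive byte count and elapsed time")
    rate = bytes_received / float(elapsed_seconds)  # bytes per second
    return total_bytes / rate


# Figures taken from the first error message in comment #2.
needed = estimated_transfer_seconds(16900743, 5954837, 30)
print("observed rate: %.0f bytes/s" % (5954837 / 30.0))
print("estimated time for the whole file: %.1f s" % needed)
```

At that rate the file needs roughly 85 seconds, so a 30-second overall timeout
cannot succeed no matter how many times the transfer is retried.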
Comment 4 johansen 2009-07-13 17:48:23 UTC
A webrev for this fix is available on cr.opensolaris.org:

http://cr.opensolaris.org/~johansen/webrev-9996-2/
Comment 5 Mary Ding 2009-07-13 19:07:24 UTC
I saw this timeout error when I tried to do an AI install with osol_118 from
ipkg.sfbay/dev.

My sun4v with 6144 MB fails as follows:

installation will be performed from http://ipkg.sfbay/dev (opensolaris.org)
installation will be performed from http://ipkg.sfbay/dev (opensolaris.org)
list of packages to be installed is:
entire
SUNWcsd
SUNWcs
babel_install
list of packages to be removed is:
babel_install
slim_install
pkg: image-create: 
pkg: 0/1 catalogs successfully updated:
    1: Framework error: code: 28 reason: Operation timed out after 30000
milliseconds with 470283 bytes received
URL: 'http://ipkg.sfbay/dev'.
2: Framework error: code: 28 reason: Operation timed out after 30000
milliseconds with 473533 bytes received
URL: 'http://ipkg.sfbay/dev'.
3: Framework error: code: 28 reason: Operation timed out after 30000
milliseconds with 473259 bytes received
URL: 'http://ipkg.sfbay/dev'.
4: Framework error: code: 28 reason: Operation timed out after 30000
milliseconds with 472532 bytes received
URL: 'http://ipkg.sfbay/dev'.

Unable to initialize the pkg image area at /a
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.4/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.4/threading.py", line 638, in __exitfunc
    self._Thread__delete()
  File "/usr/lib/python2.4/threading.py", line 522, in __delete
    del _active[_get_ident()]
KeyError: 8
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.4/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.4/threading.py", line 638, in __exitfunc
    self._Thread__delete()
  File "/usr/lib/python2.4/threading.py", line 522, in __delete
    del _active[_get_ident()]
KeyError: 8
om_perform_install failed with error 114
Auto install failed
Automated Installation failed
Please refer to /tmp/install_log file for  details
[ Jul 13 18:29:26 Method "start" exited with status 95. ]


I tried multiple times and it still fails.
Comment 6 johansen 2009-07-13 19:13:33 UTC
(In reply to comment #5)
> I had seen this timeout error when I try to do AI install with osol_118 from
> ipkg.sfbay/dev.

Please see comment #3 for possible workaround.  The steps for diagnosis are
available on the pkg list:

http://mail.opensolaris.org/pipermail/pkg-discuss/2009-July/015200.html

I'd prefer to keep this bug focused on the technical issues, instead of just
having folks pile on with stack traces.  The fix for this bug is known, and out
for review, as detailed in previous comments.
Comment 7 johansen 2009-07-13 19:57:29 UTC
Integrated 13Jul2009 as change set 49cd5492effc23c57df741ae90ef87f301957000
Comment 8 Shawn Walker 2009-07-15 14:12:46 UTC
*** Bug 10037 has been marked as a duplicate of this bug. ***
Comment 9 Shawn Walker 2009-07-16 08:45:43 UTC
*** Bug 10083 has been marked as a duplicate of this bug. ***
Comment 10 Mary Ding 2009-07-16 10:20:06 UTC
I still cannot do an AI install from ipkg.sfbay as a result of this bug.  Can
someone tell me how to apply the workaround in the AI miniroot?  The whole point
of an AI install is that it is hands-off, and I cannot manually run

export PKG_CLIENT_TIMEOUT=300

for every single AI install.
Comment 11 johansen 2009-07-16 10:54:47 UTC
(In reply to comment #10)
> I still cannot do AI install from ipkg.sfbay as a result of this bug.  Can
> someone tell me how to apply the workaround in AI miniroot ? The whole point
> about doing AI install is handoff and I cannot manually do

If the workaround doesn't work for you, you'll need the fix, which was
integrated on Monday.

If you're able to manipulate how AI executes the pkg command, you can pass the
environment as an argument to exec(2) [execle, execve] or as an argument to
subprocess.Popen (for Python).

This conversation is orthogonal to the bug.  If you have further questions
about working around this, you should post to pkg-discuss@opensolaris.org or
perhaps to the list that AI uses, since I have no idea how AI works.
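
For the subprocess.Popen route, a rough sketch might look like the following.
The function name is hypothetical, and the child process here is a Python
one-liner standing in for the pkg command so that the example can be run
anywhere; how AI actually launches pkg is not known here.

```python
import os
import subprocess
import sys


def run_with_client_timeout(argv, timeout_seconds=300):
    """Run argv with PKG_CLIENT_TIMEOUT set in the child's environment."""
    # Copy the current environment so the parent process is left unchanged.
    env = dict(os.environ)
    env["PKG_CLIENT_TIMEOUT"] = str(timeout_seconds)
    proc = subprocess.Popen(argv, env=env, stdout=subprocess.PIPE)
    output, _ = proc.communicate()
    return proc.returncode, output


# Stand-in child that just reports the variable it was given.
status, output = run_with_client_timeout(
    [sys.executable, "-c",
     "import os; print(os.environ['PKG_CLIENT_TIMEOUT'])"])
print(status, output.strip())
```

With the real client, argv would be the pkg command line that the installer
already runs; only the env argument changes.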
Comment 12 Alexander Vlasov 2009-07-20 09:34:58 UTC
In the same boat. Blocker for SST testing, since no machine can be installed
via AI.
Comment 13 Mary Ding 2009-08-13 17:27:34 UTC
*** Bug 10644 has been marked as a duplicate of this bug. ***
Comment 14 Rich Burridge 2009-09-11 17:06:23 UTC
*** Bug 10799 has been marked as a duplicate of this bug. ***