Bug 948 - image.get_link_actions() the cause of millions of syscalls during a fresh install
: image.get_link_actions() the cause of millions of syscalls during a fresh ins...
Status: CLOSED FIXED
Product: pkg
cli
: unspecified
: ANY/Generic Solaris
: P3 normal (vote)
: ---
Assigned To: Bart Smaalders
: pkg/cli watcher
:
: zones
:
:
:
  Show dependency treegraph
 
Reported: 2008-04-01 23:36 UTC by Dan Price
Modified: 2008-04-09 10:17 UTC (History)
1 user (show)

See Also:


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description Dan Price 2008-04-01 23:36:48 UTC
While doing work on IPS I discovered a large number of system calls being made
during the "creating plan" phase of a fresh installation of the
'redistributable' cluster of packages.

I used the following crude dtrace script to track down the problem:

-----
#!/usr/sbin/dtrace -Fs

python$1:::function-entry
/copyinstr(arg1) == $$2/
{
        self->t=1;
        calls++;
}

python$1:::function-return
/copyinstr(arg1) == $$2 && self->t/
{
        self->t=0;
}

syscall:::entry
/self->t/
{
        @a[probefunc]=count()
}

END
{
        printf("calls: %d", calls);
}
-----

No points for elegance but it works.

Successive guesses allowed me to narrow this down to the get_link_actions()
routine.  Here's an example of its system call impact during the following
sequence:

# rm -fr /tmp/foo
# export PKG_IMAGE=/tmp/foo
# pkg image-create -a foo=http://ipkg.eng /tmp/foo
# pkg install -n redistributable

I ran the D script as follows:

$ dtrace -s /s4pyfunc.d `pgrep client` get_link_actions

And once the command was done, got this output:

CPU     ID                    FUNCTION:NAME
  0      2                             :END calls: 880

  brk                                                              26
  stat64                                                       774400
  close                                                        775280
  fcntl                                                        775280
  fsat                                                         775280
  fstat64                                                      775280
  getdents64                                                  1553200


So here we can see that something is amiss-- this routine is the source
of 4.7 *million* syscalls.

Here's the code:

        def get_link_actions(self):
                """ return a dictionary of hardlink action lists indexed by
                target """
(a)             if self.link_actions:
                        return self.link_actions

                d = {}
                for act in self.gen_installed_actions():
                        if act.name == "hardlink":
                                t = act.get_target_path()
                                if t in d:
                                        d[t].append(act)
                                else:
                                        d[t] = [act]

(b)             self.link_actions = d
                return d


When Bart and I first looked at this, it seemed by inspection to be working as
designed-- the first time the routine is called, we populate a cache, and
subsequently the cache is returned.  At some point Bart went "Aha!" and we
realized that the problem here was that the comparison at (a) wasn't working
right, and that, for a not-yet-installed image, the link_actions dictionary is
exactly equal to {}, the empty dictionary.  That evaluates to "false" in
python.  So we always think that we haven't yet filled out the link_actions
cache, and go to fill it.  That sends us into gen_installed_actions(), which is
expensive.  It seems to get more and more expensive as we accumulate more
manifests, as well.

So, while there are exactly zero link actions we need to worry about, we rescan
all manifests for every new manifest we evaluate.

This is a situation which will crop up most often with zones, since that's the
place we're most likely to create new images from scratch.

The fix is very simple: set self->link_actions to None initially (in the object
constructor), and change the comparison at (a) to test against None explictly.

Having fixed this, the impact of get_link_actions is:

-----
CPU     ID                    FUNCTION:NAME
  0      2                             :END calls: 880

  stat64                                                            3
  close                                                             4
  fcntl                                                             4
  fsat                                                              4
  fstat64                                                           4
  getdents64                                                        8
-----

Or, basically, nothing.
Comment 1 Dan Price 2008-04-01 23:50:27 UTC
I forgot to add that in some informal testing I got these numbers with
ptime(1):

Before:

real     1:26.730
user     1:10.988
sys        11.624

After:

real     1:12.932
user     1:02.475
sys         6.361


Or about a 19% improvement in real time.  Not a huge win, but not too shabby
either.
Comment 2 Bart Smaalders 2008-04-09 10:17:14 UTC
Fix is in 3a857fd8b787, 402042d6c744.