Bugzilla – Bug 948
image.get_link_actions() the cause of millions of syscalls during a fresh install
Last modified: 2008-04-09 10:17:14 UTC
You need to log in before you can comment on or make changes to this bug.
While doing work on IPS I discovered a large number of system calls being made during the "creating plan" phase of a fresh installation of the 'redistributable' cluster of packages. I used the following crude dtrace script to track down the problem: ----- #!/usr/sbin/dtrace -Fs python$1:::function-entry /copyinstr(arg1) == $$2/ { self->t=1; calls++; } python$1:::function-return /copyinstr(arg1) == $$2 && self->t/ { self->t=0; } syscall:::entry /self->t/ { @a[probefunc]=count() } END { printf("calls: %d", calls); } ----- No points for elegance but it works. Successive guesses allowed me to narrow this down to the get_link_actions() routine. Here's an example of its system call impact during the following sequence: # rm -fr /tmp/foo # export PKG_IMAGE=/tmp/foo # pkg image-create -a foo=http://ipkg.eng /tmp/foo # pkg install -n redistributable I ran the D script as follows: $ dtrace -s /s4pyfunc.d `pgrep client` get_link_actions And once the command was done, got this output: CPU ID FUNCTION:NAME 0 2 :END calls: 880 brk 26 stat64 774400 close 775280 fcntl 775280 fsat 775280 fstat64 775280 getdents64 1553200 So here we can see that something is amiss-- this routine is the source of 4.7 *million* syscalls. Here's the code: def get_link_actions(self): """ return a dictionary of hardlink action lists indexed by target """ (a) if self.link_actions: return self.link_actions d = {} for act in self.gen_installed_actions(): if act.name == "hardlink": t = act.get_target_path() if t in d: d[t].append(act) else: d[t] = [act] (b) self.link_actions = d return d When Bart and I first looked at this, it seemed by inspection to be working as designed-- the first time the routine is called, we populate a cache, and subsequently the cache is returned. At some point Bart went "Aha!" and we realized that the problem here was that the comparison at (a) wasn't working right, and that, for a not-yet-installed image, the link_actions dictionary is exactly equal to {}, the empty dictionary. That evaluates to "false" in python. So we always think that we haven't yet filled out the link_actions cache, and go to fill it. That sends us into gen_installed_actions(), which is expensive. It seems to get more and more expensive as we accumulate more manifests, as well. So, while there are exactly zero link actions we need to worry about, we rescan all manifests for every new manifest we evaluate. This is a situation which will crop up most often with zones, since that's the place we're most likely to create new images from scratch. The fix is very simple: set self->link_actions to None initially (in the object constructor), and change the comparison at (a) to test against None explictly. Having fixed this, the impact of get_link_actions is: ----- CPU ID FUNCTION:NAME 0 2 :END calls: 880 stat64 3 close 4 fcntl 4 fsat 4 fstat64 4 getdents64 8 ----- Or, basically, nothing.
I forgot to add that in some informal testing I got these numbers with ptime(1): Before: real 1:26.730 user 1:10.988 sys 11.624 After: real 1:12.932 user 1:02.475 sys 6.361 Or about a 19% improvement in real time. Not a huge win, but not too shabby either.
Fix is in 3a857fd8b787, 402042d6c744.