Bugzilla – Bug 5003
default dumpadm config is not set to capture crash dumps
Last modified: 2009-08-12 14:32:41 UTC
Once 2008.11 is installed, /etc/dumpadm.conf contains:

DUMPADM_ENABLE=no

Any panics will therefore not be automatically saved to /var/crash, but have to be captured manually using savecore.
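For anyone wanting to check this setting on an installed system, here is a minimal sketch in Python. It assumes dumpadm.conf consists of simple KEY=value lines with optional # comments. The helper names parse_dumpadm_conf and savecore_enabled are hypothetical and are not part of the installer or of dumpadm.

```python
def parse_dumpadm_conf(text):
    """Parse KEY=value lines, skipping blank lines and # comments.

    The file format is an assumption made for this sketch.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, sep, value = line.partition('=')
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def savecore_enabled(text):
    """Return True only if DUMPADM_ENABLE is explicitly set to 'yes'."""
    return parse_dumpadm_conf(text).get('DUMPADM_ENABLE') == 'yes'


# Sample contents for illustration only, matching the value reported above.
sample = "# illustrative sample\nDUMPADM_ENABLE=no\n"
print(savecore_enabled(sample))
```

In practice the text would come from reading /etc/dumpadm.conf; the sketch takes a string so it can be tried without touching a real system.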
The installer previously was responsible for this so recategorizing. As this affects serviceability, I'm putting this as a potential blocker.
Reset Assignee and QA Contact.
I believe the fix should be to have the ICT update_dumpadm_nodename also set DUMPADM_ENABLE=yes. The name of the ICT should also change to simply update_dumpadm.
At the moment, this is regarded as not a bug. During discussion of swap and dump sizing when we were converting the installer to use zvols, it was suggested that, since we're always using a dedicated dump volume, enabling savecore is unnecessary: the dump data will not be competing with swap, and thus reboots after a panic could complete more quickly by avoiding several hundred megabytes (or much more) of I/O. That's what was implemented in the fix for bug 2161.

We should continue with this initial path and see how it goes; it can always be revisited once we have experience with it. That prior discussion is in the archives beginning with http://mail.opensolaris.org/pipermail/caiman-discuss/2008-June/004262.html

We can release note to users that if they prefer to have crash dumps saved at boot, they can do so; they might also consider instead creating a cron job which does it at some later point.
I sent the release note. Based on Dave Miner's comments, I am marking this closed/invalid.
The ZFS team views this as a bug. We need to ensure that crash dumps are collected automatically on reboot. I'm reopening this so that a fix can be implemented.
Can we get a clearer explanation of why the ZFS team views this as a bug? The reasons for the current behavior are spelled out quite clearly in comment #4. I don't see any data to suggest that we've evaluated how the experiment is working out, so I'm disinclined to change this behavior until that data is collected and evaluated. To be clear, crash dumps are not lost upon reboot. They just aren't saved automatically at boot as they were in the past.
George, input would be appreciated on why this is particularly problematic for your team. We don't regard the current behavior of dumpadm as being optimal for a ZFS-rooted system which will update frequently with clones, as the /var/crash area ends up within a BE and thus disk space consumed by the dumps will be difficult to free. I've been discussing these issues with services to arrive at better settings, as well as to consider other changes we might make which would be appropriate. At a minimum, before enabling automatic savecores I'd like to relocate them to an additional file system, such as rpool/crashdump (mounted at /crashdump) and reset dumpadm's config to point there. Alternative naming suggestions welcomed.
Speaking with members of the ZFS team and others, the primary issue is that without dumps, debugging is not possible. After a system panics, the focus is on being able to reliably collect the information. If a subsequent panic occurs, the first crash dump will be overwritten and that information will be lost, and they have no way to determine that at this point.
In addition to Sanjay's comments, we should not have to ask customers to run 'savecore' manually to collect information about a defect they just submitted. Reducing the back and forth with a customer leads to a better customer experience. We have already seen instances of defects which did not contain the corefiles, and we had to contact the customer to gather them. There have been cases where the corefile could not be obtained because the crash had been overwritten. The defect then goes stale unless it can be reproduced.
A possible fix is available but on hold as requested by Dave Miner.
One proposed fix is to update usr/src/cmd/slim-install/finish/install-finish and usr/src/lib/libict_pymod/ict.py as follows:

diff -r 81312779d945 usr/src/cmd/slim-install/finish/install-finish
--- a/usr/src/cmd/slim-install/finish/install-finish Thu Jan 29 14:52:08 2009 -0700
+++ b/usr/src/cmd/slim-install/finish/install-finish Wed Apr 01 06:44:35 2009 -0600
@@ -197,7 +197,7 @@
     sa.append(icto.remove_bootpath())
     sa.append(icto.fix_grub_entry())
     sa.append(icto.add_other_OS_to_grub_menu())
-    sa.append(icto.update_dumpadm_nodename())
+    sa.append(icto.update_dumpadm())
     sa.append(icto.explicit_bootfs())
     sa.append(icto.enable_happy_face_boot())
     sa.append(icto.update_boot_archive())
diff -r 81312779d945 usr/src/lib/libict_pymod/ict.py
--- a/usr/src/lib/libict_pymod/ict.py Thu Jan 29 14:52:08 2009 -0700
+++ b/usr/src/lib/libict_pymod/ict.py Tue Mar 31 13:47:12 2009 -0600
@@ -120,7 +120,7 @@
 ICT_REMOVE_LIVECD_COREADM_CONF_FAILURE,
 ICT_SET_BOOT_ACTIVE_TEMP_FILE_FAILURE,
 ICT_FDISK_FAILED,
-ICT_UPDATE_DUMPADM_NODENAME_FAILED,
+ICT_UPDATE_DUMPADM_FAILED,
 ICT_ENABLE_NWAM_AI_FAILED,
 ICT_ENABLE_NWAM_FAILED,
 ICT_FIX_FAILSAFE_MENU_FAILED,
@@ -862,8 +862,8 @@
             return ICT_ADD_SPLASH_IMAGE_FAILED
         return 0
 
-    def update_dumpadm_nodename(self):
-        '''ICT - Update nodename in dumpadm.conf
+    def update_dumpadm(self):
+        '''ICT - Update nodename and DUMPADM_ENABLE in dumpadm.conf
         Note: dumpadm -r option does not work!!
         returns 0 for success, error code otherwise
         '''
@@ -876,41 +876,42 @@
             fnode.close()
         except OSError, (errno, strerror):
             prerror('Error in accessing ' + nodename + ': ' + strerror)
-            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-            return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+            return ICT_UPDATE_DUMPADM_FAILED
         except:
             prerror('Unrecognized error in accessing ' + nodename)
             prerror(traceback.format_exc()) #traceback to stdout and log
-            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-            return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+            return ICT_UPDATE_DUMPADM_FAILED
         nodename = na[0][:-1]
         try:
             (fp, newdumpadmfile) = tempfile.mkstemp('.conf', 'dumpadm', '/tmp')
             os.close(fp)
         except OSError, (errno, strerror):
             prerror('Error in writing to temporary file: ' + strerror)
-            prerror('Cannot update dumpadm nodename ' + filename)
-            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-            return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+            prerror('Cannot update dumpadm ' + filename)
+            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+            return ICT_UPDATE_DUMPADM_FAILED
         except:
-            prerror('Unrecognized error - cannot update dumpadm nodename ' + filename)
+            prerror('Unrecognized error - cannot update dumpadm ' + filename)
             prerror(traceback.format_exc()) #traceback to stdout and log
-            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-            return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+            return ICT_UPDATE_DUMPADM_FAILED
 
-        status = _cmd_status('cat ' + dumpadmfile + ' | '+
-            'sed s/opensolaris/' + nodename + '/ > ' + newdumpadmfile)
+        sedcmd = 'sed -e \'s/^DUMPADM_ENABLE=.*/DUMPADM_ENABLE=yes/\' -e \'s/opensolaris/' +\
+            nodename + '/\' ' + dumpadmfile + ' > ' + newdumpadmfile
+        status = _cmd_status(sedcmd)
         if status != 0:
             try:
                 os.unlink(newdumpadmfile)
             except OSError:
                 pass
-            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-            return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+            return ICT_UPDATE_DUMPADM_FAILED
 
         if not _move_in_updated_config_file(newdumpadmfile, dumpadmfile):
-            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-            return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+            prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+            return ICT_UPDATE_DUMPADM_FAILED
         return 0
 
     def explicit_bootfs(self):
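To make the effect of the two sed expressions in the patch easier to follow, here is a rough sketch of the same text transformation in pure Python. It is not the proposed fix itself: the function name update_dumpadm_text, the placeholder parameter, and the DUMPADM_SAVDIR line in the sample are assumptions for illustration. Only DUMPADM_ENABLE and the 'opensolaris' placeholder come from the patch.

```python
import re


def update_dumpadm_text(conf_text, nodename, placeholder='opensolaris'):
    """Return conf_text with savecore enabled and the placeholder replaced.

    Mirrors the two sed expressions in the patch:
      s/^DUMPADM_ENABLE=.*/DUMPADM_ENABLE=yes/
      s/opensolaris/<nodename>/   (first occurrence on each line only)
    """
    out = []
    for line in conf_text.splitlines(keepends=True):
        # '.' does not match the newline, so the line ending is preserved.
        line = re.sub(r'^DUMPADM_ENABLE=.*', 'DUMPADM_ENABLE=yes', line)
        # Literal replacement, so characters in nodename are never
        # interpreted as pattern syntax.
        line = line.replace(placeholder, nodename, 1)
        out.append(line)
    return ''.join(out)


# Hypothetical file contents; the DUMPADM_SAVDIR line is an assumed example
# of where the placeholder host name might appear.
sample = "DUMPADM_SAVDIR=/var/crash/opensolaris\nDUMPADM_ENABLE=no\n"
print(update_dumpadm_text(sample, 'myhost'))
```

One difference from the sed version is that the node name is substituted literally here, so a host name containing a character such as '/' would not break the substitution.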
Not a blocker for 2009.06.
Changing priority to reflect decision on addressing this for 2009.06
*** Bug 8928 has been marked as a duplicate of this bug. ***
*** Bug 8983 has been marked as a duplicate of this bug. ***
This was never addressed for 2009.0906 and it needs to be considered again for 2010.1002.