Bug 5003 - default dumpadm config is not set to capture crash dumps
Status: FIXUNDERSTOOD
Product: installer
Component: other
Version: unspecified
Hardware: i86pc/amd64 OpenSolaris
Importance: P3 major
Target Milestone: 2010.1H
Assigned To: installer watcher
: rn4, reviewed-2009.06
: 4637
Reported: 2008-11-13 08:15 UTC by Brian Ruthven
Modified: 2009-08-12 14:32 UTC (History)
Description Brian Ruthven 2008-11-13 08:15:12 UTC
Once 2008.11 is installed, /etc/dumpadm.conf contains:

DUMPADM_ENABLE=no

Any panics will therefore not be automatically saved to /var/crash, but have to
be captured manually using savecore.
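
For anyone wanting to check an installed system, the following is a minimal
sketch of reading that setting. It assumes dumpadm.conf is a flat list of
KEY=value lines with '#' comments; the function name savecore_enabled is
made up for illustration and is not part of the installer.

```python
def savecore_enabled(conf_text):
    """Return True if the text contains DUMPADM_ENABLE=yes.

    Assumption: the file is plain KEY=value lines, with '#' comments.
    """
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a KEY=value pair.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "DUMPADM_ENABLE":
            return value.strip().lower() == "yes"
    # No DUMPADM_ENABLE line at all: treat as not enabled.
    return False


if __name__ == "__main__":
    with open("/etc/dumpadm.conf") as conf:
        print(savecore_enabled(conf.read()))
```

On a fresh 2008.11 install, with the content shown above, this would report
that automatic savecore is not enabled.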
Comment 1 David Comay 2008-11-15 18:33:28 UTC
The installer previously was responsible for this so recategorizing.  As this
affects serviceability, I'm putting this as a potential blocker.
Comment 2 David Comay 2008-11-15 18:34:41 UTC
Reset Assignee and QA Contact.
Comment 3 JoeV 2008-11-17 04:57:48 UTC
I believe the fix should be to have ICT update_dumpadm_nodename also set
DUMPADM_ENABLE=yes

The name of the ICT should also change to simply: update_dumpadm
Comment 4 Dave Miner 2008-11-17 08:21:09 UTC
At the moment, this is regarded as not a bug.  During discussion of swap and
dump sizing when we were converting the installer to use zvols, it was
suggested that, since we're always using a dedicated dump volume, enabling
savecore is unnecessary: the dump data will not be competing with swap, and
thus reboots after a panic could complete more quickly by avoiding several
hundred megabytes (or much more) of I/O.  That's what was implemented in the
fix for bug 2161.  We should continue with this initial path and see how it
goes; it can always be revisited once we have experience with it.  That prior
discussion is in the archives beginning with
http://mail.opensolaris.org/pipermail/caiman-discuss/2008-June/004262.html

We can add a release note telling users that, if they prefer to have crash
dumps saved at boot, they can enable this; they might instead consider creating
a cron job which runs savecore at some later point.
Comment 5 JoeV 2008-11-17 10:01:50 UTC
I sent the release note.

Based on Dave Miner's comments, I am marking this closed/invalid.
Comment 6 George Wilson 2009-01-29 12:40:31 UTC
The ZFS team views this as a bug. We need to ensure that crash dumps are
collected automatically on reboot. I'm reopening this so that a fix can be
implemented.
Comment 7 Glenn Lagasse 2009-01-30 12:55:47 UTC
Can we get a clearer explanation of why the ZFS team views this as a bug?

The reasons for the current behavior are spelled out quite clearly in comment
#4.  I don't see any data to suggest that we've evaluated how the experiment is
working out.  So I'm disinclined to change this behavior until that data is
collected and evaluated.

To be clear, crash dumps are not lost upon reboot.  They remain on the dump
device; they just aren't saved automatically to /var/crash as they were in the
past.
Comment 8 Dave Miner 2009-02-02 15:41:18 UTC
George, input would be appreciated on why this is particularly problematic for
your team.  We don't regard the current behavior of dumpadm as being optimal
for a ZFS-rooted system which will update frequently with clones, as the
/var/crash area ends up within a BE and thus disk space consumed by the dumps
will be difficult to free.  I've been discussing these issues with services to
arrive at better settings, as well as to consider other changes we might make
which would be appropriate.

At a minimum, before enabling automatic savecores I'd like to relocate them to
an additional file system, such as rpool/crashdump (mounted at /crashdump) and
reset dumpadm's config to point there.  Alternative naming suggestions
welcomed.
Comment 9 Sanjay Nadkarni 2009-02-02 21:41:45 UTC
Speaking with members of the ZFS team and others, the primary issue is that
without dumps, debugging is not possible.  After a system panics, the focus is
on being able to reliably collect the information.  If a subsequent panic
occurs, the first crash dump will be overwritten and that information will be
lost, and at that point they have no way to determine that this happened.
Comment 10 George Wilson 2009-02-02 22:07:31 UTC
In addition to Sanjay's comments, we should not have to ask customers to run
'savecore' manually to collect information about a defect they just submitted.
Reducing the back and forth with a customer leads to a better customer
experience.

We have already seen instances of defects which did not contain the corefiles,
and we had to contact the customer to gather them. There have been cases where
the corefile could not be obtained because the crash had been overwritten. The
defect then goes stale unless it can be reproduced.
Comment 11 JoeV 2009-02-17 07:41:30 UTC
A possible fix is available but on hold as requested by Dave Miner.
Comment 12 JoeV 2009-04-01 05:30:14 UTC
One proposed fix is to update usr/src/cmd/slim-install/finish/install-finish
and usr/src/lib/libict_pymod/ict.py as follows:

diff -r 81312779d945 usr/src/cmd/slim-install/finish/install-finish
--- a/usr/src/cmd/slim-install/finish/install-finish    Thu Jan 29 14:52:08 2009 -0700
+++ b/usr/src/cmd/slim-install/finish/install-finish    Wed Apr 01 06:44:35 2009 -0600
@@ -197,7 +197,7 @@
        sa.append(icto.remove_bootpath())
        sa.append(icto.fix_grub_entry())
        sa.append(icto.add_other_OS_to_grub_menu())
-       sa.append(icto.update_dumpadm_nodename())
+       sa.append(icto.update_dumpadm())
        sa.append(icto.explicit_bootfs())
        sa.append(icto.enable_happy_face_boot())
        sa.append(icto.update_boot_archive())


diff -r 81312779d945 usr/src/lib/libict_pymod/ict.py
--- a/usr/src/lib/libict_pymod/ict.py   Thu Jan 29 14:52:08 2009 -0700
+++ b/usr/src/lib/libict_pymod/ict.py   Tue Mar 31 13:47:12 2009 -0600
@@ -120,7 +120,7 @@
 ICT_REMOVE_LIVECD_COREADM_CONF_FAILURE,
 ICT_SET_BOOT_ACTIVE_TEMP_FILE_FAILURE,
 ICT_FDISK_FAILED,
-ICT_UPDATE_DUMPADM_NODENAME_FAILED,
+ICT_UPDATE_DUMPADM_FAILED,
 ICT_ENABLE_NWAM_AI_FAILED,
 ICT_ENABLE_NWAM_FAILED,
 ICT_FIX_FAILSAFE_MENU_FAILED,
@@ -862,8 +862,8 @@
                        return ICT_ADD_SPLASH_IMAGE_FAILED
                return 0

-       def update_dumpadm_nodename(self):
-               '''ICT - Update nodename in dumpadm.conf
+       def update_dumpadm(self):
+               '''ICT - Update nodename and DUMPADM_ENABLE in dumpadm.conf
                Note: dumpadm -r option does not work!!
                returns 0 for success, error code otherwise
                '''
@@ -876,41 +876,42 @@
                        fnode.close()
                except OSError, (errno, strerror):
                        prerror('Error in accessing ' + nodename + ': ' + strerror)
-                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-                       return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+                       return ICT_UPDATE_DUMPADM_FAILED
                except:
                        prerror('Unrecognized error in accessing ' + nodename)
                        prerror(traceback.format_exc()) #traceback to stdout and log
-                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-                       return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+                       return ICT_UPDATE_DUMPADM_FAILED
                nodename = na[0][:-1]
                try:
                        (fp, newdumpadmfile) = tempfile.mkstemp('.conf', 'dumpadm', '/tmp')
                        os.close(fp)
                except OSError, (errno, strerror):
                        prerror('Error in writing to temporary file: ' + strerror)
-                       prerror('Cannot update dumpadm nodename ' + filename)
-                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-                       return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+                       prerror('Cannot update dumpadm ' + filename)
+                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+                       return ICT_UPDATE_DUMPADM_FAILED
                except:
-                       prerror('Unrecognized error - cannot update dumpadm nodename ' + filename)
+                       prerror('Unrecognized error - cannot update dumpadm ' + filename)
                        prerror(traceback.format_exc()) #traceback to stdout and log
-                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-                       return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+                       return ICT_UPDATE_DUMPADM_FAILED

-               status = _cmd_status('cat ' + dumpadmfile + ' | '+
-                   'sed s/opensolaris/' + nodename + '/ > ' + newdumpadmfile)
+               sedcmd = 'sed -e \'s/^DUMPADM_ENABLE=.*/DUMPADM_ENABLE=yes/\' -e \'s/opensolaris/' +\
+                   nodename + '/\' ' + dumpadmfile + ' > ' + newdumpadmfile
+               status = _cmd_status(sedcmd)
                if status != 0:
                        try:
                                os.unlink(newdumpadmfile)
                        except OSError:
                                pass
-                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-                       return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+                       return ICT_UPDATE_DUMPADM_FAILED

                if not _move_in_updated_config_file(newdumpadmfile, dumpadmfile):
-                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_NODENAME_FAILED')
-                       return ICT_UPDATE_DUMPADM_NODENAME_FAILED
+                       prerror('Failure. Returning: ICT_UPDATE_DUMPADM_FAILED')
+                       return ICT_UPDATE_DUMPADM_FAILED
                return 0
                return 0

        def explicit_bootfs(self):
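
To make the effect of the two sed expressions in the proposed fix easier to
see, here is a rough pure-Python sketch of the same text transformation. It is
not part of the patch: the function name update_dumpadm_text and the
placeholder parameter are hypothetical, and it works on a string rather than
on the real dumpadm.conf and temporary file.

```python
import re


def update_dumpadm_text(conf_text, nodename, placeholder="opensolaris"):
    """Mimic the two sed expressions from the proposed fix on a string.

    1. Any line starting with DUMPADM_ENABLE= becomes DUMPADM_ENABLE=yes.
    2. The first occurrence of the placeholder on each line is replaced by
       the node name (sed's s/// without the g flag behaves the same way).
    """
    out = []
    for line in conf_text.splitlines(keepends=True):
        # '.' does not match the newline, so line endings are preserved.
        line = re.sub(r"^DUMPADM_ENABLE=.*", "DUMPADM_ENABLE=yes", line)
        line = line.replace(placeholder, nodename, 1)
        out.append(line)
    return "".join(out)


# Illustrative input only; a real dumpadm.conf has more settings.
sample = "DUMPADM_SAVDIR=/var/crash/opensolaris\nDUMPADM_ENABLE=no\n"
print(update_dumpadm_text(sample, "myhost"), end="")
```

One difference worth noting: sed interprets characters such as '&' and '/' in
the replacement text, whereas str.replace here treats the node name literally.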
Comment 13 JoeV 2009-04-13 09:45:04 UTC
Not a blocker for 2009.06.
Comment 14 David Comay 2009-04-13 10:03:39 UTC
Changing priority to reflect decision on addressing this for 2009.06
Comment 15 Dave Miner 2009-05-14 06:12:48 UTC
*** Bug 8928 has been marked as a duplicate of this bug. ***
Comment 16 virginia.wray 2009-05-25 16:03:15 UTC
*** Bug 8983 has been marked as a duplicate of this bug. ***
Comment 17 Mary Ding 2009-08-12 14:32:41 UTC
This was never addressed for 2009.0906 and it needs to be considered for
2010.1002 again.