Bug 8883 - AI seems to be writing a bad partition table to xVM disks
Status: RESOLVED FIXINSOURCE
Product: installer
Component: autoinstall
Version: unspecified
Platform: Xen/amd64 All
Importance: P2 critical
Target Milestone: ---
Assigned To: installer watcher
: 6372
Reported: 2009-05-13 06:03 UTC by Tim Foster
Modified: 2009-05-17 15:58 UTC (History)

Attachments
symptomatic zpool create hang (1.39 KB, text/plain)
2009-05-13 06:03 UTC, Tim Foster
The AI install_log for a hanging AI install (1.74 KB, text/plain)
2009-05-13 06:04 UTC, Tim Foster
The AI application-auto-installer:default.log SMF log (4.15 KB, text/plain)
2009-05-13 06:05 UTC, Tim Foster


Description Tim Foster 2009-05-13 06:03:51 UTC
Created an attachment (id=1932) [details]
symptomatic zpool create hang

All AI installs of PV 2009.06 111b guests were hanging early on, while
creating the zpool on the disk assigned to the guest.  The symptoms inside
the guest are shown in the attachment "hung-zpool-create.txt".

To reproduce, I created a 16 GB file using

# mkfile -n 16g /xvmpool/ai_pv.raw

then started a virt-install

# virt-install -n ai_pv -r 1024 -p -f /xvmpool/ai_pv.raw -b nge1 \
    -l http://x.y.z.a:5555/var/tmp/AI/targets/x86_svc \
    --autocf install_service=x86_svc

After the guest had hung, I quit the install, and destroyed the guest:

# virsh destroy ai_pv
# virsh undefine ai_pv

then created a new guest, this time booting to single-user mode:

# virt-install -n ai_pv -r 1024 -p -f /xvmpool/ai_pv.raw -b nge1 \
    -l http://x.y.z.a:5555/var/tmp/AI/targets/x86_svc \
    --autocf install_service=x86_svc \
    --extra-args -s

From there, I examined the partition table on the disk, and verified we 
couldn't newfs the disk:

root@opensolaris:~# prtvtoc /dev/rdsk/c7t0d0p0
* /dev/rdsk/c7t0d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*    2087 cylinders
*    2087 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*    First     Sector    Last
*    Sector     Count    Sector
*           0     16065     16064
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      0    00      16065  33527655  33543719
       2      5    01          0  33554432  33554431
       8      1    01          0     16065     16064
root@opensolaris:~# newfs /dev/dsk/c7t0d0s0
newfs: construct a new file system /dev/rdsk/c7t0d0s0: (y/n)? y
read error on sector 33527654: I/O error
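
The sector counts in the table above can be converted back into cylinders to see how they relate to the geometry the label reports. This is a minimal Python sketch and not part of the original report: the helper names `to_cylinders` and `overshoot` are hypothetical, and the constants are copied from the prtvtoc output.

```python
SECTORS_PER_CYLINDER = 16065   # "sectors/cylinder" in the prtvtoc output
LABEL_CYLINDERS = 2087         # "accessible cylinders" in the prtvtoc output
LABEL_SECTORS = LABEL_CYLINDERS * SECTORS_PER_CYLINDER


def to_cylinders(sectors):
    """Split a sector count into (whole cylinders, leftover sectors)."""
    return divmod(sectors, SECTORS_PER_CYLINDER)


def overshoot(first, count):
    """Sectors by which a slice runs past the geometry the label reports."""
    return max(0, first + count - LABEL_SECTORS)


# (first sector, sector count) for each slice, copied from the table
slices = {0: (16065, 33527655), 2: (0, 33554432), 8: (0, 16065)}

for number, (first, count) in sorted(slices.items()):
    print(number, to_cylinders(count), overshoot(first, count))
```

By this arithmetic, slice 0 is exactly 2087 cylinders long but starts at cylinder 1, so it ends one full cylinder past the 2087 cylinders the label reports, and slice 2 runs further still. The sector in the newfs error, 33527654, would be the last sector of slice 0 if it is counted from the start of the slice.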


I then destroyed the guest and tried the single-user mode boot again, this
time with a new disk:

# virsh destroy ai_pv
# virsh undefine ai_pv
# rm /xvmpool/ai_pv.raw
# mkfile -n 16g /xvmpool/ai_pv.raw
# virt-install -n ai_pv -r 1024 -p -f /xvmpool/ai_pv.raw -b nge1 \
    -l http://x.y.z.a:5555/var/tmp/AI/targets/x86_svc \
    --autocf install_service=x86_svc \
    --extra-args -s

This time we saw the following disk layout, and verified that we were able to
newfs the disk:

root@opensolaris:~# prtvtoc /dev/dsk/c7t0d0p0
* /dev/dsk/c7t0d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*    2088 cylinders
*    2088 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      0    00          0  33554432  33554431
       2      5    01          0  33554432  33554431
       8      1    01          0     16065     16064
root@opensolaris:~# newfs /dev/dsk/c7t0d0s0
newfs: construct a new file system /dev/rdsk/c7t0d0s0: (y/n)? y
Warning: 4096 sector(s) in last cylinder unallocated
/dev/rdsk/c7t0d0s0:    33554432 sectors in 5462 cylinders of 48 tracks, 128 sectors
    16384.0MB in 342 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
......
super-block backups for last 10 cylinder groups at:
 32638496, 32736928, 32835360, 32933792, 33032224, 33130656, 33229088,
 33327520, 33425952, 33524384

I'll attach the install_log and auto-installer SMF log files. The former
contains the following, which might be relevant:

<TDDM_E May 13 07:06:45> ddm_drive_get_ctype():Can't get DM_CONTROLLER assoc. w/ DM_DRIVE, err=0
<OM May 13 07:06:55> System reports enough physical memory for installation, swap is optional
<AI May 13 07:06:55> Checking any disks for minimum recommended size of 12646 MB
<AI May 13 07:06:55> Disk c7t0d0 size listed as 16384 MB
<AI May 13 07:06:55> Default disk selected is c7t0d0
<AI May 13 07:06:55> Cannot find the partitions for disk c7t0d0 on the target system
<OM May 13 07:06:55> No disk partitions defined prior to install
<AI May 13 07:06:55> Disk name selected for installation is c7t0d0
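
Log lines in this format can be split mechanically, including when line wrapping has run two entries together. The following Python sketch is only an illustration: the pattern is inferred from the excerpt above rather than from any documented log format, and `split_entries` is a hypothetical helper.

```python
import re

# Header of one entry, e.g. "<AI May 13 07:06:55> "
ENTRY = re.compile(
    r"<(?P<tag>[A-Z_]+) (?P<stamp>[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d)> "
)


def split_entries(text):
    """Return (tag, timestamp, message) tuples, one per log entry."""
    matches = list(ENTRY.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # An entry's message runs up to the next header, or to end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = " ".join(text[m.end():end].split())  # rejoin wrapped lines
        entries.append((m.group("tag"), m.group("stamp"), message))
    return entries


sample = (
    "<AI May 13 07:06:55> Checking any disks for minimum recommended size of 12646\n"
    "MB<AI May 13 07:06:55> Disk c7t0d0 size listed as 16384 MB\n"
)
for tag, stamp, message in split_entries(sample):
    print(tag, stamp, message)
```

Filtering the result on the `AI` tag isolates the installer's own messages, such as the "Cannot find the partitions" line.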
Comment 1 Tim Foster 2009-05-13 06:04:50 UTC
Created an attachment (id=1933) [details]
The AI install_log for a hanging AI install
Comment 2 Tim Foster 2009-05-13 06:05:43 UTC
Created an attachment (id=1934) [details]
The AI application-auto-installer:default.log SMF log
Comment 3 Ethan Quach 2009-05-13 13:40:17 UTC
Does virt-install create the partition table and an initial vtoc label on
the given disk file?

Does not using the -n during the mkfile make a difference?

In your second test (with the fresh disk file), if you do a zpool
create instead of newfs, what is the result.

(marking incomplete to get this data)

I will also be trying to reproduce this here.
Comment 4 Ethan Quach 2009-05-13 15:39:12 UTC
A couple of updates...

This bug appears to happen regardless of whether the underlying
guest disk is a file or a zvol.

This bug appears to have been introduced in 111a.  I wasn't able to
reproduce this with 111.
Comment 5 David Comay 2009-05-13 20:35:09 UTC
Adding this to the blocker list while we continue triaging it.
Comment 6 Tim Foster 2009-05-14 05:49:33 UTC
(In reply to comment #3)
> Does virt-install create the partition table and an initial vtoc label on
> the given disk file?

No, I think we leave that to the installer.

> Does not using the -n during the mkfile make a difference?

Nope, we get the same behaviour with complete files as sparse files.

> In your second test (with the fresh disk file), if you do a zpool
> create instead of newfs, what is the result.

The zpool create succeeds as expected, though I do need the -f flag (which AI
also uses):

root@opensolaris:~# zpool create rpool c7t0d0s0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c7t0d0s0 overlaps with /dev/dsk/c7t0d0s2
root@opensolaris:~# zpool create -f rpool c7t0d0s0
root@opensolaris:~# zpool status -v
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      c7t0d0s0  ONLINE       0     0     0

errors: No known data errors
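
The overlap that zpool complains about follows directly from the default label on the untouched disk, where slice 0 and slice 2 cover the same sectors. The Python sketch below shows only that range check; it is not how zpool itself performs the test, and `overlaps` is a hypothetical helper.

```python
def overlaps(a, b):
    """True if two (first_sector, last_sector) ranges share any sector."""
    return a[0] <= b[1] and b[0] <= a[1]


# (first sector, last sector) from the prtvtoc output of the fresh disk
s0 = (0, 33554431)   # slice 0
s2 = (0, 33554431)   # slice 2
s8 = (0, 16064)      # slice 8

print(overlaps(s0, s2))   # True
print(overlaps(s0, s8))   # True
```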
Comment 7 Ethan Quach 2009-05-14 08:25:01 UTC
(In reply to comment #6)
> (In reply to comment #3)
> > Does virt-install create the partition table and an initial vtoc label on
> > the given disk file?
> 
> No, I think we leave that to the installer.

The reason I ask is that in this test case you've provided
a fresh disk file and booted -s, so nothing has touched the disk
yet, but prtvtoc still returned a valid table (maybe it
defaults to some initial table).

> 
> > In your second test (with the fresh disk file), if you do a zpool
> > create instead of newfs, what is the result.
> 
> The zpool create succeeds as expected, though I do need a -f flag (which AI
> uses too however)

So far, this appears to be a zpool issue (though maybe in combination
with something the installer is doing since with an untouched partition
table, the zpool creation succeeds.)  We're getting some zfs folks
involved to help diagnose this ...
Comment 8 Ethan Quach 2009-05-14 09:32:49 UTC
For this given disk size scenario, the installer does seem to be
writing a different vtoc label in 111a as compared to 111.

In 111, the fdisk partition is of length 2087 cylinders, and the vtoc label
written seems to be correct for that (note slice 2 in the vtoc table):

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1       Active    Solaris2          1  2087    2087    100


Current partition table (original):
Total disk cylinders available: 2087 + 0 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm     132 - 2086       14.98GB    (1955/0/0) 31407075
  1       swap    wu       1 -  131        1.00GB    (131/0/0)   2104515
  2     backup    wu       0 - 2086       15.99GB    (2087/0/0) 33527655
  3 unassigned    wm       0               0         (0/0/0)           0
  4 unassigned    wm       0               0         (0/0/0)           0
  5 unassigned    wm       0               0         (0/0/0)           0
  6 unassigned    wm       0               0         (0/0/0)           0
  7 unassigned    wm       0               0         (0/0/0)           0
  8       boot    wu       0 -    0        7.84MB    (1/0/0)       16065
  9 unassigned    wm       0               0         (0/0/0)           0


In 111a and 111b, the fdisk partition is still of length 2087 cylinders,
but the vtoc label written appears to be of length 2088 (note slice 2):

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1       Active    Solaris2          1  2087    2087    100


Current partition table (original):
Total disk cylinders available: 2087 + 0 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0 unassigned    wm       1 - 2087       15.99GB    (2087/0/0) 33527655
  1 unassigned    wm       0               0         (0/0/0)           0
  2     backup    wu       0 - 2087       16.00GB    (2088/170/2) 33554432
  3 unassigned    wm       0               0         (0/0/0)           0
  4 unassigned    wm       0               0         (0/0/0)           0
  5 unassigned    wm       0               0         (0/0/0)           0
  6 unassigned    wm       0               0         (0/0/0)           0
  7 unassigned    wm       0               0         (0/0/0)           0
  8       boot    wu       0 -    0        7.84MB    (1/0/0)       16065
  9 unassigned    wm       0               0         (0/0/0)           0



I don't know if that's what causes zpool to hang, but George is looking
at that right now.  Either way, it does appear that something changed in
the installer in 111a such that a different size is written in the vtoc label.
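
The Blocks column in both tables can be cross-checked against the disk geometry. The Python sketch below assumes the triple in parentheses is (cylinders/heads/sectors) and takes 255 heads and 63 sectors per track from the earlier prtvtoc output; `chs_to_sectors` is a hypothetical helper.

```python
HEADS = 255      # "tracks/cylinder" from the prtvtoc output
SECTORS = 63     # "sectors/track" from the prtvtoc output


def chs_to_sectors(cylinders, heads, sectors):
    """Convert a (cylinders/heads/sectors) length into a sector count."""
    return (cylinders * HEADS + heads) * SECTORS + sectors


# Size of the 16 GB backing file in 512-byte sectors
DISK_SECTORS = 16 * 1024**3 // 512

print(chs_to_sectors(2087, 0, 0))     # 33527655, slice 2 in 111
print(chs_to_sectors(2088, 170, 2))   # 33554432, slice 2 in 111a/111b
print(DISK_SECTORS)                   # 33554432
```

Under those assumptions, the 111a/111b slice 2 is exactly the size of the whole backing file, while the 111 slice 2 matches the 2087-cylinder fdisk partition.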
Comment 9 Ethan Quach 2009-05-14 09:50:02 UTC
> 
> I don't know if that's what causes zpool to hang, but George is looking
> at that right now.  Either way, it does appear that something changed in
> the installer in 111a such that a different size is written in the vtoc label.

Just want to clarify that it might not be something in the installer that
changed, but rather something in 111a changed such that a +1 sized vtoc
label is written.  This obviously doesn't happen on bare metal.

The update from George is that it ... might be related as we've tried to
read from the labels at the end of the disk. This has resulted in an EIO
which then gets us to call into the driver to see if the drive was removed:

       if (zio->io_error == EIO) {
               vdev_disk_t *dvd = vd->vdev_tsd;
               int state = DKIO_NONE;

               if (ldi_ioctl(dvd->vd_lh, DKIOCSTATE, (intptr_t)&state,
                   FKIOCTL, kcred, NULL) == 0 && state != DKIO_INSERTED) {
                       vd->vdev_remove_wanted = B_TRUE;
                       spa_async_request(zio->io_spa, SPA_ASYNC_REMOVE);
               }
       }

This ioctl() is stuck in the driver:

> 0xd4ee1dc0::findstack -v
stack pointer for thread d4ee1dc0: d4ee1ad8
 d4ee1b18 swtch+0x188()
 d4ee1b28 cv_wait+0x53(d2ec2de8, d2ec2d6c, 0, 0)
 d4ee1b68 cv_wait_sig+0x260(d2ec2de8, d2ec2d6c, 0, 0)
 d4ee1ba8 xdf_dkstate+0x45(d2ec2c00, 0, 4, 80100000)
 d4ee1c58 xdf_ioctl+0x1e8()
 d4ee1c88 cdev_ioctl+0x31(3040000, 40d, d4ee1cec, 80100000, d2e3be78, 0)
 d4ee1cb8 ldi_ioctl+0xa2(d4da2030, 40d, d4ee1cec, 80000000, d2e3be78, 0)
 d4ee1cf8 vdev_disk_io_done+0x43(d4857778, d4857778)
 d4ee1d28 zio_vdev_io_done+0xe0(d4857778, d2e342a8, d4ee1d48, d4604430)
 d4ee1d48 zio_execute+0x81(d4857778, d4604430)
 d4ee1da8 taskq_thread+0x192(d2e34288, 0)
 d4ee1db8 thread_start+8()

It seems like there's a bug in the Xen xdf_dkstate() function. Can the Xen
guys take a look? 

So can someone from the xVM team take a look at this, please?
Comment 10 Dave Miner 2009-05-14 09:53:23 UTC
We should at least take a look at the fix for bug 7758, as it's not obvious to
me why we're creating swap in the old case and not in the new.  Or was that a
result of the workaround provided by the bug 7718 fix that was removed?
Comment 11 Ethan Quach 2009-05-14 10:09:31 UTC
(In reply to comment #10)
> We should at least take a look at the fix for bug 7758, as it's not obvious to
> me why we're creating swap in the old case and not in the new.  Or was that a
> result of the workaround provided by the bug 7718 fix that was removed?

7718 itself was the workaround, that's why 111 has the swap slice created.
This was removed in 111a.

7758 is indeed a totally separate bug, that was fixed in 111a so I'll take
a look at that.
Comment 12 David Comay 2009-05-14 10:37:49 UTC
Has a Live CD PV install been attempted and if so, what was the result?
Comment 13 Tim Foster 2009-05-14 12:28:43 UTC
Yep, our test folks have done pv gui installs of nv_111a.
Comment 14 David Comay 2009-05-14 12:36:13 UTC
By nv_111a, I assume you mean OpenSolaris and not Nevada, right?

For completeness, we should also verify the same for OpenSolaris snv_111b.
Comment 15 Ethan Quach 2009-05-14 14:40:29 UTC
We looked at some recent pushes in the caiman gate and bug 7758
looked like it modified code which could have caused the slice
sizes in the vtoc label to have changed for bld 111a.  So we built
an AI image with this backed out, and the problem went away.  So
at this point we are treating this as a regression introduced
by 7758, and we're looking at a fix in this area.
Comment 16 Ethan Quach 2009-05-14 23:41:27 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #3)
> > > Does virt-install create the partition table and an initial vtoc label on
> > > the given disk file?
> > 
> > No, I think we leave that to the installer.
> 
> Reason why I ask this is because in this test case, you've provided
> a fresh disk file, and booted -s, so nothing has touched the disk
> yet, but prtvtoc seems to have returned a valid table (maybe it
> defaults to some initial table perhaps.)

This turns out to be a contributing factor to the bug in the installer.
Before the disk is even touched, why does prtvtoc report a label there,
and in particular an entry for slice 0?  This appears to be what
confuses the installer.  On bare metal, prtvtoc does not report anything
when given a blank disk, so this behavior appears to be specific to xVM
guests.  Is this expected, or a bug?


What's happening is that the installer checks for a vtoc label
up front, and it gets exactly what prtvtoc shows above in the
second test case, where the disk hadn't been touched yet: slice 0
being 2088 cylinders, or 33554432 sectors.

   root@opensolaris:~# prtvtoc /dev/dsk/c7t0d0p0
   * /dev/dsk/c7t0d0p0 partition map
   *
   * Dimensions:
   *     512 bytes/sector
   *      63 sectors/track
   *     255 tracks/cylinder
   *   16065 sectors/cylinder
   *    2088 cylinders
   *    2088 accessible cylinders
   *
   * Flags:
   *   1: unmountable
   *  10: read-only
   *
   *                          First     Sector    Last
   * Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
          0      0    00          0  33554432  33554431
          2      5    01          0  33554432  33554431
          8      1    01          0     16065     16064


Then when it comes time to start installing and writing to the disk,
the installer has to create a solaris2 fdisk partition, and it creates
that with "fdisk -n -B cXtXdXp0", which creates a solaris2 partition
on the whole disk, but avoids cylinder 0.  This means the solaris2
partition is only 2087 cylinders long:

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1       Active    Solaris2          1  2087    2087    100


Next it writes the vtoc label, but since it thinks there was an existing
slice 0 (from that upfront query for a vtoc label)  it uses that slice
sizing information to write the label, and obviously 2088 doesn't fit in
the 2087 partition.  This is the part that's the bug in the installer.
If it had to create the fdisk partition, it should use better logic on
whether it should trust whatever vtoc information it had gathered
upfront.


But still, it seems like there really shouldn't be any vtoc information
returned when the disk is blank.
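
The mismatch described above can be reproduced with a few lines of arithmetic. This Python sketch is an illustration only: `slice_fits` is a hypothetical helper, and it assumes slice offsets are counted in sectors from the start of the fdisk partition.

```python
SECTORS_PER_CYLINDER = 16065


def slice_fits(slice_first, slice_count, partition_cylinders):
    """True if a slice ends inside a partition of the given length."""
    partition_sectors = partition_cylinders * SECTORS_PER_CYLINDER
    return slice_first + slice_count <= partition_sectors


# Slice 0 as reported by the default label on the untouched disk
stale_slice0 = (0, 33554432)
# Partition created by the installer: cylinders 1..2087, so 2087 long
new_partition_cylinders = 2087

print(slice_fits(*stale_slice0, new_partition_cylinders))            # False
print(slice_fits(0, 2087 * SECTORS_PER_CYLINDER, new_partition_cylinders))  # True
```

The stale slice 0 is 26777 sectors longer than the new partition, whereas a slice sized from the partition itself fits exactly.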
Comment 17 Tim Foster 2009-05-15 01:00:28 UTC
I think the default label comes from xdf_cmlb_attach()->cmlb_attach() by
specifying CMLB_FAKE_LABEL_ONE_PARTITION.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/xen/io/xdf.c#590
I note it differs in PV vs. HVM domains, but this doesn't strike me as bad:
brand new lofi devices also come with a default partition table, for example.
Comment 18 David Comay 2009-05-15 11:02:00 UTC
Tim, is this issue only seen with PV guests?  Perhaps that's been stated
somewhere but I didn't see it called out explicitly.

Also, I thought that PV guests with AI had other necessary changes.  Is that
not the case?

From my perspective, what's important to fix right now is any code which
prevents the booting of the AI image, the subsequent installation of the guest
and anything which prevents the network from working after the fact (in order to
download sustaining fixes).  Whether the issue is the installer or the virtual
disk driver or something else, I'd like to understand if there are remaining
*client-side* issues that prevent PV guests from performing these steps.
Comment 19 Tim Foster 2009-05-15 11:44:36 UTC
(In reply to comment #18)
> Tim, is this issue only seen with PV guests?  Perhaps that's been stated
> somewhere but I didn't see it called out explicitly.

I've only seen the problem with PV guests. In theory it's possible other
platforms presenting similar looking disks to the installer could hit this. It
might be good to verify that VirtualBox and VMware guests don't hit this
problem. [ HVM guests don't, for example ]

> Also, I thought that PV guests with AI had other necessary changes.  Is that
> not the case?

Yes, there are virt-install changes pending that allow users to initiate PV
guest AI installs; they haven't been put back yet (part of the xVM 3.3 work).
It is possible to start an AI install of a PV guest even from an nv_111 dom0
without these changes, it's just a bit more complex.

However, the problem lies in the contents of the AI image itself. In the
future, with our changes available in, say, an nv_116 dom0, we still wouldn't
be able to install a PV guest from a stock 2009.06_111b AI image.
Another AI image would need to be produced at a later date for PV guest
installs to work.

> From my perspective, what's important to fix right now is any code which
> prevents the booting of the AI image, the subsequent installation of the guest
> and anything which prevents the network from working after the fact (in order to
> download sustaining fixes).  Whether the issue is the installer or the virtual
> disk driver or something else, I'd like to understand if there are remaining
> *client-side* issues that prevent PV guests from performing these steps.

So yes, I think this fix would qualify for the above: without the fix in the AI
image (booted from the AI server, running on the client), we never get to
install.
Comment 20 David Comay 2009-05-15 11:53:59 UTC
Thanks Tim.  Are we aware of any additional changes that live on the image that
prevent PV boot/install from taking place then?
Comment 21 Tim Foster 2009-05-15 12:04:59 UTC
No problem, I'm not aware of any yet. It'd be nice to test the fix in a real AI
image to be sure.
Comment 22 Ethan Quach 2009-05-15 12:14:15 UTC
We have a potential fix in the installer for this.  Basically, for
cases where the AI installer finds no fdisk partition, it shouldn't
try to read a vtoc label off the disk.
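
The rule proposed here can be sketched as a small decision function. This Python sketch is hypothetical and is not taken from the actual changesets: `initial_slices` and its parameters are invented names for illustration.

```python
def initial_slices(found_fdisk_partition, read_vtoc):
    """Return the slice layout the installer should start from.

    found_fdisk_partition: whether a Solaris fdisk partition already existed.
    read_vtoc: callable returning the slices read from the on-disk label.
    """
    if not found_fdisk_partition:
        # The partition will be created fresh, so any label the driver
        # reports cannot describe it; start with no slices.
        return []
    return read_vtoc()


# Default label seen on a blank PV guest disk: (slice, first sector, count)
fake_label = [(0, 0, 33554432), (2, 0, 33554432), (8, 0, 16065)]

print(initial_slices(False, lambda: fake_label))   # []
```

With this rule the default label reported for a blank disk is never consulted, so its 33554432-sector slice 0 cannot leak into the label written for the new 2087-cylinder partition.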
Comment 23 Ethan Quach 2009-05-17 15:58:33 UTC
Fixed in changesets

slim_source: cfb2652634d609b0ea7502d0cb997e47a67015a9
slim_0906:   f36b9d6820761ae7c0cf2d782285e9d839b519cc