Bug 9909 - 4 hours of 'Jul 7 12:09:30 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata) instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 2' during boot
: 4 hours of 'Jul 7 12:09:30 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ...
Status: CLOSED TRACKEDINBUGSTER
Product: opensolaris
kernel
: 200906
: Xen/amd64 OpenSolaris
: P3 major with 1 vote (vote)
: ---
Assigned To: Watcher account for OpenSolaris kernel bugs
:
: http://malsserver.blogspot.com/2009/0...
: BugsterCR=6761786
:
:
:
  Show dependency treegraph
 
Reported: 2009-07-07 18:37 UTC by Malachi de AElfweald
Modified: 2009-08-24 18:04 UTC (History)
6 users (show)

See Also:


Attachments
prtconf (273.07 KB, text/plain)
2009-07-07 18:37 UTC, Malachi de AElfweald
no flags Details
prtdiag (1.23 KB, text/plain)
2009-07-07 18:37 UTC, Malachi de AElfweald
no flags Details
dmesg (702.40 KB, text/plain)
2009-07-07 18:38 UTC, Malachi de AElfweald
no flags Details
dtrace of ata_detach + genunix (313.32 KB, application/octet-stream)
2009-08-19 01:57 UTC, Malachi de AElfweald
no flags Details


Note

You need to log in before you can comment on or make changes to this bug.


Description Malachi de AElfweald 2009-07-07 18:37:16 UTC
Created an attachment (id=2226) [details]
prtconf

I have been trying to troubleshoot this since May 10th.  Originally it was on
200811, but I upgraded to 200906 while trying to fix it.  I upgraded via the
dev IPS last night and still have the same problem.

In essence, during bootup it gets to 'Hostname: serveris' then just sits there
for 4 hours before it goes any further.

dmesg is showing these lines over and over for 4 hours:
Jul  7 08:15:57 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata)
instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 0
Jul  7 08:15:57 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata)
instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 1
Jul  7 08:15:57 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata)
instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 2
Jul  7 08:15:57 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata)
instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 3

I find it especially interesting since there is nothing plugged into IDE 1. IDE
0 is used by the CD, which is mounting fine on bootup; and the rest of the
drives are SATA.

I do *not* have this problem if I boot onto a LiveCD.
Comment 1 Malachi de AElfweald 2009-07-07 18:37:55 UTC
Created an attachment (id=2227) [details]
prtdiag
Comment 2 Malachi de AElfweald 2009-07-07 18:38:44 UTC
Created an attachment (id=2228) [details]
dmesg
Comment 3 Jürgen Keil 2009-07-08 11:36:10 UTC
Is there a similar hang when you boot on metal?
Comment 4 Jürgen Keil 2009-07-08 11:49:32 UTC
Can you try to boot xVM with the kernel debugger enabled (kernel options
"-kv"),
and when it hangs / is in the state where it reports 

    xVM_psm: ide (ata) instance 1 irq 0xf vector 0x98 ioapic 0x4 intin ...

try to enter kmdb by pressing the F1 + 'a' key.  Does that enter kmdb?
If that does enter kmdb, try these commands

    apic_bind_intr:b

This sets a brakpoint in the  apic_bind_intr function, which is the function
that should print the "ide (ata) instance 1 irq 0xf vector 0x98 ioapic 0x4
intin ..." message.


    :c
    $C

    :c
    $C

The :c continue command should run the kernel and we should stop at the
apic_bind_intr breakpoint.  $C prints a stack backtrace.  Can you attach
a screendump (e.g. digital camera picture, or a log file from the serial
port console) of the stack backtrace?
Comment 5 Malachi de AElfweald 2009-07-08 12:52:09 UTC
As far as to 'when I boot on bare metal'.... I am not sure I follow.  It
happens when I boot from the SATA drives (regardless if I choose xVM or not)
but does not happen when I boot from the IDE cdrom onto a livecd.  But this is
the Dom0 not a DomU.

Personally, I am wondering why it tries about 1000 times an hour during boot
up.  You would think that any loop to try (ignoring the fact that the device it
is polling doesn't even exist) should have figured out to stop after the first
few tries.


I can't do that test at the moment (I am at work and the machine is at home). 
I had tried doing a debug boot before (see
http://malsserver.blogspot.com/2009/05/server-died.html ).

However, it did happen again just before midnight even though the system had
already been running about 12 hours....
Jul  7 21:50:12 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata)
instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 2
so whatever caused it *might* be detectable on the running system....
Comment 6 Malachi de AElfweald 2009-07-14 01:56:55 UTC
BTW while updating to 118, I tried disabling IDE1 in the BIOS and it made no
difference
Comment 7 Rich Burridge 2009-07-23 12:10:57 UTC
In order to get the attention of the Solaris engineers, I've
opened a bug in the Sun Bugster database, and am closing this one
as TRACKEDINBUGSTER.

If the Bugster bug is now in a public category/sub-category, the
URL of the new bug at bugs.opensolaris.org will be:

http://bugs.opensolaris.org/view_bug.do?bug_id=6863859

(This will happen the next time the new Bugster bug information
is pushed to bugs.opensolaris.org).

I've added the bug submitter to the Bugster interest list. If you'd
like to be removed, just let me know (by adding a comment to this bug).

If anybody else would like to be added to the cc: of the Bugster bug,
again just let me know (by adding a comment to this bug).

Note that there is currently no easy way for external-to-Sun
people to update the Bugster bug. That's been worked on. In the
meantime, if a Bugster evaluator asks you for further information
and you are external-to-Sun, then I suggest emailing that person
directly.

Thanks.
Comment 8 Kerry Shu 2009-07-23 15:39:53 UTC
(In reply to comment #5)
> 
> Personally, I am wondering why it tries about 1000 times an hour during boot
> up.  You would think that any loop to try (ignoring the fact that the device it
> is polling doesn't even exist) should have figured out to stop after the first
> few tries.

According to the dmesg, there are two ata instances on your system. ata0 is
your CD-ROM and is attached OK. ata1 is attached failure. Looks like there is
some kind of thread or timeout trying to attach ata1 again and again, so we saw
thousands of the ata1 interrupt binding messages.

Yes, pls boot with debugger enabled(appending "-kv" to GRUB kernel$ line). When
you see the message shows out, press 'F1 + a' to enter kmdb debugger. Then try
below:
[0]> ata_detach:b
[0]> :c
When the breakpoint is hit, it should enter debugger again, pls do below
command for the calling stack trace:
[0]> $c

Thanks,
Kerry
Comment 9 Andy 2009-07-26 16:52:03 UTC
As this is causing me some grief, please can you add me
spamtrap@stormcrow.org.uk to the bugster cc?  

Thanks

Andy
Comment 10 Rich Burridge 2009-07-27 08:02:35 UTC
Hi Andy,

(In reply to comment #9)
> As this is causing me some grief, please can you add me
> spamtrap@stormcrow.org.uk to the bugster cc?

Done.
Comment 11 Jürgen Keil 2009-07-27 11:31:51 UTC
(In reply to comment #8)

> According to the dmesg, there are two ata instances on your system. ata0 is
> your CD-ROM and is attached OK. ata1 is attached failure.

Looking at ASUS' L1N64-SLI manual[*] the mainboard has *one* p-ata connector.
I guess the ata1 channel shouldn't exist at all.



[*]
http://dlcdnet.asus.com/pub/ASUS/mb/Socket%20F/L1N64-SLI%20WS/e3014_l1n64-sli_ws.pdf
Comment 12 Malachi de AElfweald 2009-08-11 12:34:53 UTC
I received an email that says
http://bugs.opensolaris.org/view_bug.do?bug_id=6863859 was updated:

*** (#1 of 2): 2009-07-29 23:40:21 GMT+00:00 <User 1-UM-1864>
downgrade the CR to P4 until we get response from submitter.

However, that link shows no changes and does not allow any input.

I do see that there is a request for me to reboot and go into kernel mode
debugging to try to gather more information.  I do intend to do so; but as I
have to take 3 domains offline for 5+ hours to do so, I have been holding off
until there has been a need to reboot.

I also had thought that there was no longer any need to do so since Jurgen had
commented that the motherboard manual says that the device doesn't exist.

Also, Lester Cheung commented on my blog that he got the same behavior after
unplugging a DVD drive... but in my case, there has never been anything on
IDE1.  So, in regards to Kerry's comment... it SHOULD be an attachment failure,
shouldn't it -- since nothing is plugged into it? But it shouldn't keep trying
1000s of times.

Anyways, let me know if you still need me to do the reboot thing.  I am really
leary of doing it without a fix in place since it hoses all my domains for 5
hours.

Since the trackedinbugster bug emails say they are not going to work on it
until more information is provided -- and more information can not be provided
in bugster, I am reopening this bug since the communication evidentally has to
happen here instead.
Comment 13 Kerry Shu 2009-08-11 12:54:20 UTC
Thanks for the update. We can communicate here on this issue no matter if this
Bugzilla bug is closed or open. Since I've got your response, I'll continue to
work on the Bugster CR 6863859.

> 
> Also, Lester Cheung commented on my blog that he got the same behavior after
> unplugging a DVD drive... but in my case, there has never been anything on
> IDE1.  So, in regards to Kerry's comment... it SHOULD be an attachment failure,
> shouldn't it -- since nothing is plugged into it? But it shouldn't keep trying
> 1000s of times.

I need your help to narrow down who initiates the tryings. Pls refer to my
previous comments for the info I want. I don't know it's so convenient for you
to reboot the system :-(. Andy, could you help here?
Comment 14 Malachi de AElfweald 2009-08-11 13:14:55 UTC
Here is someone else reporting the same problem (and then referencing this bug)

http://opensolaris.org/jive/thread.jspa?threadID=108753
Comment 15 Kerry Shu 2009-08-11 15:02:29 UTC
If the warning messages are still seen after system boots up and is running, we
can use:
# dtrace -n ':::ata_detach:entry{stack();}'
to get the calling stack.
Comment 16 Malachi de AElfweald 2009-08-11 15:08:11 UTC
root@serveris:~# dtrace -n ':::ata_detach:entry{stack();}'
dtrace: invalid probe specifier :::ata_detach:entry{stack();}: Invalid library
dependency in /usr/lib/dtrace/iscsit.d: /usr/lib/dtrace/iscsi.d
Comment 17 Rich Burridge 2009-08-11 15:21:07 UTC
(In reply to comment #16)
> root@serveris:~# dtrace -n ':::ata_detach:entry{stack();}'
> dtrace: invalid probe specifier :::ata_detach:entry{stack();}: Invalid library
> dependency in /usr/lib/dtrace/iscsit.d: /usr/lib/dtrace/iscsi.d

I think you are hitting bug #10134 here:
http://defect.opensolaris.org/bz/show_bug.cgi?id=10134

See the comments there for a workaround.
Comment 18 Malachi de AElfweald 2009-08-11 15:59:15 UTC
root@serveris:~# dtrace -n ':::ata_detach:entry{stack();}'
dtrace: description ':::ata_detach:entry' matched 1 probe

no further output.  I'll run it in the background and cache it to a file for a
few days to see if it shows up
Comment 19 David Comay 2009-08-11 16:03:22 UTC
As Kerry pointed out, communication can take place even with this in a
TRACKEDINBUGSTER so I'm restoring that.
Comment 20 Andy 2009-08-18 21:19:53 UTC
Back from Holiday and wondering whether the reboot information is still
required?  I've run the dtrace command and get a similar lack of output.
Comment 21 Malachi de AElfweald 2009-08-18 21:44:23 UTC
Unfortunately, all I managed to get prior to the ssh connection closing on me
(I was no on site) was:

dtrace: description ':::ata_detach:entry' matched 1 probe

I forgot to run it again after I got home.
Comment 22 Kerry Shu 2009-08-18 22:11:47 UTC
(In reply to comment #20)
> Back from Holiday and wondering whether the reboot information is still
> required?  I've run the dtrace command and get a similar lack of output.

Yes, I still need the reboot info(see my comment 8). I suggested to use dtrace
just in case such ata message can also be seen after system is up:
  ide (ata) instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu
0

BTW, just found my typo in comment 13, I actually meant "inconvenient". :-)
Comment 23 Andy 2009-08-18 22:31:47 UTC
Ok I'll try and pull the reboot info in the next couple of days.  I'm also
going to disconnect the IDE0 (CDROM) as a quick and dirty test to see if it
stops the box probing IDE1 as well.
Comment 24 Malachi de AElfweald 2009-08-18 22:54:38 UTC
I was trying to find some more info to see if anyone else had a similar problem
and came across these:

Although Bug ID 6831562 is talking about a different issue, the logs seem to
show the same behavior as well.

This one also specifies the same errors (for pcplusmp instead of xVM_psm)
https://rebby.com/blog.php?detail=35
In their case they said they fixed it by adding this to /etc/system
set pcplusmp:apic_use_acpi=0



Also, looking at my attached prtconf, I saw this:
There was reference to: ide, instance #0
but not to an instance #1 (which it is trying to bind).
I did see this though:
            ide (driver not attached)
                Hardware properties:
                    name='compatibility-mode' type=int items=1
                        value=00000001
                    name='reg' type=int items=1
                        value=00000001




Then while trying to get all this data (running fmtopo, prtdiag, prtconf, etc),
I noticed that the dtrace thing finally kicked off!

malachi@serveris[0]:~ % cat ata_test.out 
dtrace: description ':::ata_detach:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1  65477                 ata_detach:entry 
              ata`ata_attach+0x1f3
              genunix`devi_attach+0x80
              genunix`attach_node+0x95
              genunix`i_ndi_config_node+0xa5
              genunix`i_ddi_attachchild+0x40
              genunix`devi_attach_node+0xac
              genunix`config_immediate_children+0xd5
              genunix`devi_config_common+0xa6
              genunix`mt_config_thread+0x53
              unix`thread_start+0x8

  0  65477                 ata_detach:entry 
              ata`ata_attach+0x1f3
              genunix`devi_attach+0x80
              genunix`attach_node+0x95
              genunix`i_ndi_config_node+0xa5
              genunix`i_ddi_attachchild+0x40
              genunix`devi_attach_node+0xac
              genunix`config_immediate_children+0xd5
              genunix`devi_config_common+0xa6
              genunix`mt_config_thread+0x53
              unix`thread_start+0x8

  1  65477                 ata_detach:entry 
              ata`ata_attach+0x1f3
              genunix`devi_attach+0x80
              genunix`attach_node+0x95
              genunix`i_ndi_config_node+0xa5
              genunix`i_ddi_attachchild+0x40
              genunix`devi_attach_node+0xac
              genunix`config_immediate_children+0xd5
              genunix`devi_config_common+0xa6
              genunix`mt_config_thread+0x53
              unix`thread_start+0x8
Comment 25 Malachi de AElfweald 2009-08-18 22:55:45 UTC
This is from the same time frame:


Aug 18 15:50:00 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata)
instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 3
Aug 18 15:50:24 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata)
instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 0
Aug 18 15:51:00 serveris xpv_psm: [ID 805372 kern.info] xVM_psm: ide (ata)
instance 1 irq 0xf vector 0x98 ioapic 0x4 intin 0xf is bound to cpu 1
Comment 26 Malachi de AElfweald 2009-08-18 22:56:59 UTC
This will consistently add another line into dmesg and the dtrace output:

pfexec /usr/lib/fm/fmd/fmtopo -V
Comment 27 Kerry Shu 2009-08-19 00:13:38 UTC
Great, thanks very much for the info!!!

Could you try further with this dtrace command?
# dtrace -n 'ata_detach:entry{stack(); printf("pid = %d",pid); ustack();}'

Also after we get the process id# above, we can figure out the application by
# ps -Aef |grep pid#
Comment 28 Malachi de AElfweald 2009-08-19 00:35:17 UTC
malachi@serveris[0]:~ % cat ata_test.out
dtrace: description 'ata_detach:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  2  65477                 ata_detach:entry
              ata`ata_attach+0x1f3
              genunix`devi_attach+0x80
              genunix`attach_node+0x95
              genunix`i_ndi_config_node+0xa5
              genunix`i_ddi_attachchild+0x40
              genunix`devi_attach_node+0xac
              genunix`config_immediate_children+0xd5
              genunix`devi_config_common+0xa6
              genunix`mt_config_thread+0x53
              unix`thread_start+0x8
pid = 0
Comment 29 Malachi de AElfweald 2009-08-19 00:54:46 UTC
ppid is also 0
Comment 30 Malachi de AElfweald 2009-08-19 01:16:20 UTC
using $pid and $ppid instead gives the pid of the dtrace process and the shell
that launched it
Comment 31 Malachi de AElfweald 2009-08-19 01:57:18 UTC
Created an attachment (id=2455) [details]
dtrace of ata_detach + genunix

I am not sure if this is going to help, but since I couldn't track the PID I
was trying to track where the thread was being started from...

Used this script... bear with it, it's my first attempt at DTrace:

#!/usr/sbin/dtrace -s
:::ata_detach:entry
/pid != $pid/
{
    printf("Starting ata_detach\n");
    stack(); 
    ustack();
    printf("Ending ata_detach\n");
}

io:genunix::
{
    printf("Starting genunix\n");
    stack();
    ustack();
    printf("Ending genunix\n");
}
Comment 32 Kerry Shu 2009-08-19 23:59:52 UTC
Hi Malachi,

I'm going through Solaris code to try to find who initiates the 
mt_config_thread(). Currently, I suspect it's related to
devfsadm or DR. Could you help to do more experiments?

1. disable devfsadm daemon
# svcs -a|grep sysevent
# svcsadm disable svc:/system/sysevent:default
# svcs -a|grep sysevent

2. If you can still see the warning message after that, can you help to
log into console mode, so that we can enable kmdb?
# mdb -K
 [0]> ata_detach:b
 [0]> :c

when the breakpoint is hit, you'll get to kmdb again, then
pls check process info by below:
 [0]> ::stacks -c mt_config_fini
 [0]> [threadp]::print -at kthread_t t_procp
 [0]> [procp]::ps

For example:
 [0]> ::stacks -c mt_config_fini
THREAD           STATE    SOBJ                COUNT
ffffff00051bbc60 SLEEP    CV                      1
                 swtch+0x147
                 cv_wait+0x61
                 mt_config_fini+0x2f
                 config_grand_children+0x3d
                 devi_config_common+0xda
                 mt_config_thread+0x53
                 thread_start+8

ffffff014a4b6c20 SLEEP    CV                      1
                 swtch+0x147
                 cv_wait+0x61
                 mt_config_fini+0x2f
                 config_grand_children+0x3d
                 devi_config_common+0xda
                 ndi_devi_config_driver+0x1d
                 devinfo`di_ioctl+0x1e7
                 cdev_ioctl+0x45
                 specfs`spec_ioctl+0x5a
                 fop_ioctl+0x7b
                 ioctl+0x18e
                 _sys_sysenter_post_swapgs+0x149

[0]> ffffff00051bbc60::print -at kthread_t t_procp
ffffff00051bbde0 struct proc *t_procp = p0
[0]> p0::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      0      0      0      0      0 0x00000001 fffffffffbc2e370 sched

[0]> ffffff014a4b6c20::print -at kthread_t t_procp
ffffff014a4b6da0 struct proc *t_procp = 0xffffff0147e04008
[0]> 0xffffff0147e04008::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R     81      1     81     81      0 0x42000000 ffffff0147e04008 devfsadm
[0]>

[0]> :z;:c
-> this will clear the breakpoint and continue the OS running.

Thanks,
Kerry
Comment 33 Malachi de AElfweald 2009-08-20 00:17:53 UTC
Hi Kerry,

I tried disabling it and it still showed up.  I went ahead and re-enabled it as
I didn't know what side effects it might have.

Since I am not currently onsite I will have to get back to you regarding the
mdb bit.
Comment 34 Malachi de AElfweald 2009-08-20 03:09:06 UTC
I tried doing 'mdb -K' but it complained and said I had to do 'mdb -KF'
so I tried that and it locked up the OS
so, had to reboot after all (unfortunately without being able to shutdown any
of the domains cleanly)

One of my grub options included '-kd -v -m verbose' so I used that and got into
kmdb

[0]> ata_detach:b
kmdb: failed to dereference symbol: unknown symbol name
[0]> :b ata`ata_detach
[0]> :c
that started booting and then asked me to go into maintenance mode
after entering maintenance mode, I thought I would try again so I did 'mdb -K'
I can't remember if it made be use the -KF that time or not
[0]> ata_detach:b
kmdb: failed to dereference symbol: unknown symbol name
[0]> :b ata`ata_detach
[0]> :c
kmdb: failed to arm breakpoint at 0: no mapping for address
kmdb: failed to arm breakpoint at 0: no mapping for address

I checked dmesg (including after running fmtopo) and no NEW failures were being
appended.... which tells me that whatever the problem is is not loaded by the
time I get to maintenance mode.  I assume it fails to set the breakpoint
because the ata module is not loaded yet.

I rebooted and am trying to get back into the system... so I can't try anything
else for at least 4 hours.
Comment 35 Malachi de AElfweald 2009-08-22 01:56:41 UTC
Special thanks to Kerry Shu for working extensively with me offline; it is now
fixed.  From sending email saying that I am rebooting to sending another saying
I was back up took less than 8 minutes.


#!/bin/bash
for snapshot in `zfs list -t snapshot | grep auto-snap | awk '{print $1}'`; do
  zfs destroy $snapshot
done
Comment 36 Kerry Shu 2009-08-22 03:51:22 UTC
That's great, Malachi, this confirms this issue is a dup of this ZFS issue:
  http://bugs.opensolaris.org/view_bug.do?bug_id=6761786