Bug 4788 - VMWare Fusion Host: OSOL Guest: Time-of-day chip unresponsive; dead batteries?
Status: CLOSED TRACKEDINBUGSTER
Product: opensolaris
Component: kernel
Version: 200811
Platform: VMWare/i386 Mac OS
Importance: P2 major
Target Milestone: in-200811
Assigned To: Watcher account for OpenSolaris kernel bugs
Whiteboard: BugsterCR=6741572
Reported: 2008-11-10 08:28 UTC by Peter Cudhea
Modified: 2009-04-03 10:50 UTC

Description Peter Cudhea 2008-11-10 08:28:34 UTC
MacBook Pro (10.5.5) running VMWare Fusion 2.0 (as host).  OpenSolaris 2008.05
or 2008.11 as guest.

I have had this problem with both 2008.05 and 2008.11 (101a rc1b).  
OpenSolaris comes up thinking the time of day is December 26, 1986.   In the
/var/adm/messages file there is a line:
WARNING: Time-of-day chip unresponsive; dead batteries?

I got this failure today even on a brand new virtual machine.   Not clear if it
is a VMWare or an OpenSolaris issue.   I haven't seen other people discussing
it for example on indiana-discuss, though.

This makes it impossible for me to use OpenSolaris as my default work
environment on my Macbook Pro.
Comment 1 David Comay 2008-11-10 17:23:49 UTC
I believe this is a VMware issue.  I've seen it from time to time but it seems
inconsistent.

Can the Reporter set their clock back to the present time or does it keep going
back to the epoch?
Comment 2 Ross 2008-12-03 02:18:22 UTC
This happens for me on a full VMware ESX 3.5 server, and after googling, many
other people seem to have seen this too.

The general consensus is that Solaris 10U3 and 10U5 work fine, as does
OpenSolaris 2008.05.  However, builds from snv_91 onwards have been reported as
showing this problem:

http://forums.sun.com/thread.jspa?threadID=5310441
http://jp.opensolaris.org/jive/thread.jspa?threadID=70642&tstart=0
Comment 3 David Comay 2008-12-03 23:25:29 UTC
*** Bug 5358 has been marked as a duplicate of this bug. ***
Comment 4 David Comay 2008-12-03 23:25:38 UTC
*** Bug 5587 has been marked as a duplicate of this bug. ***
Comment 5 Ross 2008-12-04 07:29:31 UTC
I should maybe have said that this is quite a problem if you're running
OpenSolaris as a storage server.

We're using the CIFS service to host files to our domain from an OpenSolaris
server running within vmware ESX.  However because of this time clock problem,
Solaris gets the wrong time every time it reboots, which breaks Kerberos
authentication and renders the files inaccessible.

We have to manually connect to the server, issue a ntpdate command and restart
the services to get it running again, as mentioned here:
http://www.opensolaris.org/jive/thread.jspa?threadID=80139&tstart=0
Comment 6 David Comay 2008-12-08 09:01:03 UTC
*** Bug 5690 has been marked as a duplicate of this bug. ***
Comment 7 Peter Cudhea 2008-12-08 09:11:52 UTC
(In reply to comment #1)
> I believe this is a VMware issue.  I've seen it from time to time but it seems
> inconsistent.
> 
> Can the Reporter set their clock back to the present time or does it keep going
> back to the epoch?

No, I cannot persistently set the clock back to the present time.   I can
manually set the time back to the present time, yet each time I reboot it is
back to Dec 27, 1969.   Not exactly the epoch, but pretty close.


By the way, David, Bug 5358 does not look to me like a duplicate of this bug:
a) it refers to virtual box, not VMware
b) Time is off by minutes not by decades.

Could you clarify why you think the two bugs are related?

Is anyone looking into this issue?
Comment 8 Moinak Ghosh 2008-12-08 10:15:03 UTC
In my case the time gets reset to the epoch on every reboot. UFS supplies a
last-mounted time which can be used to initialize the TOD clock in case the TOD
cannot be read from the machine. AFAIK ZFS does not provide this, so the clock
gets reset.
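
The fallback described in this comment can be sketched roughly as follows. This is a hypothetical illustration, not the Solaris code: the function name choose_boot_time, the NO_HINT sentinel, and the choice of falling back to 0 (the epoch) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel: the root filesystem offers no last-mounted time. */
#define NO_HINT ((int64_t)-1)

/*
 * Pick the time to start the system clock from (seconds since the epoch).
 * tod_ok       - whether the time-of-day chip was judged readable
 * tod_secs     - the time read from the chip (only meaningful if tod_ok)
 * fs_hint_secs - last-mounted time from the filesystem, or NO_HINT
 */
static int64_t choose_boot_time(bool tod_ok, int64_t tod_secs,
                                int64_t fs_hint_secs)
{
    /* The hint is either NO_HINT or a non-negative timestamp. */
    assert(fs_hint_secs >= NO_HINT);

    if (tod_ok)
        return tod_secs;        /* normal case: trust the chip */
    if (fs_hint_secs != NO_HINT)
        return fs_hint_secs;    /* stale, but close to the present */
    return 0;                   /* nothing to go on: clock starts at the epoch */
}
```

With a filesystem that supplies a hint, an unreadable chip costs only the time elapsed since the last mount; without one, the clock falls all the way back, which matches the symptom reported here.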
Comment 9 David Comay 2008-12-08 11:24:35 UTC
Peter, I believe Rob is using VMware in 5358 - there's just an additional
comment about VirtualBox in the bug report but the Summary and initial comment
mention VMware.

I'll work on getting someone assigned to look at this.
Comment 10 David Comay 2008-12-11 18:34:17 UTC
My suspicion here is that starting in build 88, zfs_mountroot() began calling
clkset(-1) and that's reacting badly when being a guest under VMware.
Comment 11 Dan Mick 2008-12-12 23:13:38 UTC
I've tested under VMWare Server on Ubuntu, and see the problem every boot.

The problem is that Solaris believes it cannot read the RTC; reading CMOS
address 0xd (register D) returns 0, which does not have RTC_VRT set, which
means the virtual RTC didn't respond "I have a valid RAM and time".

I don't know why this would have changed recently.  There have been changes
to the TOD support, but none that seem as though they would affect this
basic functionality.
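
As a rough illustration of the check described above, here is a minimal sketch that runs against a simulated CMOS image rather than real port I/O. The names cmos_read and rtc_reports_valid and the 128-byte bank size are assumptions for the example; RTC_VRT is taken to be bit 7 (0x80) of register D, as on MC146818-style RTCs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMOS_SIZE 128   /* size of the simulated CMOS bank (assumption) */
#define RTC_D     0x0d  /* status register D */
#define RTC_VRT   0x80  /* bit 7 of register D: "valid RAM and time" */

/* Read one byte from a simulated CMOS image (stand-in for real port I/O). */
static uint8_t cmos_read(const uint8_t *cmos, unsigned reg)
{
    assert(reg < CMOS_SIZE);
    return cmos[reg];
}

/* Strict check: trust the RTC only if register D reports VRT. */
static bool rtc_reports_valid(const uint8_t *cmos)
{
    return (cmos_read(cmos, RTC_D) & RTC_VRT) != 0;
}
```

A register D value of 0, as seen under VMware here, fails this check regardless of what the time registers contain.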
Comment 12 Rudolf Kutina 2008-12-12 23:27:05 UTC
During "bootadm update-archive -v" I see a message about a missing /etc/rtc_config?
Comment 13 Dan Mick 2008-12-16 17:00:12 UTC
So, somewhere between VMWare Workstation 5.5.8 and VMWare Server 6.5, this
behavior changed: the CMOS Register D, which has bit 7 which says "the RTC
is valid and working", now reads as 0, which makes Solaris give up on the RTC.

The code in Solaris that validates that register is old, and the behavior
has clearly changed in VMWare.  Presumably other OSes don't demand that bit
be set.  So, a workaround could be coded, but I'd like to try to ask VMWare
if they know about this issue first.
Comment 14 Dan Mick 2008-12-16 18:11:50 UTC
I'm using VMware Server 2.0.0, not 6.5.
Comment 15 Dan Mick 2008-12-16 22:41:38 UTC
Experimenting, patching the 'unix' binary so that the absence of bit 7
is ignored makes the kernel come up without complaint, and the time is correct.
So this is merely a missing bit in the RTC emulation from VMware, apparently
introduced somewhat recently (later than Workstation 5.5.8, anyway).
Comment 16 Andrei Dorofeev 2008-12-17 22:38:59 UTC
(In reply to comment #13)
> So, somewhere between VMWare Workstation 5.5.8 and VMWare Server 6.5, this
> behavior changed: the CMOS Register D, which has bit 7 which says "the RTC
> is valid and working", now reads as 0, which makes Solaris give up on the RTC.
> 
> The code in Solaris that validates that register is old, and the behavior
> has clearly changed in VMWare.  Presumably other OSes don't demand that bit
> be set.  So, a workaround could be coded, but I'd like to try to ask VMWare
> if they know about this issue first.

Yes, we already know about this issue and vmm patch is in the works.  Stay
tuned.
Comment 17 Dan Mick 2008-12-18 16:08:39 UTC
Is it known which versions of the various VMs are affected?
Comment 18 Andrei Dorofeev 2008-12-18 16:25:18 UTC
VMware is planning to fix this bug in next Fusion release -- v2.0.2.
For ESX, it will be fixed in the next version of ESX, and patched for ESX3.5.
For Workstation, the fix is planned for WS6.5.1.
Comment 19 David Comay 2008-12-18 17:27:29 UTC
Andrei, thanks for the update - it's much appreciated.

Is there any known workaround on the VMware-side that's available?
Comment 20 Andrei Dorofeev 2008-12-18 17:41:19 UTC
I am afraid that there is none.
Comment 21 Rudolf Kutina 2008-12-19 00:42:26 UTC
1. I found a blog post which says that deleting the nvram file before starting
the VM sometimes helps for OpenSolaris.

I tried it once and it worked.

2. Sometimes after a clean shutdown of OpenSolaris and a restart I see a VMware
Player 2.5.1 message that the nvram is corrupted and default values are forced
instead?

My feeling is that this situation happens when OpenSolaris stops monitoring the
time of day because the clock was changed in a large step (yes, no VMware Tools
installed).
Comment 22 David Comay 2008-12-19 10:44:12 UTC
Andrei, is there a "bugid" or VMware equivalent that's being used to track
this?
Comment 23 Andrei Dorofeev 2008-12-19 12:04:40 UTC
(In reply to comment #22)
> Andrei, is there a "bugid" or VMware equivalent that's being used to track
> this?

ESX: 358798
Fusion: 357796
Workstation: 354249
Comment 24 Tim Mann 2008-12-19 17:18:53 UTC
(In reply to comment #13)
> So, somewhere between VMWare Workstation 5.5.8 and VMWare Server 6.5, this
> behavior changed: the CMOS Register D, which has bit 7 which says "the RTC
> is valid and working", now reads as 0, which makes Solaris give up on the RTC.
> 
> The code in Solaris that validates that register is old, and the behavior
> has clearly changed in VMWare.  Presumably other OSes don't demand that bit
> be set.  So, a workaround could be coded, but I'd like to try to ask VMWare
> if they know about this issue first.

The VMware behavior hasn't changed; it's just that the bug is a bit more subtle
than what you describe above.

The VMware behavior has always been that register D bit 7 is initialized to 1,
but if the guest writes to this register, all 8 bits are written, so the guest
can change that 1 to a zero.  We were unaware until recently that this behavior
is wrong and bit 7 should be hardwired to read back as 1 no matter what.  We've
fixed that internally and are trying to get it backported to the various
branches that we still periodically release patches for.

At some point between Solaris 10 and the current version, Solaris changed. 
Previously it did not write to register D, but now it writes a 0 there.  That
tickled the VMware bug.

The write to register D comes about because the low-order bits of the register
are (at least on some systems, VMware's included) the optional day-of-month
match field for the CMOS alarm clock interrupt.  The ACPI FADT table gives the
address of this field.  At some point Solaris must have been educated about
this optional field.  Solaris 10 wrote the hour, minute, and second alarm
fields, but not the day field.
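
The difference between the old and the corrected emulation can be modeled with a toy virtual RTC. This is only a sketch of the behavior described in this comment, not VMware code; struct vrtc, the vrt_hardwired flag, and the function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMOS_SIZE 128   /* size of the simulated CMOS bank (assumption) */
#define RTC_D     0x0d  /* status register D */
#define RTC_VRT   0x80  /* bit 7 of register D */

struct vrtc {
    uint8_t ram[CMOS_SIZE];
    bool    vrt_hardwired;  /* false: old behavior, true: corrected behavior */
};

/* Populate a fresh CMOS image with defaults; bit 7 of register D starts as 1. */
static void vrtc_init(struct vrtc *v, bool vrt_hardwired)
{
    for (unsigned i = 0; i < CMOS_SIZE; i++)
        v->ram[i] = 0;
    v->ram[RTC_D] = RTC_VRT;
    v->vrt_hardwired = vrt_hardwired;
}

/* Guest write: all 8 bits are stored, including bit 7 of register D. */
static void vrtc_write(struct vrtc *v, unsigned reg, uint8_t val)
{
    assert(reg < CMOS_SIZE);
    v->ram[reg] = val;
}

/* Guest read: the corrected model forces bit 7 of register D to read as 1. */
static uint8_t vrtc_read(const struct vrtc *v, unsigned reg)
{
    assert(reg < CMOS_SIZE);
    uint8_t val = v->ram[reg];
    if (reg == RTC_D && v->vrt_hardwired)
        val |= RTC_VRT;
    return val;
}
```

Writing 0 to register D and reading it back yields 0 in the old model and 0x80 in the corrected one, which is the whole difference a guest's validity check observes.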
Comment 25 Tim Mann 2008-12-19 17:30:34 UTC
p.s.  To clarify one point: when I said CMOS register D bit 7 is "initialized"
to 1, that initialization occurs when the VM's nvram file (which holds the
persistent CMOS contents) is created.  The nvram file is created and populated
with default values the first time you power on a new VM.  Thereafter, changes
made to the CMOS by the guest are persisted by being written back to the file. 
However, you can power off a VM, delete its nvram file, and then power it on,
and that will cause a fresh nvram to be created with default values again.
Comment 26 Dan Mick 2008-12-19 17:54:26 UTC
Awesome explanation, Tim, thanks...and yes, Solaris has relatively-recently
started writing the alarm field (which, as you say, uses CMOS offsets out of
the FADT, and not directly, which is why I missed those updates at first)
for suspend/resume and time-based wakeup.  The low-level worker routines 
always write all the fields, whether a wakeup timer is set or not.
Doubtless that's what tickled the bug.
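
To make the interaction concrete, here is a small sketch of an alarm writer that always stores every field, next to one possible guest-side guard. Nothing in this thread says the guard is what went into Solaris; alarm_write_all, alarm_write_guarded, and struct rtc_alarm are hypothetical names, and the plain byte array stands in for the pre-fix VMware CMOS that stores all 8 bits of a write. The day-alarm index is passed in the way a FADT-supplied offset would be, with 0 meaning the field is not supported.

```c
#include <assert.h>
#include <stdint.h>

#define CMOS_SIZE 128   /* size of the simulated CMOS bank (assumption) */
#define RTC_ASEC  0x01  /* seconds alarm */
#define RTC_AMIN  0x03  /* minutes alarm */
#define RTC_AHRS  0x05  /* hours alarm */
#define RTC_D     0x0d  /* status register D */
#define RTC_VRT   0x80  /* bit 7 of register D */

struct rtc_alarm { uint8_t sec, min, hour, day; };

/* Always write every alarm field, whether or not a wakeup is wanted. */
static void alarm_write_all(uint8_t *cmos, unsigned day_alrm,
                            const struct rtc_alarm *a)
{
    assert(day_alrm < CMOS_SIZE);
    cmos[RTC_ASEC] = a->sec;
    cmos[RTC_AMIN] = a->min;
    cmos[RTC_AHRS] = a->hour;
    if (day_alrm != 0)              /* 0 means "no day-of-month alarm" */
        cmos[day_alrm] = a->day;    /* on register D this can clear bit 7 */
}

/* Hypothetical guard: keep the VRT bit when the day alarm lives in register D. */
static void alarm_write_guarded(uint8_t *cmos, unsigned day_alrm,
                                const struct rtc_alarm *a)
{
    assert(day_alrm < CMOS_SIZE);
    cmos[RTC_ASEC] = a->sec;
    cmos[RTC_AMIN] = a->min;
    cmos[RTC_AHRS] = a->hour;
    if (day_alrm != 0) {
        uint8_t keep = (day_alrm == RTC_D) ? (uint8_t)(cmos[RTC_D] & RTC_VRT) : 0;
        cmos[day_alrm] = (uint8_t)(a->day | keep);
    }
}
```

With the day-alarm index equal to 0x0d and an all-zero alarm, the first routine leaves register D at 0, while the guarded one leaves bit 7 as it found it.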
Comment 27 Andrei Dorofeev 2008-12-19 22:07:01 UTC
(In reply to comment #23)
> ESX: 358798
> Fusion: 357796
> Workstation: 354249

Server 2.0: 359176
Comment 28 Rob_C 2009-02-09 19:22:15 UTC
(In reply to comment #19)
> Andrei, thanks for the update - it's much appreciated.
> 
> Is there any known workaround on the VMware-side that's available?

I posted Bug 5358 and it was duped to here ...

The "WORKaround" for 6.0.5 is to:

1. Right-Click your Windows Clock and get the Address of your Time Server.
2. Open [System][Admin][T&D] on OS 2009.06 and set the Time Server.
3. Disable the Time Server, then one-time sync.
4. Enable the Time Server and close the Pane.
5. Go back to WinXP and CTRL-ALT-DEL to activate Windows Task Manager.
6. Choose the [Processes] Tab and click on the "CPU" Header to sort
  by 'CPU Usage'. 
7. Locate vmware-vmx.exe and vmware.exe and set them to different
  Processor Affinity (if you have Multi-Core).
8. Now your time should be "better" (loses a couple of minutes in
  several hours). That is unacceptable, but much better.

Rob
Comment 29 Tim Mann 2009-02-10 12:16:21 UTC
Hmm, looking at bug 5358, I don't claim to understand it, but it doesn't look
anything like bug 4788.  Certainly the workaround offered for 5358 isn't
relevant for 4788.
Comment 30 David Comay 2009-02-11 10:10:47 UTC
Tim, the reason it was closed as a duplicate was because the system in question
had booted up with VMware and clearly the time-of-day was incorrect at the
start of the installation.  It seems very likely this is due to this bug - at
least, that's what I've seen with installs under Fusion.  But perhaps I
misunderstood what the Reporter was intending with bug 5358.

Rob, could you clarify the issue with bug 5358?  Do you see this issue when
VMware is not being used?
Comment 31 jeff 2009-02-24 14:12:38 UTC
I have encountered this clock problem in OpenSolaris 2008.11 running on VMware
3.5.  Are there any updates regarding the fix?  I understand a VMware patch
will be available at some point.
Comment 32 Dan Mick 2009-02-27 01:25:02 UTC
I've filed an RTI for the Solaris fix/workaround this morning, and hope to put
back soon for inclusion into snv_110/The Next Opensolaris Release.
Comment 33 Ross 2009-02-27 01:51:40 UTC
Great news, thanks for the update.

Good to know that we can rely on Sun to get these things fixed, even when
VMware are dragging their feet :-)

Is snv_110 still due around early March?  Am I right in thinking we should be
able to download the ISOs around mid-March?
Comment 34 Andrei Dorofeev 2009-02-27 10:39:05 UTC
(In reply to comment #33)
> Good to know that we can rely on Sun to get these things fixed, even when
> VMware are dragging their feet :-)

This bug was fixed in ESX3.5 codebase, and I'm trying to find out for you
if the patch was released yet.  Will update this bug as soon as I have that
info.  This bug was fixed in Fusion 2.0.2 which is available for download now.
Comment 35 Dan Mick 2009-02-27 19:15:43 UTC
I certainly don't mean to say that VMWare is responding slowly at all.  It just
struck me that fixing it from both ends still makes sense, and there are people
who will benefit immediately from this fix that will have less trouble getting
the new Solaris than getting the new VMware, plus it enables people who aren't
aware of the problem to never become aware. :)

As far as exactly when this will hit the streets: don't know the schedules, but
I have in fact put it into snv_110.  Mid-March probably isn't far off.
Comment 36 Andrei Dorofeev 2009-03-02 12:59:05 UTC
(In reply to comment #34)
> This bug was fixed in ESX3.5 codebase, and I'm trying to find out for you
> if the patch was released yet.  Will update this bug as soon as I have that
> info.  This bug was fixed in Fusion 2.0.2 which is available for download now.

This issue will be fixed in ESX3.5 U4.
Comment 37 Ross 2009-03-09 04:51:03 UTC
I don't suppose anybody has a rough eta for when VMware are likely to be
releasing 3.5 U4?
Comment 38 Richard Huddleston 2009-03-27 13:46:07 UTC
(In reply to comment #35)
> I certainly don't mean to say that VMWare is responding slowly at all.  It just
> struck me that fixing it from both ends still makes sense, and there are people
> who will benefit immediately from this fix that will have less trouble getting
> the new Solaris than getting the new VMware, plus it enables people who aren't
> aware of the problem to never become aware. :)
> 
> As far as exactly when this will hit the streets: don't know the schedules, but
> I have in fact put it into snv_110.  Mid-March probably isn't far off.

I was experiencing this bug on opensolaris running in esx 3.5 update 3.  I
installed the opensolaris snv 110 build and no longer see this bug, but am now
seeing issues with HAL not starting:
[ Mar 27 16:08:45 Executing start method ("/lib/svc/method/svc-hal start"). ]
hal failed to start: error 2
Perhaps it has something to do with bug: 6792302
http://bugs.opensolaris.org/view_bug.do?bug_id=6792302
if I ssh in, and do a svc clean restart on hal, it starts up immediately, and
the graphical display starts up and I can login.
Could my new issue be related to the bug fix for this bug?
Comment 39 Jürgen Keil 2009-03-31 12:31:05 UTC
*** Bug 7815 has been marked as a duplicate of this bug. ***
Comment 40 Rudolf Kutina 2009-04-01 02:31:55 UTC
VMware just released ESX/ESXi Update 4. Can we check that this issue is fixed
there and that it works for a stock CD-installed OpenSolaris 2008.11?

A fix on the ESX/ESXi platform is critical for us!
Comment 41 Jürgen Keil 2009-04-01 06:15:29 UTC
(In reply to comment #38)
> I installed the opensolaris svn 110 build and no longer see this bug, but am now
> seeing issues with HAL not starting:
> [ Mar 27 16:08:45 Executing start method ("/lib/svc/method/svc-hal start"). ]
> hal failed to start: error 2
> Perhaps it has something to do with bug: 6792302
> http://bugs.opensolaris.org/view_bug.do?bug_id=6792302

Do you have ntp enabled for your opensolaris snv 110 guest?

I think that the root cause for 6792302 will be fixed in 
build 111, see bug 6811294 "APIs like nanosleep() wakeup prematurely
when system time is changed".
Comment 42 Miroslav Osladil 2009-04-02 05:59:54 UTC
(In reply to comment #40)
ESX/ESXi 3.5.0 Update4 fixed my problem reported in duplicate bug 7815 with no
changes in Guest system.
Comment 43 Rudolf Kutina 2009-04-03 04:46:07 UTC
It looks like the issue is fixed in the new refresh of all VMware products; they
share the same updated virtualization core:

ESX 3.5 and ESXi Update 4 (build 153875)
VMware Workstation Version: 6.5.2 | 156735 - 03/31/09
VMware Server 2 Version 2.0.1 | 156745 - 03/31/09
VMware Fusion 2 for Mac Version 2.0.3 | 156731 - 04/02/09

I checked it on VMware Workstation Version 6.5.2 and I see the correct time
after reboot.
Comment 44 Andrei Dorofeev 2009-04-03 10:50:27 UTC
Closing this bug since this is now fixed in all VMware products.