
19.03.2008 00:23

VMware Update Manager - The good, the bad, the ugly

Today I wanted to try out VMware's shiny new Update Manager for VI3 (ESX 3.5, Virtual Center 2.5) because I had heard so many cool things about it at VMworld. Unfortunately, my expectations of this product were much higher than what I got out of it. Let me summarize what they should improve in the next version.

I have three ESX hosts: one "power" machine (dual core, each CPU about 3 GHz, 16 GB RAM) and two smaller machines of equal size (dual core, each CPU about 3 GHz, 8 GB RAM). Previously I always had to manually hot-migrate (using VMotion) my VMs from the ESX host to be updated to the other ESX hosts, which would be upgraded later in the process. After applying all the updates, I had to manually move them back, update the next server, and so on. VMware Update Manager claims to do exactly that for your virtual infrastructure, making updates a lot easier because they can even be scheduled for times when the server load is low, such as weekends.
Well, thank god I didn't try the scheduling feature but was sitting in front of my Virtual Center when trying Update Manager.
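For the curious: the manual rotation I just described boils down to a simple loop. Here's a rough Python sketch of it (purely illustrative - the host/VM names and helpers are made up, this is not any VMware API):

```python
def least_loaded(hosts, loads):
    """Pick the host currently carrying the fewest VMs."""
    return min(hosts, key=lambda h: loads[h])

def rolling_update(hosts, vms_on):
    """hosts: list of host names; vms_on: dict mapping host -> list of VM names.

    For each host in turn: hot-migrate its VMs to the remaining hosts
    (always onto the currently least-loaded one), then patch the now-empty
    host and move on to the next.
    """
    patched = []
    for target in list(hosts):
        others = [h for h in hosts if h != target]
        for vm in list(vms_on[target]):
            dest = least_loaded(others, {h: len(vms_on[h]) for h in others})
            vms_on[target].remove(vm)   # "VMotion" the VM away...
            vms_on[dest].append(vm)     # ...to the least-loaded neighbour
        patched.append(target)          # host is empty: apply updates here
    return patched, vms_on
```

Tedious when done by hand for every patch cycle - which is exactly what Update Manager promises to automate.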
To start the whole update procedure, I clicked "Remediate..." on my ESX cluster, and Update Manager picked one of my ESX hosts (oddly, not the one with the lowest load), started opening firewall ports, installed some components, and finally wanted to put the host into maintenance mode. Because I have the DRS automation level set to "partially automated" (I'm paranoid, you know - I don't even trust my virtual infrastructure), entering maintenance mode would have timed out: the DRS migrations needed to move all the VMs off the ESX host being updated were generated only as recommendations. I had to manually apply the generated recommendations, and only then did it start migrating the VMs away.
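The failure mode is easy to model. Here's a toy sketch (illustrative only, not real DRS or Update Manager logic) of why entering maintenance mode times out under "partially automated" DRS: recommendations get generated, but nobody applies them, so the host never drains:

```python
def enter_maintenance_mode(vms_on_host, drs_fully_automated, timeout_ticks=5):
    """Return whether the host drained before the task timed out."""
    for _ in range(timeout_ticks):
        if not vms_on_host:
            return "entered maintenance mode"
        # Each tick, DRS generates a recommendation to migrate one VM off...
        if drs_fully_automated:
            vms_on_host.pop()  # ...and applies it itself
        # ...but under "partially automated" the recommendation just sits
        # there waiting for an admin to click "Apply": nothing happens.
    return "entered maintenance mode" if not vms_on_host else "timed out"
```

With `drs_fully_automated=False` this never drains the host; that's the timeout I ran into.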

My VMotion network is currently connected at only 100 Mbit/s, which I know is not recommended by VMware, but it works (migrations take longer, which doesn't bother me that much). _BUT_ because migrations take longer, the "put $esxhost into maintenance mode" task times out, and what's even worse: the parent task of the update process (called "Remediate Entity") stalls at a certain percentage and stops working. You can't cancel it, you can't restart it; in fact, trying to start a new remediation only makes things worse.
Another thing that isn't very smart is the automatically generated DRS recommendations. When Update Manager tries to take one ESX host out of duty, it scans the cluster for available resources; in my case, with two additional ESX servers under average (low) load, it chose only _ONE_ of them to host all the VMs to be migrated. Bad idea. During the migration, the load on ESX host A started to increase, and DRS moved machines from ESX A over to ESX B to "balance average CPU loads", as it said... Well, what about generating new DRS recommendations after having migrated two VMs off the target ESX server? Things might have changed by then... Never thought about that? Don't worry, I already know.
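To make the point concrete, here's a small sketch (made-up numbers, nothing like real DRS internals) contrasting the two planning strategies: deciding all migrations against a single load snapshot versus re-evaluating host loads after accounting for each migration:

```python
def plan_one_shot(vm_loads, host_loads):
    """What I saw: pick the least-loaded host once, send every VM there."""
    target = min(host_loads, key=host_loads.get)
    return [(vm, target) for vm in vm_loads]

def plan_iterative(vm_loads, host_loads):
    """What I'd want: re-pick the target after each planned migration,
    adding the VM's load to the chosen host's running total."""
    loads = dict(host_loads)
    plan = []
    for vm, load in vm_loads.items():
        target = min(loads, key=loads.get)
        loads[target] += load
        plan.append((vm, target))
    return plan
```

With two equally capable spare hosts, the one-shot plan piles everything onto one of them, while the iterative plan spreads the VMs out - which is all I was asking for.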

Anyway, what made the stalled Update Manager tasks disappear was to manually kill the update-manager.exe service on the Virtual Center server (stopping the service also timed out), wait a few seconds, start it again, and wait for Virtual Center to reinitialize the Update Manager extension. If it doesn't, set it to "Enable" manually, and all the previously stalled tasks should now terminate with "VMware Update Manager had a failure". That's good, because now you can start over with patching your VI.

After manually migrating all machines off my target ESX host, I put it into maintenance mode by hand and started a new remediation at the cluster level to see whether Update Manager would be clever enough to choose the host already in maintenance mode, but it didn't (OK, that might actually be a good idea; you never know _WHY_ a host is currently in maintenance mode). What puzzled me was VMware Update Manager's overestimation of its capabilities: with one of my hosts in maintenance mode (and therefore no longer an active part of the cluster), the other two ESX hosts were carrying all my virtual machines and were under quite some load trying to cope with that. As if that weren't enough, Update Manager then tried to consolidate the two remaining ESX servers onto one in order to free the other for applying updates.
VMware, could you please ask me whether I really want to do this? Doing it causes my whole VI to simply stop working, because one ESX host can't handle the load. It's simply ridiculous to start the update process when resources are that low...

So, after my first date with VMware Update Manager I decided to not trust it as much as I would have liked to.

What worked for me was to manually (!!) migrate all the VMs off one of my ESX servers onto the other two (I used a very, very complex algorithm to figure out which VMs to move onto which ESX server to "balance the average CPU loads") and then start the remediation of the critical updates on the now-empty ESX server.

While writing this, I'm giving Update Manager a second chance to prove that it could be my friend. To make things easier, I changed the DRS automation level to "fully automated" and bingo, it worked this time. Update Manager was able to put the host into maintenance mode and did a fairly good job of migrating the VMs to the other hosts. It is currently installing the updates, and maybe afterwards I'll do some tests on VMware HA (isolation and such seems interesting...).

Long story short: Do extensive testing on VMware Update Manager before letting it do its work unattended.
Comments added earlier to http://tuxx-home.at/archives/2008/03/19/T00_23_10/index.html:
Guest on 2008-03-28 21:21:32 wrote:
I ran into very similar issues and ended up manually migrating my hosts as well. I couldn't get VMware Update Manager to work without doing this. At least it's still easier than doing the updates manually... <a href="http://universitytechnology.blogspot.com/2008/03/vmware-update-manager-esx-host-upgrade.html">VMWare Update Manager - ESX Host upgrade</a>
Guest on 2008-06-10 20:05:01 wrote:
I have used Update Manager since the first patch of 3.5, and in my experience it works well if I use it to patch individual ESX hosts one at a time in a cluster, not the whole cluster at once. The only time I've updated a cluster was on a set of hosts that did not have any guests, which worked fine, but I could see there would be a problem with trying to remediate a cluster. Recently, because I was busy, I skipped one set of VMware patches and waited until the next set to apply via Update Manager. What I discovered is that Update Manager can't handle multiple patch levels and suspends at a certain percentage. I had to stop and restart the Update Manager service to get it to abort the update.

A workaround I found was to start by putting the ESX host into maintenance mode manually and letting DRS handle the VMotions off the host, since I have the cluster DRS set to fully automated. The guests were VMotioned off without issue and no guests were VMotioned back to the host (this can be a problem if you have DRS in full auto and try to manually VMotion guests off a host). Once the host is cleared of guests, I scan the HOST for updates and remediate the host. It runs the first set of patches and tries to reboot the ESX server. For some reason the VC loses its connection with the ESX host and the Update Manager task errors out. I manually log into the ESX host and reboot. Once the host is back up, I rescan for updates and run remediate again. The second application of patches worked without any issues and rebooted the host. After the host is back up, it is still in maintenance mode. I confirm the patch level, exit the host from maintenance mode, and let DRS decide which guests to VMotion. I have not tried any automated patching, but that will be my aim with the next set of patches.
Guest on 2008-10-27 12:57:26 wrote:
On a related note, we had a problem today which appears to have been caused by VUM: none of the VMs that were currently powered off would power on, failing with an 'unable to access a file since it is locked' message. The VM log file seemed to refer to the VMDK disk file.

Initial thoughts were a licence issue, but since the occasional VM could be powered on, probably not; we also checked to make sure disks hadn't got connected to the wrong VM, etc. In the end it came down to the VUM service on the VI Center server being stuck in the 'stopping' state - basically it had hung after being restarted. We restarted the VI Center server machine (had to power cycle it), although possibly just killing the VUM service process would have done it. After that we were able to power up VMs as normal. A bit worrying...
