How Azure uses machine learning to predict VM failures
One of the big advantages of the cloud is that you don't have to worry about managing hardware - or fixing it when it goes wrong, because hard drives and servers fail. In fact, hard drives are the most likely thing to fail in a cloud data centre: the question is not if, but when. Depending on which study you look at, it's anything from 20 percent of hard drives in storage systems reporting sector errors within two years, to 57 percent failing over six years. On a cloud service like Azure, that comes out to around 300 drives out of every million that could become faulty every day.
Storage clusters use hardware redundancy to avoid the problem, but for a server that's running virtual machines, a hard drive failing can't be worked around. In fact, the timeouts, volume size, sector and latency errors from a drive that's becoming unreliable can be just as bad as full failures because they create intermittent problems that are hard to diagnose - like file operations failing and VMs that don't respond - before the system eventually fails completely. Those kinds of underlying faults turn out to be responsible for a lot of major cloud outages, when they cause some critical service to become unreliable at just the wrong moment.
Azure automatically live-migrates VMs when hardware fails, and also moves workloads before rack maintenance, BIOS updates, and any upgrades to Windows Server that take longer than hot-patching (which pauses the VM for up to 15 seconds). This halves the time that VMs are unavailable after a failure.
Even better, new machine learning systems that predict when hard drives or entire cluster nodes are going to fail - whether that's drive failures, I/O latency issues, memory errors or CPU frequency issues - now make sure no new VMs are deployed onto that hardware, and live-migrate VMs before the failure happens. That avoids about a thousand hours of downtime a month for Azure VMs.
Smarter than SMART
Predicting failures is actually harder when only a few devices fail, because there's a very low probability of any specific drive being the one that fails - and too many false positives make Azure expensive to run, because hardware that's not failing would be out of use.
The Cloud Disk Error Forecasting system that Azure uses (built using Cosmos DB and AzureML) combines the standard SMART drive monitoring data with system events from Windows that suggest there's a problem with the disk, like paging and file system errors, problems collecting logs, dropped requests and unresponsive VMs. There are about 450 different pieces of data that might be relevant, but not everything that you expect to be helpful turns out to help the prediction: seek times don't help you spot failing hard drives, but if the number of reallocated sectors keeps going up, the drive is faulty.
On average, disk errors start showing up between 15 and 16 days before a drive fails, and in the last 7 days before it fails reallocated sectors triple and device resets go up tenfold.
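As a rough illustration of the kind of signal described above, the sketch below measures how much a cumulative counter has grown over the last seven days. The function name and the sample readings are hypothetical, and the tripling and tenfold figures are averages reported for failing drives, not decision thresholds used by Azure.

```python
from typing import Sequence


def window_growth(cumulative: Sequence[float], window: int = 7) -> float:
    """Ratio of the latest counter value to its value `window` days earlier.

    `cumulative` holds one reading per day, oldest first (hypothetical format).
    """
    if len(cumulative) <= window:
        raise ValueError("need more than `window` daily samples")
    earlier = cumulative[-1 - window]
    latest = cumulative[-1]
    if earlier == 0:
        # Avoid dividing by zero: any growth from zero counts as unbounded.
        return float("inf") if latest > 0 else 1.0
    return latest / earlier


# Hypothetical daily readings for one drive, oldest first.
reallocated_sectors = [8, 8, 8, 9, 10, 12, 15, 18, 21, 24]
device_resets = [1, 1, 1, 1, 2, 3, 5, 7, 9, 10]

sector_growth = window_growth(reallocated_sectors)  # 24 / 8
reset_growth = window_growth(device_resets)         # 10 / 1
print(f"reallocated sectors x{sector_growth:.1f}, device resets x{reset_growth:.1f}")
```

A real model would combine hundreds of such features rather than look at any single ratio.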
Behaviour and failure patterns vary from one drive manufacturer to another, and even between different models of hard drive from the same vendor. The telemetry for training the machine learning system has to be collected from different kinds of workloads, because that affects how quickly the failure is going to happen: if the VM is thrashing the disk, a drive with early signs of failure will fail fairly quickly, whereas the same drive in a server with a less disk-intensive workload could carry on working for weeks or months.
Azure has a similar machine-learning system that predicts failures of compute nodes. In both cases, instead of trying to definitively predict whether a specific piece of hardware is failing, the systems rank the hardware in order of how error-prone it is (and penalise false positives three times as much as false negatives because of the potential disruption involved in an unnecessary live migration). The top systems on the list stop accepting new VMs and have running VMs live-migrated off onto different nodes, and then get taken out of service for testing.
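To make the effect of that asymmetric penalty concrete, here is a minimal sketch that ranks nodes by a risk score and picks how many of the top-ranked nodes to flag so that the weighted cost is lowest. Only the 3:1 weighting comes from the description above; the function, the scores and the labelled history are hypothetical, and this is not presented as how Azure actually chooses its cutoff.

```python
from typing import List, Tuple

FALSE_POSITIVE_COST = 3  # healthy node flagged: an unnecessary live migration
FALSE_NEGATIVE_COST = 1  # failing node missed


def best_cutoff(scored: List[Tuple[float, bool]]) -> Tuple[int, int]:
    """Return (k, cost): flag the k highest-scored nodes for the lowest cost.

    `scored` holds (risk_score, actually_failed) pairs from labelled history.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_failures = sum(1 for _, failed in ranked if failed)

    # Flagging nothing misses every failure.
    best_k, best_cost = 0, FALSE_NEGATIVE_COST * total_failures
    false_positives = 0
    caught = 0
    for k, (_, failed) in enumerate(ranked, start=1):
        if failed:
            caught += 1
        else:
            false_positives += 1
        cost = (FALSE_POSITIVE_COST * false_positives
                + FALSE_NEGATIVE_COST * (total_failures - caught))
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Hypothetical history: (risk score, did the node really fail?)
history = [(0.95, True), (0.90, True), (0.70, False),
           (0.65, True), (0.40, False), (0.10, False)]
cutoff, cost = best_cutoff(history)
print(f"flag the top {cutoff} nodes (weighted cost {cost})")
```

With these made-up numbers, the heavier false-positive penalty stops the cutoff at the top two nodes: reaching the third real failure would mean flagging a healthy node first, which costs more than missing that failure.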
Reacting to failure predictions
For most VMs, live migration won't affect the workload. Before migration starts, the orchestrator picks the best node to migrate to, exports the configuration of the VM and sets up the authorisation. The 'brownout' stage copies the entire VM to the new node, including the memory and disk state and network connections. That can take between one and 30 minutes, depending on the size of the VM and how quickly the information in memory is changing. Once the brownout finishes, the VM is suspended on both the original and new node, while the live migration agent copies any state information that didn't make it across already. This 'blackout' phase also depends on how much state needs to be copied, but it usually only takes a few seconds.
If your workload is very performance intensive, there might be some performance impact during the 'brownout' while the copying is going on, and there are some applications that can't cope with even the few seconds of interruption, while others can't be live migrated and have to be automatically redeployed. Specialised machine types like HPC, memory-optimised, GPU-optimised and storage-optimised instances, or the extremely cheap A series VMs - that run on the oldest servers in Azure - can't be live migrated.
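The dependence of both phases on VM size and on how fast memory changes can be seen in a toy model. The sketch below assumes an iterative 'pre-copy' scheme, a common live-migration technique in which each round re-sends whatever memory changed during the previous round; whether Azure works exactly this way is an assumption, and every name and number here is hypothetical.

```python
from typing import Tuple


def simulate_migration(memory_mb: float,
                       dirty_rate_mb_s: float,
                       bandwidth_mb_s: float,
                       blackout_budget_mb: float,
                       max_rounds: int = 30) -> Tuple[float, float]:
    """Return (brownout_seconds, blackout_seconds) for a toy pre-copy model."""
    remaining = memory_mb  # state still to be copied while the VM keeps running
    brownout = 0.0
    for _ in range(max_rounds):
        if remaining <= blackout_budget_mb:
            break  # small enough to copy while the VM is suspended
        round_time = remaining / bandwidth_mb_s
        brownout += round_time
        # Memory written during this round has to be sent again.
        remaining = min(memory_mb, dirty_rate_mb_s * round_time)
    # Blackout: the VM is suspended while the leftover state is copied.
    blackout = remaining / bandwidth_mb_s
    return brownout, blackout


# Hypothetical 16GB VM over a link that copies 1,000MB per second.
brownout, blackout = simulate_migration(16384, 50, 1000, 100)
busy_brownout, busy_blackout = simulate_migration(16384, 200, 1000, 100)
print(f"quiet VM: brownout {brownout:.1f}s, blackout {blackout:.3f}s")
print(f"busy VM:  brownout {busy_brownout:.1f}s, blackout {busy_blackout:.3f}s")
```

In this model a VM that rewrites memory faster needs more copy rounds, so its brownout is longer; if memory changes faster than it can be copied, the rounds never shrink and the `max_rounds` cap forces a long blackout.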
If your workload can't cope with any interruption at all, you might want to refactor it and use a PaaS service rather than a VM for the critical piece. If you don't want to make changes, or you use one of the specialised instances, use the Scheduled Events service to get a notification that either maintenance or predicted failure is going to mean your VM getting live migrated (it also warns you if one of the cheaper low-priority VMs in your scale set is going to get evicted to make way for a higher-priority VM).
Scheduled Events tells you whether your VM is going to be paused, redeployed (losing ephemeral disks) or deleted because of priority. You also get notifications for reboots that you schedule yourself.
Low-priority VMs are cheap because they can be deleted when higher-priority tasks come along, so you might not get much notice (the minimum is 30 seconds) - but you get at least ten minutes' warning for redeployments and at least 15 minutes for pauses and reboots. If the live migration or redeployment is happening because of a predicted failure, you might well get several days' notice before the failure happens and the service will try to delay the failure in various ways - although obviously, as it's a prediction, there are no guarantees when the failure will actually happen.
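A small sketch of how those event types and notice periods might be handled in code follows. The field names (`Events`, `EventType`, `NotBefore`) and the type names `Freeze`, `Reboot`, `Redeploy` and `Preempt` are given as I understand the Scheduled Events response format, so check them against the current Azure documentation; the sample document, the event ID and the helper function are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Dict, List

# Event type -> (what happens to the VM, minimum notice in seconds).
EVENT_INFO = {
    "Freeze": ("paused for a few seconds", 15 * 60),
    "Reboot": ("rebooted", 15 * 60),
    "Redeploy": ("moved to another node, ephemeral disks lost", 10 * 60),
    "Preempt": ("low-priority VM deleted", 30),
}


def summarise(document: Dict, now: datetime) -> List[Dict]:
    """List each pending event with its impact and the time left before it."""
    summary = []
    for event in document.get("Events", []):
        impact, min_notice = EVENT_INFO.get(event["EventType"], ("unknown", 0))
        not_before = event.get("NotBefore")  # may be empty once an event starts
        seconds_left = None
        if not_before:
            seconds_left = (parsedate_to_datetime(not_before) - now).total_seconds()
        summary.append({
            "id": event["EventId"],
            "impact": impact,
            "min_notice_s": min_notice,
            "seconds_left": seconds_left,
        })
    return summary


# Hypothetical response document, for illustration only.
sample = {
    "DocumentIncarnation": 1,
    "Events": [{
        "EventId": "hypothetical-id-1",
        "EventType": "Redeploy",
        "ResourceType": "VirtualMachine",
        "Resources": ["vm-0"],
        "EventStatus": "Scheduled",
        "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT",
    }],
}
now = datetime(2016, 9, 19, 18, 19, 47, tzinfo=timezone.utc)
summary = summarise(sample, now)
print(summary)
```

Comparing `seconds_left` with how long your own preparation takes tells you whether the guaranteed notice is enough for that event type.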
Take the example of one drive that the forecasting system predicted had a very high probability of failing, which would take down five VMs running on the node. Because the probability was so high, live migration started eleven minutes after the prediction was made and blackout times for the five VMs ranged from 0.1 to 1.6 seconds. The Azure team took the node out of service for testing, including a disk stress test - which it failed 4 hours and 21 minutes after the first warning.
If the hardware on one of the nodes you're using triggers a Scheduled Event notification, the event will include when the hardware was detected as expected to fail and the 'not before' time after which the VM will be moved (assuming the hardware doesn't fail in the meantime). That might change as Azure detects more worrying signals from the node.
You can take control yourself and choose to checkpoint the VM ready to be restored, drain connections, fail over, take it out of your load balancer pool, or follow whatever process you have set up to get your workload ready to shut down. That should be automated, because the events can easily come in the middle of the night. Once the preparation is done, you can approve the event and Azure will run the live migration as soon as possible to get you off the degraded hardware.
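The prepare-then-approve flow described above could be automated along the following lines. The endpoint address, the `Metadata: true` header, the API version and the `StartRequests` body reflect the Scheduled Events API as I understand it and should be verified against the current Azure documentation; the `prepare` callback, the VM name and the sample document are hypothetical. The two functions that talk to the metadata endpoint only work from inside an Azure VM, so the example exercises only the pure logic.

```python
import json
import urllib.request
from typing import Callable, Dict

# Assumed metadata endpoint and API version; verify before relying on them.
ENDPOINT = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_events() -> Dict:
    """Poll the endpoint for pending events (only reachable inside an Azure VM)."""
    request = urllib.request.Request(ENDPOINT, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def approval_body(document: Dict, vm_name: str,
                  prepare: Callable[[Dict], None]) -> Dict:
    """Prepare for each scheduled event that affects `vm_name`, then approve it."""
    approved = []
    for event in document.get("Events", []):
        if event.get("EventStatus") != "Scheduled":
            continue  # already started, nothing left to approve
        if vm_name not in event.get("Resources", []):
            continue  # event is for another VM
        prepare(event)  # e.g. drain connections, checkpoint, leave the pool
        approved.append({"EventId": event["EventId"]})
    return {"StartRequests": approved}


def approve(body: Dict) -> None:
    """POST the approval so the platform can act before the NotBefore time."""
    request = urllib.request.Request(
        ENDPOINT, data=json.dumps(body).encode("utf-8"),
        headers={"Metadata": "true"}, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


# Hypothetical document and preparation step, for illustration only.
sample = {"Events": [
    {"EventId": "hypothetical-id-1", "EventType": "Redeploy",
     "EventStatus": "Scheduled", "Resources": ["vm-0"]},
    {"EventId": "hypothetical-id-2", "EventType": "Freeze",
     "EventStatus": "Started", "Resources": ["vm-0"]},
]}
prepared = []
body = approval_body(sample, "vm-0", lambda event: prepared.append(event["EventId"]))
print(json.dumps(body))
```

Run on a timer inside the VM, a loop of `fetch_events()`, `approval_body()` and `approve()` would handle events that arrive in the middle of the night without anyone being paged.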
Even if you can't tweak your VM so live migration isn't a problem, you can use the event to schedule a snapshot or route less traffic to the VM around the planned time. That gives you enough control to take advantage of machine learning predictions even for more performance-sensitive workloads.