Service Level Agreements (SLAs) can be confusing in the cloud
On Nov. 19, 2014, the IT department of a Texas contracting company started
getting reports that the Microsoft Office 365 cloud-based email system was
unavailable to its employees. Users couldn’t get email on their phones or via
Outlook. As the day rolled on, some users’ email came back while others’ didn’t.
When US workers signed off, international employees started reporting similar
issues.
For some users, email was out for 24 hours.
After the outage IT leaders huddled and filed a claim with Microsoft for a
breach of the company’s service-level agreement (SLA), which guarantees that
Office and other Microsoft Online services will be available 99.9% of a given
month. If the service is available for less than that, a 25% credit can be
issued to customers. But the response they got from Microsoft surprised them:
Web access was still available so the service was not technically unavailable
and therefore it was not a breach of the SLA.
“The number of people willing, able and knowledgeable enough to use that option
is pretty low,” said a senior member of the IT staff, who requested anonymity so
he doesn’t sour his relationship with Microsoft. In response, the contracting
company has since educated employees on how to use web email access when Outlook
is down.
In response to a request for comment on the situation, Microsoft issued a
statement saying it strives for “an always available service” and that SLAs are
in place to provide financial reassurance to that commitment. If a Microsoft
online service is available for less than 95% of a given month, customers can
get a full statement credit for that period.
This episode, however, illustrates the need to understand all terms and
conditions in cloud SLAs. Enterprise agreements can be complicated so here are
10 things to watch out for when reviewing SLAs for Microsoft Office 365 (the
SaaS offering) and Microsoft Azure (which includes IaaS and PaaS components).
Many of the tips apply to other cloud platforms too, such as AWS, but they are
written specifically for Microsoft cloud services. Microsoft publishes both its
list of Azure IaaS SLA uptime guarantees and its online services SLA.
Read the contract and all the supporting documentation
This may seem obvious, but many people don’t actually read the contract,
just like they skim over End User License Agreements. “I run into an amazing
number of people who zip through a PowerPoint and then sign the contract,” says
Paul DeGroot, who works as a consultant at Pica Communications advising clients
on Microsoft licensing. If you don’t understand something in the contract after
analyzing it, ask for help. The key to understanding your SLA is reading it.
Contracts can be confusing though. DeGroot says sometimes relevant
information is in a supporting document. SLA parameters can be outlined in one
section of a document but the contract can be subject to terms that are defined
in other literature. Make sure to read the entire contract, including any
supporting documents.
SLA breaches must be reported
Some providers will automatically credit customers when there is an outage;
others will not. It is imperative that customers report any outages they believe
breach the SLA. DeGroot has run into instances where customers experienced a
multi-day outage and were sure their bill would simply reflect the event with a
credit. But if you don’t document and report it, you don’t have any way to prove
you experienced downtime. If you have a problem, record it, inform your provider
immediately and file a claim for the breach of an SLA.
Microsoft requires that customers submit an SLA breach claim to customer support
by the end of the calendar month after the month in which the event happened.
(So, for example, if an incident happens in mid-February, the customer has until
the end of March to report it.) The claim must include a detailed description of
the incident; the duration of the incident; the number of users or sites
impacted; and a description of your attempts to remedy the situation.
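The reporting window described above can be worked out mechanically. The following is a minimal Python sketch of that rule as this article states it; the function name claim_deadline is made up for illustration, and the actual deadline should always be confirmed against the current SLA text.

```python
import calendar
from datetime import date


def claim_deadline(incident: date) -> date:
    """Last day of the calendar month that follows the incident's month."""
    if incident.month == 12:
        year, month = incident.year + 1, 1
    else:
        year, month = incident.year, incident.month + 1
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


# A mid-February incident must be reported by the end of March.
print(claim_deadline(date(2015, 2, 14)))  # 2015-03-31
```

Recording the incident date as soon as an outage starts makes it easy to put this deadline on a calendar alongside the other details the claim requires.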
An SLA with 99.9% uptime still allows for 8 hours of downtime per year
Many of Microsoft’s services come with a 99.9% uptime guarantee
(three-nines). That sounds good. But being up for 99.9% of the year still allows
for 8 hours and 45 minutes of downtime each year with no breach of the SLA. How
would you feel if your workload were unavailable for 8 hours in one day? An
uptime calculator can help users predict how much downtime they should expect
from their provider based on their SLA uptime guarantee.
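The arithmetic behind such a calculator is simple enough to reproduce. The Python sketch below is an illustration only: the function name is hypothetical, a year is taken as 365 days and a month as 30 days, and real SLAs may define the measurement period differently.

```python
def allowed_downtime_minutes(uptime_percent: float, period_hours: float) -> float:
    """Minutes of downtime possible in a period without breaching the guarantee."""
    return period_hours * 60 * (1 - uptime_percent / 100)


HOURS_PER_YEAR = 365 * 24   # 8,760 hours, ignoring leap years
HOURS_PER_MONTH = 30 * 24   # assumes a 30-day month

for pct in (99.9, 99.95, 99.99):
    yearly = allowed_downtime_minutes(pct, HOURS_PER_YEAR)
    monthly = allowed_downtime_minutes(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {yearly / 60:.2f} hours/year, {monthly:.1f} minutes/month")
```

For a 99.9% guarantee this works out to roughly 8.76 hours per year, or about 43 minutes in a 30-day month.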
Each service can have its own SLA
Each individual service can have its own SLA uptime guarantee. For example,
Microsoft Azure VMs have a 99.95% uptime guarantee (if deployed across two
Availability Sets; more on that later) and the SQL database has a 99.9% uptime
guarantee. Most Microsoft Online SaaS products come with a 99.9% uptime
guarantee too. But 99.9% uptime allows for up to 43 minutes of downtime to occur
in a month without breaching the SLA.
As Troy Hunt, a blogger and Microsoft expert, points out, those
downtime events do not have to occur at the same time for the provider’s SLA to
be intact. So, for example, if you have a system that relies on Azure VMs, a SQL
database and Azure storage, then on the first day of a month an Azure VM could
go down for 21 minutes and bring your workload down. The next day Azure SQL
could go down for another 42 minutes and bring the application down. Both of
those would still be within the terms of the SLA. Blogger Brent Stineman has
explored how to calculate aggregate SLAs across multiple services.
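One common way to estimate an aggregate guarantee is to multiply the individual guarantees together, which treats the services as failing independently and the application as needing all of them at once. The Python sketch below uses that approach; it is not necessarily the method either blogger describes. The VM and SQL figures come from this article, while the 99.9% storage figure is an assumption made purely for illustration.

```python
from math import prod


def composite_availability(guarantees_percent):
    """Combined guarantee, in percent, when every service must be up at once."""
    return 100 * prod(p / 100 for p in guarantees_percent)


services = {
    "Azure VMs": 99.95,
    "Azure SQL": 99.9,
    "Azure storage": 99.9,  # assumed value, not taken from the article
}

combined = composite_availability(services.values())
# Downtime that fits inside the combined figure in a 30-day month.
minutes_per_month = 30 * 24 * 60 * (1 - combined / 100)
print(f"combined: {combined:.3f}%, about {minutes_per_month:.0f} minutes/month")
```

Under these assumptions the combined figure is about 99.75%, so a workload depending on all three services could be down for well over an hour and a half in a month with every individual SLA still intact.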
VMs may need to be deployed across multiple instances for the SLA to kick in
One of the mantras of cloud computing is to prepare for failure. In fact,
some cloud providers, including Microsoft and AWS, mandate that customers
architect their systems to be prepared for failure to meet the terms of the SLA.
AWS, for example, requires that virtual machines be deployed across multiple
Availability Zones (which are different data centers in AWS’s cloud) and both
copies of the VM must be unavailable for the SLA to be breached. Microsoft uses
the term Availability Sets instead of Availability Zones, but it’s the same
idea. Customers must heed the best-practice architectures to ensure their
systems comply with the terms of the SLA.
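To see why the redundancy requirement matters, it helps to separate the time each copy was down from the time both copies were down together, since only the latter counts under a rule like the one described above. The Python sketch below is a simplified illustration with made-up outage windows; the function name and data are hypothetical, and providers measure unavailability by their own definitions.

```python
def simultaneous_downtime(outages_a, outages_b):
    """Total minutes during which both copies were down at the same time.

    Each argument is a list of (start, end) pairs in minutes since the start
    of the month. Windows within one list are assumed not to overlap.
    """
    total = 0
    for a_start, a_end in outages_a:
        for b_start, b_end in outages_b:
            overlap = min(a_end, b_end) - max(a_start, b_start)
            if overlap > 0:
                total += overlap
    return total


vm_copy_1 = [(100, 160), (5000, 5030)]  # 90 minutes down in total
vm_copy_2 = [(140, 200)]                # 60 minutes down in total
print(simultaneous_downtime(vm_copy_1, vm_copy_2))  # 20
```

In this made-up case the two copies were down for 90 and 60 minutes respectively, but only the 20 minutes they were down together would count toward a breach.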
Migration to a healthy VM could cause downtime, which may not breach SLA
One thing to keep in mind is that if you architect your system to be fault
tolerant and to fail over to another VM or Availability Set, that action itself
could cause problems, such as a reboot. If your system goes down because it was
not set up to handle a migration to a new set of VMs then that failure is not
the provider’s fault and will not count as a breach of the SLA. Tools like
Netflix’s Simian Army Chaos Monkey and Chaos Gorilla can help AWS customers test
the tolerance of their systems to outages.
Is the service really unavailable and is it your vendor’s fault?
In the example of the Texas company above, IT staff believed the outage was
Microsoft’s fault, which it was. But the service wasn’t really unavailable
because web access was still an option, so it didn’t count against the SLA. So
if your app goes down, is it really your vendor’s fault? Is the service
unavailable from all access points? Similarly, sometimes cloud services go down
but it’s not the vendor’s fault. For Microsoft’s SLA to be breached, the service
must be down because of “circumstances within Microsoft’s control,” the company
states. When an outage occurs, check to see if there is something on your end
that caused the outage. Is your network connection to the cloud good, for
example? Customers have to prove that their vendor was at fault and the service
was truly down in order to be compensated for an SLA breach. Service health
dashboards, where Microsoft and AWS report which services have been unavailable,
are a helpful tool for determining whether your provider has had an outage.
Terms of service can change
The cloud is a fast-moving industry and offerings from providers can change.
When offerings change, so too can the SLAs. Typically SLAs will outline whether
a provider has to notify customers of a change to the service or SLA, or whether
customers should be prepared for a service disruption. But whether customers
will be informed of changes can vary from provider to provider and service to
service. If a sudden change to a service would impact your workload, check to
ensure that your provider will notify you of such changes.
Microsoft will notify customers of what it calls “disruptive changes” to its
core products, notes Donald Retallack, a research vice president at Directions
on Microsoft, a consultancy. Microsoft defines “disruptive changes” as:
“change(s) where a customer or administrator is required to take action in order
to avoid significant degradation to the normal operation of the online service.”
Microsoft promises to inform customers six months in advance of a disruptive
change to its Dynamics CRM platform, for example. But other non-disruptive
changes can occur without Microsoft notifying customers.
Planned downtime does not always count against an SLA
It is one thing for a service to go down for an unexpected reason, but sometimes
the cloud can go down because the service providers take it down. Verizon, for
example, had an almost 48-hour planned outage earlier this year. Outages like
that can mean the service is down, but it doesn’t count against the SLA.
Customers can ask their provider to ensure they will be informed of any planned
downtime.
“Preview” or beta services may not come with an SLA
Many providers offer free tiers of service or other products that are in
preview. Typically, those free and preview services are not covered by SLAs. So,
feel free to use them but make sure you understand the terms and the risks of
using them before relying on them for critical functions.