Stefan Fried | January 31, 2019 | 5 minutes read

Oh, no! Not another Office 365 outage! What’s your backup plan?

Last week several organizations using Office 365 across the world were impacted by an outage where the Exchange Online service was not fully available for several hours. Monitoring your Office 365 installation is a critical first step in getting the information you need on your enterprise applications in real time.


Last week several organizations across the world were impacted by an Office 365 outage. The Exchange Online service was not fully available for several hours. Some users couldn’t access their mailboxes; for others, mail delivery (sending and receiving) was simply slow.

Whoomp! There it is…

The consequences are obvious: loss of productivity, a bad end-user experience, amplified end-user frustration, loss of business speed and loss of trust. And those are just a few of the many possible business-critical impacts.

Interestingly enough, the case under which this incident was logged (EX172491) has since been removed by Microsoft.

We all know that there is a certain risk whenever cloud services are used. One key question always remains though – How well are you prepared if something like this happens to your organization? Or in other words – What is your backup plan?

It is indeed a fundamental question for the many end users, administrators and businesses who rely on stable, high-performance cloud offerings on a daily basis.

Take a Walk on the Safe Side

Monitoring your Office 365 installation is a critical first step in getting the information you need on your enterprise applications in real time. You can’t effectively manage a vitally important part of your application infrastructure unless you know how it’s performing. Early insights on availability will help you prepare for outages.

Knowing who is impacted is an important element in managing the issue (e.g. notifying your end-users). It may be only a group of people, a subset of users (if the Multi-Geo capabilities of Office 365 are used) or the entire organization using the cloud tenant.

With OfficeExpert we offer a solution that helps you to identify the magnitude of the possible impact.

Furthermore, organizations using the Mail Flow Simulation Sensor by OfficeExpert could have seen that the system was only partially restored: accessing the mailbox worked again, but the underlying service for sending and receiving mail was still impaired by the incident. The following screenshot shows a steady increase in mail delivery time between January 23rd and 26th.
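To illustrate the kind of check behind such a chart, here is a minimal Python sketch that compares recent mail delivery times against an earlier baseline. It is not how the Mail Flow Simulation Sensor is implemented; the function name is_degraded, its parameters and the sample values are assumptions made up for illustration.

```python
from statistics import mean


def is_degraded(delays_seconds, baseline_count=3, factor=2.0):
    """Return True if recent delivery times clearly exceed the baseline.

    delays_seconds: delivery times of test mails in seconds, oldest first.
    baseline_count and factor are hypothetical tuning values.
    """
    if len(delays_seconds) <= baseline_count:
        return False  # not enough measurements to compare
    baseline = mean(delays_seconds[:baseline_count])
    recent = mean(delays_seconds[baseline_count:])
    return recent > factor * baseline


# Made-up sample values: one series rising steadily, one staying flat.
rising = [5, 6, 5, 12, 20, 30]
stable = [5, 6, 5, 6, 5]
print(is_degraded(rising))
print(is_degraded(stable))
```

A comparison like this would flag the rising series even though the mailbox itself is reachable again, which is the situation described above.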

Ensure Solid Business Continuity for Your End-Users

This transparency helps you know that a particular service is not fully restored. It also helps you understand how you can plan and communicate accordingly. At the end of the day, this naturally benefits the end user too.

Monitoring notifications ensure that you are the first to find out that an issue exists, even before Microsoft tweets about it hours later. Knowing which services are affected allows you to work proactively by notifying your users and applying contingency plans before being inundated with user tickets.
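As a rough idea of how such a notification rule could work, the following Python sketch raises an alert only after several probe failures in a row, so that a single hiccup does not trigger a message. The function name should_notify and the threshold of three are hypothetical and do not describe any particular product.

```python
def should_notify(results, threshold=3):
    """Return True if the last `threshold` probes all failed.

    results: list of booleans, oldest first; True means the probe succeeded.
    threshold is an assumed value; tune it to your probe interval.
    """
    if len(results) < threshold:
        return False  # too few measurements to decide
    return not any(results[-threshold:])


# Made-up probe history: one success followed by three failures.
print(should_notify([True, False, False, False]))
```

Requiring consecutive failures trades a few minutes of detection time for fewer false alarms.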


UPDATE: Further Outage on January 29th!

Another major outage happened on January 29th, 2019, when users were unable to authenticate and access Office 365 services. Azure was also affected by this incident. The root cause communicated by Microsoft was a DNS issue at CenturyLink, which Microsoft used as a DNS provider.

The following screenshot shows how OfficeExpert detected and measured this outage. The Skype for Business service had a downtime of almost 3 hours. Other services such as Exchange Online were impacted for around 1 hour. The failure indicator (error message in the screenshot) states that a certain fully qualified domain name could not be resolved. This matches the root cause statement by Microsoft exactly.
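A name resolution check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the probe OfficeExpert uses: the function name can_resolve and the injectable resolver parameter are made up for this example, and the host name in the usage line is just an example endpoint.

```python
import socket


def can_resolve(fqdn, resolver=socket.getaddrinfo):
    """Return True if the name resolves, False on a DNS resolution error.

    resolver is injectable (an assumption of this sketch) so the check
    can be exercised without network access.
    """
    try:
        resolver(fqdn, 443)
        return True
    except socket.gaierror:
        return False


# Example usage (requires network access):
# print(can_resolve("outlook.office365.com"))
```

A False result for a name that normally resolves points to a DNS problem rather than to the service behind the name.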

UPDATE: O365 Outage on May 2nd!

On May 2nd at 10:10pm CEST (1:10pm PDT) Microsoft sent out the following message: "We’re aware of and investigating an issue affecting access to SharePoint and OneDrive. Further details can be found in the admin center under SP178746 and OD178975."

At first Microsoft was unable to get any information out to its community. Users worldwide were forced to turn to social media rumor mills to find out why they were having problems. The affected core services included Azure, multiple Microsoft 365 services, Dynamics, and DevOps, with a negative impact on productivity.

The screenshot below shows that OfficeExpert identified an outage at 9:50pm CEST, a full 20 minutes before the first Microsoft communication was sent.*

We’re very pleased with the positive feedback we received from our customers using OfficeExpert. They were able to identify the global outage of related sub-services for themselves before it was made public.

This was similar to the global Azure outage in January, when it took over 1 hour for Office 365 services to be restored. It again raises the question of how best to minimize the impact of cloud outages on your business.

You can read more about this topic in our white paper. German speakers may also be interested in listening to the webinar with MVP Michael Greth and Stefan Fried on what to do during a cloud outage.

* according to publicly available sources


About the Author

Stefan Fried

Senior Program Manager & Senior Consultant

Stefan Fried has more than 16 years of experience in collaboration solution environments.
