Outage Incident - 23/07/19 - Web App Impact
Incident Report for CyberSmart
Resolved
Issue Summary

- Total outage time: ~2.5 hours
- All users were unable to access the CyberSmart Web platform due to a third-party component failure.
- All customer and application HTTP requests to the platform resulted in 502 errors.
- Amazon Web Services (AWS), the third-party hosting provider with which a number of our key infrastructure components are hosted, experienced an outage.

Timeline (GMT)

- 16:33: Issue began
- 16:50: Staff were notified of the issue
- 19:00: Issue resolved (by external service provider)
- 19:03: CyberSmart platform back online
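As a quick check of the total outage time, the duration can be derived from the first and last timeline entries. This is only a minimal sketch; the function name `outage_duration_hours` is illustrative and is not part of any CyberSmart tooling.

```python
from datetime import datetime


def outage_duration_hours(start: str, end: str) -> float:
    """Return the hours elapsed between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


# Timeline values from this report: issue began 16:33, platform back online 19:03.
print(outage_duration_hours("16:33", "19:03"))  # 2.5
```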

Root Cause

AWS had issues with several of its platform infrastructure services, including degraded performance for EBS volumes within the “EU-WEST-2” Region. EBS is a key part of the RDS component that CyberSmart uses for data storage.

Resolution and recovery

N/A: the issue was resolved by the external service provider (see timeline).

Corrective and Preventative Measures

We have planned a workstream to improve failover within CyberSmart, including the use of PaaS offerings distributed across different geographical regions. This will allow automatic corrective measures to keep our services online when a given region has issues.
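The idea of region-level failover can be illustrated with a minimal sketch: given an ordered list of regions and a health check, route traffic to the first region that reports healthy. Everything here is an assumption made for illustration, including the function name `pick_healthy_region`, the choice of a secondary region, and the hard-coded health status; it does not describe the planned CyberSmart design.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region (in priority order) that passes the health check."""
    for region in regions:
        if is_healthy(region):
            return region
    return None  # no healthy region available


# Hypothetical status snapshot: the primary region is degraded, the secondary is fine.
status = {"eu-west-2": False, "eu-west-1": True}

active = pick_healthy_region(
    ["eu-west-2", "eu-west-1"],          # priority order, primary first (assumed)
    lambda region: status.get(region, False),
)
print(active)  # eu-west-1
```

In practice the health check would probe real endpoints rather than read a dictionary, and the switch would typically be handled by DNS or load-balancer failover rather than application code.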
Posted Aug 07, 2019 - 11:01 BST
This incident affected: CyberSmart Platform (CyberSmart Apps, CyberSmart Dashboard).