incident prevention & resolution

Preventing and resolving malfunctions in the platform

platform

Designing, building and continuously improving the cloud platform (including landing zones, IAM, connectivity and integration) for the sole purpose of serving the business

capability

platform operations

Enabling the business to increase innovation in a controlled way by providing a platform so the operational processes are automated as much as possible

Overview

“A self-healing foundation is a joy for everyone”

Monitoring the health of solutions and services, and watching for abnormal events, plays a crucial role in preventing and resolving incidents. Atypical events (anomalies) are detected, with or without the use of artificial intelligence techniques. The aim is to automate preventive and corrective actions, resulting in a virtually “opsless” platform: one that, once it has been built, requires almost no manual intervention from engineers.
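As an illustration of detecting anomalies without artificial intelligence techniques, the sketch below flags a metric sample that deviates strongly from its recent history and then triggers a corrective action. It is a minimal example under stated assumptions, not a prescribed implementation: the class name RollingAnomalyDetector, the helper handle_sample, the window size and the threshold of three standard deviations are all hypothetical choices made for this example.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flags a sample that deviates strongly from recent history (hypothetical helper)."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.history = deque(maxlen=window)  # most recent normal samples
        self.threshold = threshold           # allowed distance in standard deviations
        self.min_samples = min_samples       # samples needed before judging

    def observe(self, value: float) -> bool:
        """Return True when the value is regarded as an anomaly."""
        is_anomaly = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma == 0:
                # A perfectly flat history: any change counts as atypical.
                is_anomaly = value != mu
            else:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        # Learn only from normal samples so an outage does not become the baseline.
        if not is_anomaly:
            self.history.append(value)
        return is_anomaly


def handle_sample(detector, value, corrective_action):
    """Run the corrective action automatically when a sample is anomalous."""
    if detector.observe(value):
        corrective_action(value)
        return True
    return False
```

In practice the corrective action would be an automated runbook step, such as restarting or replacing a component, rather than a page to an engineer. The fixed threshold is the simplest possible rule; a managed anomaly detection service from the CSP could take its place.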

The obvious goal is to provide the DevOps teams with a platform that is always available and that helps them build resilient solutions. Ways to achieve this include automatic root cause analysis, self-healing properties, immutable infrastructure and state control. Using metadata provided by the CSP, systems can be designed to scale automatically, to be reprovisioned automatically in case of failure and, if needed, to be reprovisioned in another region in the event of a full-blown disaster.
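State control and automatic reprovisioning can be pictured as a reconciliation loop that compares the desired state with the actual state and closes the gap. The sketch below is a hedged illustration only: FakeCloud is an in-memory stand-in for a CSP API, and its methods, the reconcile function and the region names are hypothetical and do not correspond to any real provider SDK.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """In-memory stand-in for a CSP API (hypothetical, for illustration only)."""
    healthy_regions: set
    instances: dict = field(default_factory=dict)  # instance id -> region
    _next_id: int = 0

    def provision(self, region: str) -> str:
        if region not in self.healthy_regions:
            raise RuntimeError(f"region {region} is unavailable")
        self._next_id += 1
        instance_id = f"i-{self._next_id}"
        self.instances[instance_id] = region
        return instance_id

    def running(self) -> list:
        # Instances in an unhealthy region are treated as failed.
        return [i for i, r in self.instances.items() if r in self.healthy_regions]


def reconcile(cloud: FakeCloud, desired_count: int, regions: list) -> list:
    """Restore the desired instance count, preferring regions in the given order.

    Returns the ids of newly provisioned instances.
    """
    # Immutable infrastructure: failed instances are discarded, not repaired.
    for instance_id, region in list(cloud.instances.items()):
        if region not in cloud.healthy_regions:
            del cloud.instances[instance_id]

    created = []
    missing = desired_count - len(cloud.running())
    for _ in range(max(missing, 0)):
        for region in regions:
            try:
                created.append(cloud.provision(region))
                break
            except RuntimeError:
                continue  # fall back to the next region in the list
        else:
            raise RuntimeError("no healthy region available")
    return created
```

Calling reconcile on a schedule, or in response to health events, gives the self-healing behaviour: while the primary region is healthy nothing changes, and when it fails the same call reprovisions the workload in the next region. The sketch only scales up; removing surplus instances is left out for brevity.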

Solution-specific monitoring, logging, event detection and event handling are decentralised to the responsible DevOps teams. These teams nevertheless benefit from a resilient, always-available platform (and shared services) that helps them build solutions that tolerate failures.

The CSP’s offerings should be consumed as services rather than infrastructure building blocks, as these services will have self-healing properties already built in.

Activities checklist

Initial:

Defining an approach to achieving a self-healing platform and shared services
Conducting initial tests to determine the tolerance of the platform to faults
Adding weaknesses to the backlog and remediating them
Implementing a process that governs the continuous improvement of the platform in terms of fault tolerance

Recurring:

Supporting DevOps teams in designing for self-healing
Supporting DevOps teams in configuring and using tools and techniques
Automating the resolution of issues
Prioritising issues that lead to constructive improvements
Frequently assessing the CSP’s new managed services to determine potential improvements in fault tolerance
Reviewing every incident and aiming to implement an improvement to the design of the platform

RASCI

role                        RASCI         role                        RASCI
cloud consultant                          transformation consultant
cloud architect             consulting    cloud partners
cloud security specialist                 DevOps team                 informed
cloud developer             informed      business stakeholder
cloud engineer              accountable   architecture
cloud analyst                             security
product owner CCoE                        finance
management                                procurement

Michiel de van der Schueren