incident prevention & resolution
Preventing and resolving malfunctions in the platform

category

platform

Designing, building and continuously improving the cloud platform (including landing zones, IAM, connectivity and integration) for the sole purpose of serving the business

capability

platform operations

Enabling the business to increase innovation in a controlled way by providing a platform on which operational processes are automated as much as possible

Overview

A self-healing foundation is a joy for everyone

Monitoring the health of solutions and services, and the occurrence of abnormal events, plays a crucial role in preventing and resolving incidents. Atypical events (anomalies) are detected, with or without the use of artificial intelligence techniques. The aim is to automate preventive and corrective actions, resulting in a virtually “opsless” platform: one that, once it has been built, requires almost no manual intervention from engineers.
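As an illustration of detecting atypical events without artificial intelligence techniques, the following is a minimal sketch that flags metric samples deviating strongly from the recent past and passes them to a corrective action. The function names, the window size, the threshold and the `remediate` callback are all hypothetical and chosen for this example; a real platform would typically rely on the monitoring services of its cloud provider.

```python
from statistics import mean, stdev
from typing import Callable, List, Sequence


def find_anomalies(samples: Sequence[float], window: int = 5,
                   threshold: float = 3.0) -> List[int]:
    """Return indices of samples that deviate strongly from the preceding window.

    Hypothetical helper: a sample is flagged when it lies more than
    `threshold` standard deviations from the mean of the previous
    `window` samples.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # A perfectly flat history has no spread to compare against; skip it
        # here rather than divide by zero (an assumption of this sketch).
        if sigma == 0:
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def handle_anomalies(samples: Sequence[float],
                     remediate: Callable[[int, float], None],
                     window: int = 5, threshold: float = 3.0) -> int:
    """Call the corrective action for every anomaly; return how many were handled."""
    indices = find_anomalies(samples, window, threshold)
    for i in indices:
        remediate(i, samples[i])
    return len(indices)


if __name__ == "__main__":
    latencies_ms = [100, 102, 98, 101, 99, 100, 500, 101]
    handle_anomalies(latencies_ms,
                     lambda i, v: print(f"sample {i} ({v} ms) looks abnormal"))
```

The corrective action is passed in as a callback so that detection stays separate from whatever remediation a team chooses, such as restarting a service or opening a backlog item.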

The goal is to provide the DevOps teams with a platform that is always available and that helps them build resilient solutions. Ways to achieve this include automatic root cause analysis, self-healing properties, immutable infrastructure and state control. Using metadata provided by the cloud service provider (CSP), systems can be designed to scale automatically, to be reprovisioned automatically in case of failure and, if needed, to be reprovisioned in another region in the event of a full-blown disaster.
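The combination of state control, immutable infrastructure and regional failover can be sketched as a small reconciliation loop. This is a simulation under stated assumptions, not a CSP API: the `Region` class, the `reconcile` function and the instance naming scheme are hypothetical, and health information that would normally come from CSP metadata is modelled as a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Region:
    """Hypothetical in-memory model of a region and its instances."""
    name: str
    available: bool = True
    # instance id -> healthy? (stands in for health metadata from the CSP)
    instances: Dict[str, bool] = field(default_factory=dict)


def reconcile(desired: int, regions: List[Region]) -> Optional[str]:
    """Bring the first available region up to `desired` healthy instances.

    Returns the name of the region now serving, or None if every region is down.
    """
    for region in regions:
        if not region.available:
            continue  # regional outage: fall through to the next region
        # Immutable infrastructure: unhealthy instances are discarded,
        # not repaired in place.
        region.instances = {i: ok for i, ok in region.instances.items() if ok}
        counter = 0
        while len(region.instances) < desired:
            new_id = f"{region.name}-{counter}"
            counter += 1
            if new_id not in region.instances:
                region.instances[new_id] = True  # "provision" a replacement
        return region.name
    return None


if __name__ == "__main__":
    primary = Region("primary", instances={"primary-0": True, "primary-1": False})
    secondary = Region("secondary")
    print(reconcile(2, [primary, secondary]))  # heals the primary region
    primary.available = False                  # simulate a regional disaster
    print(reconcile(2, [primary, secondary]))  # reprovisions in the secondary
```

Running such a loop on a schedule, or in response to health events, is one way to keep actual state aligned with desired state without manual intervention.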

Solution-specific monitoring, logging, event detection and handling are decentralised to the responsible DevOps teams, but these teams benefit from a resilient, always-available platform (and shared services) that helps them build solutions that tolerate failures.

The CSP’s offerings should be consumed as services rather than infrastructure building blocks, as these services will have self-healing properties already built in.

Activities checklist

Initial:

  • Defining an approach for achieving a self-healing platform and shared services
  • Conducting initial tests to determine the tolerance of the platform to faults
  • Adding weaknesses to the backlog and remediating them
  • Implementing a process that governs the continuous improvement of the platform in terms of fault tolerance
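An initial fault-tolerance test can be as simple as injecting a known number of failures into a dependency and checking whether the caller still succeeds. The sketch below is illustrative only: `FlakyDependency`, `call_with_retries` and `tolerates` are hypothetical names, and a retry loop is just one of many fault-handling techniques that could be tested this way.

```python
from typing import Callable


class FlakyDependency:
    """Hypothetical dependency that fails a fixed number of times, then recovers."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retries(operation: Callable[[], str], attempts: int = 3) -> str:
    """Call `operation` up to `attempts` times, re-raising the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def tolerates(failures: int, attempts: int = 3) -> bool:
    """Report whether the retry logic survives `failures` consecutive faults."""
    try:
        return call_with_retries(FlakyDependency(failures), attempts) == "ok"
    except ConnectionError:
        return False


if __name__ == "__main__":
    for n in range(5):
        print(f"{n} injected faults tolerated: {tolerates(n)}")
```

Cases where `tolerates` returns False are the kind of weakness that would be added to the backlog.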

Recurring:

  • Supporting DevOps teams in designing for self-healing
  • Supporting DevOps teams in configuring and using tools and techniques
  • Automating the resolution of issues
  • Prioritising issues that lead to constructive improvements
  • Frequently assessing the CSP’s new managed services to determine potential improvements in fault tolerance
  • Reviewing every incident and aiming to implement an improvement to the design of the platform

RASCI

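  • cloud consultant: (none listed)
  • cloud architect: consulting
  • cloud security specialist: (none listed)
  • cloud developer: informed
  • cloud engineer: accountable
  • cloud analyst: (none listed)
  • product owner CCoE: (none listed)
  • management: (none listed)
  • transformation consultant: (none listed)
  • cloud partners: (none listed)
  • DevOps team: informed
  • business stakeholder: (none listed)
  • architecture: (none listed)
  • security: (none listed)
  • finance: (none listed)
  • procurement: (none listed)
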