Overview
“A self-healing foundation is a joy for everyone”
Monitoring the health of solutions and services, and the occurrence of abnormal events, plays a crucial role in preventing and resolving incidents. Events that are regarded as atypical (anomalies) are detected, with or without the use of artificial intelligence techniques. The aim is to automate preventive and corrective actions, resulting in a virtually “opsless” platform: one that, once it has been built, requires almost no manual intervention from engineers.
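As an illustration of anomaly detection without artificial intelligence techniques, the sketch below flags a sample that deviates strongly from a rolling baseline and then triggers a corrective action. It is a minimal example under stated assumptions: the class name `AnomalyWatcher`, the window size, the z-score threshold of 3.0 and the latency values are all hypothetical, and a real platform would normally rely on the monitoring services of the CSP.

```python
from collections import deque
from statistics import mean, stdev
from typing import Callable, Deque, List


class AnomalyWatcher:
    """Flags samples that deviate strongly from a rolling baseline."""

    def __init__(self, on_anomaly: Callable[[float], None],
                 window: int = 20, threshold: float = 3.0) -> None:
        self.samples: Deque[float] = deque(maxlen=window)
        self.on_anomaly = on_anomaly  # hypothetical corrective action
        self.threshold = threshold    # assumed z-score limit

    def observe(self, value: float) -> bool:
        """Record a sample; run the corrective action if it is atypical."""
        is_anomaly = False
        if len(self.samples) >= 5:  # a minimal baseline is needed first
            baseline = mean(self.samples)
            spread = stdev(self.samples)
            if spread > 0 and abs(value - baseline) / spread > self.threshold:
                is_anomaly = True
                self.on_anomaly(value)
        if not is_anomaly:
            # Keep anomalies out of the baseline so they do not mask later ones.
            self.samples.append(value)
        return is_anomaly


# Illustrative data: steady latency followed by one atypical sample.
remediated: List[float] = []
watcher = AnomalyWatcher(on_anomaly=remediated.append)
for latency_ms in [100, 102, 98, 101, 99, 100, 103, 950]:
    watcher.observe(latency_ms)
```

Here the callback only records the offending value. In practice it would be replaced by a preventive or corrective action, such as restarting or replacing a component.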
The obvious goal is to provide the DevOps teams with a platform that is always available and that helps them build resilient solutions. Examples of how to achieve this are automatic root cause analysis, self-healing properties, immutable infrastructure and state control. Using metadata provided by the CSP, systems can be designed to scale automatically and to be reprovisioned automatically in case of failure, if needed in another region in the event of a full-blown disaster.
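State control combined with immutable infrastructure can be pictured as a reconciliation step: compare the desired state with the actual state, replace anything unhealthy, and fall back to another region when the first one is unavailable. The sketch below is a simplified illustration, not a CSP API. `Instance`, `ProvisioningError`, `reconcile`, `fake_provision` and the region names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class ProvisioningError(Exception):
    """Raised by the (hypothetical) provisioner when a region is unavailable."""


@dataclass(frozen=True)
class Instance:
    name: str
    region: str
    healthy: bool


def reconcile(desired: List[str], actual: Dict[str, Instance],
              provision: Callable[[str, str], Instance],
              regions: List[str]) -> Dict[str, Instance]:
    """Return a state in which every desired instance is healthy.

    Unhealthy or missing instances are replaced rather than repaired
    (immutable infrastructure). Regions are tried in order, so a later
    region is only used when the earlier ones fail.
    """
    result: Dict[str, Instance] = {}
    for name in desired:
        current = actual.get(name)
        if current is not None and current.healthy:
            result[name] = current
            continue
        for region in regions:
            try:
                result[name] = provision(name, region)
                break
            except ProvisioningError:
                continue  # try the next region
        else:
            raise ProvisioningError(f"no region could host {name}")
    return result


def fake_provision(name: str, region: str) -> Instance:
    """Stand-in provisioner that simulates an outage of region-a."""
    if region == "region-a":
        raise ProvisioningError(region)
    return Instance(name, region, healthy=True)


state = reconcile(
    desired=["api", "worker"],
    actual={
        "api": Instance("api", "region-a", healthy=False),
        "worker": Instance("worker", "region-a", healthy=True),
    },
    provision=fake_provision,
    regions=["region-a", "region-b"],
)
```

In this run the unhealthy `api` instance is reprovisioned in `region-b`, while the healthy `worker` instance is left untouched.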
Solution-specific monitoring, logging, event detection and handling are decentralised to the responsible DevOps teams, but these teams will benefit from a resilient platform (and shared services) that is always available and that helps them build solutions that are tolerant of failures.
The CSP’s offerings should be consumed as services rather than infrastructure building blocks, as these services will have self-healing properties already built in.
Activities checklist
Initial:
- Defining an approach to achieving a self-healing platform and shared services
- Conducting initial tests to determine the tolerance of the platform to faults
- Adding weaknesses to the backlog and remediating them
- Implementing a process that governs the continuous improvement of the platform in terms of fault tolerance
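An initial fault tolerance test can be as simple as injecting a controlled number of failures and recording where the platform stops recovering. The sketch below is one possible shape for such a test, under stated assumptions: `flaky`, `call_with_retry`, the retry budget of three attempts and the backlog wording are all hypothetical.

```python
from typing import Callable, List


def flaky(func: Callable[[], str], failures: int) -> Callable[[], str]:
    """Wrap func so that its first `failures` calls raise ConnectionError."""
    remaining = [failures]

    def wrapper() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("injected fault")
        return func()

    return wrapper


def call_with_retry(func: Callable[[], str], attempts: int) -> str:
    """Call func up to `attempts` times, re-raising the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts:
                raise
    raise ValueError("attempts must be at least 1")


# Inject 1, 2 and 3 consecutive faults; record every case that is not survived.
backlog: List[str] = []
for injected in (1, 2, 3):
    try:
        call_with_retry(flaky(lambda: "ok", injected), attempts=3)
    except ConnectionError:
        backlog.append(f"service does not survive {injected} consecutive faults")
```

Each entry that ends up in `backlog` corresponds to a weakness to be remediated, which ties the test results to the backlog item above.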
Recurring:
- Supporting DevOps teams designing for self-healing
- Supporting DevOps teams in configuring/using tools and techniques
- Automating the resolution of issues
- Prioritising issues that lead to constructive improvements
- Frequently assessing the CSP’s new managed services to determine potential improvements in fault tolerance
- Reviewing every incident with the aim of implementing an improvement to the design of the platform
RASCI
| cloud consultant | transformation consultant | ||
| cloud architect |