At HUMAN, we run a B2B SaaS service that handles a high throughput of requests at minimal RTT, under a very strict SLA. We pride ourselves on maintaining that SLA so that every request is analyzed for potential risk, leaving no loopholes for attackers to abuse.
How do we keep our system at five nines?
Quite simply, by incorporating a learning system: the debrief. This approach has helped us design and improve our software architecture and processes.
We debrief A LOT, on any internal item we consider an SLO breach (usually well before the actual SLA is breached or an outage affects our customers). We have done it so often that it has become a habit: teams debrief automatically after every incident, and our system improves as a result.
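To put "five nines" in perspective, here is a minimal sketch of the downtime such a target allows. It assumes "five nines" means 99.999% availability. The measurement windows (a 365-day year and a 30-day month) and the function name are illustrative assumptions, not details of our actual SLA.

```python
# Sketch: downtime allowed by an availability target.
# Assumption: "five nines" = 99.999% availability; the windows are illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 (ignores leap years)
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def downtime_budget_minutes(availability: float, period_minutes: float) -> float:
    """Return how many minutes of downtime the target allows in the period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return period_minutes * (1.0 - availability)


if __name__ == "__main__":
    five_nines = 0.99999
    per_year = downtime_budget_minutes(five_nines, MINUTES_PER_YEAR)
    per_month = downtime_budget_minutes(five_nines, MINUTES_PER_30_DAYS)
    print(f"Per year: {per_year:.2f} minutes")
    print(f"Per 30 days: {per_month * 60:.0f} seconds")
```

Under these assumptions the budget is roughly 5.26 minutes a year, or about 26 seconds per 30 days, which is why treating an internal SLO breach as debrief-worthy, long before the SLA itself is at risk, makes sense.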
We even got an added benefit from the process: the learnings are documented, so new employees can see how and why we started doing things a certain way.
Now, debriefs can be a source of great value, but they can also wreak havoc. The main ingredients for getting the desired outcome are communication and trust.
In our case, the goal is simple - ask a few questions, usually the same ones, and see if we can do better next time.
In the words of W. Edwards Deming:
“A bad system will beat a good person every time.”
It doesn’t matter who did it: if one person broke the system today, another might tomorrow. What caused the incident is the fact that we didn’t have the right measures, controls, or protections in place.
What do we ask:
- What happened? (timeline)
- What did we do?
- How could we have identified the issue in under 5 minutes? 30 seconds?
- How could we have fixed it in under 10 minutes? 1 minute? automatically?
- What do we need to do so we will be able to answer yes to the above questions?
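The detection and fix targets in the questions above can be checked mechanically against an incident timeline. The following is a minimal sketch, not our actual tooling: the `IncidentTimeline` class, its field names, and the choice to measure fix time from the moment of detection are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the three timestamps a debrief needs."""
    started: datetime
    detected: datetime
    mitigated: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_fix(self) -> timedelta:
        # Assumption: fix time is measured from detection, not from the start.
        return self.mitigated - self.detected


# Targets taken from the debrief questions.
DETECT_TARGETS = [timedelta(minutes=5), timedelta(seconds=30)]
FIX_TARGETS = [timedelta(minutes=10), timedelta(minutes=1)]


def targets_met(actual: timedelta, targets: list[timedelta]) -> dict[str, bool]:
    """Map each target (as text) to whether the actual duration was under it."""
    return {str(target): actual < target for target in targets}


if __name__ == "__main__":
    incident = IncidentTimeline(
        started=datetime(2024, 1, 1, 10, 0, 0),
        detected=datetime(2024, 1, 1, 10, 3, 0),
        mitigated=datetime(2024, 1, 1, 10, 15, 0),
    )
    print("detect:", targets_met(incident.time_to_detect, DETECT_TARGETS))
    print("fix:", targets_met(incident.time_to_fix, FIX_TARGETS))
```

In this made-up example the incident was detected in 3 minutes (under the 5-minute target, over the 30-second one) and fixed in 12 minutes (over both fix targets). Every missed target points to something the last question should address.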
What do we avoid:
- Focusing on why someone did something
Sticking to these questions every time, across many debriefs, instilled a healthy culture of positive discussion with minimal backfire.
The key here is to be consistent in how you approach it. As with any process change, start every debrief by stating its purpose and how the meeting will be conducted. Sticking to a rigorous, repeatable, and simple process creates confidence in the people going through it, as well as in the people who consume the results (i.e., higher management tiers and customers).
Keeping a consistent approach isn’t easy. Sometimes you debrief a severe incident that had a significant impact on your system and customers, or an event caused by pure recklessness. The thing to keep in mind is the goal: today it happened for one reason, tomorrow it can happen for another. How do we improve?
Over time, I found I was no longer needed in the debriefs and only read the summaries. After the first few, I did spot checks with team members on the tone, the questions raised, and the general conduct of the debriefs. This was mostly to validate that the guidance had stuck. I was happy to see the team kept the same mindset.
Several months in, the process has taken hold. It is now part of our R&D culture (a topic to discuss on its own).
What we document in each debrief:
- Short description of what happened
- Dates, and hours if relevant, of what happened
- Overall impact of the incident
- Root cause
  - Make sure we understand what caused it.
  - Sometimes this is complicated, so we open an action item (AI) to keep investigating.
- Every action item has a prioritized ticket assigned to someone, so we can review and validate that suggested improvements are being incorporated into the system, code, and processes.
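The outline above maps naturally onto a small data structure, which makes it easy to check that no action item is left without a ticket, an owner, or a priority. This is only a sketch under assumptions: the class names, field names, and the ticket ID format are hypothetical and do not describe our real tracker.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionItem:
    """A follow-up from a debrief; all field names are hypothetical."""
    description: str
    ticket: Optional[str] = None    # e.g. an issue-tracker ID
    owner: Optional[str] = None     # the person the ticket is assigned to
    priority: Optional[str] = None  # e.g. "P1"


@dataclass
class DebriefSummary:
    """Mirrors the outline: description, timeline, impact, root cause, action items."""
    description: str
    timeline: List[str]
    impact: str
    root_cause: str
    action_items: List[ActionItem] = field(default_factory=list)

    def untracked_action_items(self) -> List[ActionItem]:
        """Return action items missing a ticket, an owner, or a priority."""
        return [
            item for item in self.action_items
            if not (item.ticket and item.owner and item.priority)
        ]


if __name__ == "__main__":
    summary = DebriefSummary(
        description="Elevated latency on the request path",
        timeline=["10:00 latency rises", "10:03 alert fires", "10:15 mitigated"],
        impact="Internal SLO breached; no customer-facing outage",
        root_cause="Still under investigation",
        action_items=[
            ActionItem("Keep investigating the root cause", "OPS-123", "dana", "P1"),
            ActionItem("Add an alert on queue depth"),  # no ticket or owner yet
        ],
    )
    for item in summary.untracked_action_items():
        print("Needs a ticket, owner, and priority:", item.description)
```

A check like this could run when the summary is published, so the review of action items starts from a complete list.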
What is your experience with such a process? Do you have a different approach?