This is the third post in a three-part series on High Severity Incident (SEV) Management Programs. Check out part 1, How To Establish a SEV Management Program, and part 2, Understanding The Role Of The Incident Manager On-Call (IMOC).
Technical Leads On-Call (TLOCs) are technical experts from different service areas. They’re charged with diagnosing, mitigating, and resolving SEVs as quickly and safely as possible. They aren’t burdened with keeping engineers calm or keeping management in the loop; that’s the IMOC’s job. Instead, a TLOC settles in the trenches and stays laser-focused on technical problem solving, calling up to the IMOC only when necessary, either for help or to give a status update.
Other engineers respect the TLOC’s need to focus, but are ready to jump in and help when called on—the TLOC works heroically, but not alone!
After a SEV, the TLOCs work with their service teams to determine its root cause and create action items—for example, fixing a bug or deprecating some legacy system. After any fixes, the TLOCs lead chaos experiments to ensure the SEV doesn’t recur. These experiments are like integration tests, but for your entire application stack.
Over time, this post-SEV practice improves MTBF (mean time between failures) and MTTP (mean time to prevention).
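To make a metric like MTBF concrete, here is a minimal Python sketch. It assumes you keep a record of each SEV as a `(started, resolved)` pair of timestamps, and it treats MTBF as the average gap between one SEV being resolved and the next one starting. The function name and the record format are illustrative assumptions, not part of any particular tool, and your team may define the metric differently.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(sevs):
    """Average gap between the end of one SEV and the start of the next.

    `sevs` is a list of (started, resolved) datetime pairs. This sketch
    assumes SEVs do not overlap. Returns None when there are fewer than
    two SEVs, because no gap exists to measure.
    """
    ordered = sorted(sevs)  # sort by start time
    gaps = [
        next_start - prev_resolved
        for (_, prev_resolved), (next_start, _) in zip(ordered, ordered[1:])
    ]
    if not gaps:
        return None
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical SEV history, for illustration only.
history = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 1, 0)),
    (datetime(2024, 1, 11, 1, 0), datetime(2024, 1, 11, 2, 0)),
    (datetime(2024, 1, 31, 2, 0), datetime(2024, 1, 31, 3, 0)),
]
print(mean_time_between_failures(history))
```

Tracking this number across quarters is one way to check whether your post-SEV action items are actually paying off.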
TLOCs take turns being on-call, of course. If your engineering team is small (e.g. 5 engineers), you’ll create a single TLOC rotation that covers all service areas. If your engineering team is larger (e.g. 50 engineers), you’ll create one TLOC rotation for each service area. How those service areas break down depends on the size of your team.
Suppose you have 10 engineers. That’s enough for two rotations, given that an ideal rotation has five TLOCs. (Any more, and no TLOC will be on-call often enough to stay sharp; any fewer, and the TLOCs may burn out.) With two rotations, you need to break down your services into two buckets. For example:
TLOC Rotation 1 - Infrastructure Engineering Services: Responsible for internal services such as MySQL, Memcache, Amazon S3, Kafka, Monitoring, and Self-Healing Software.
TLOC Rotation 2 - Product Engineering Services: Responsible for customer-facing services such as UI, Billing, Web Apps, Desktop Apps, and Mobile Apps.
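If you want to encode these buckets so that a SEV can be routed to the right rotation, a small lookup is enough. The sketch below is one possible shape, written in Python. The rotation keys and the `rotation_for` helper are hypothetical names chosen for illustration; the service names come from the two buckets above.

```python
# Hypothetical mapping from rotation name to the services it owns.
ROTATIONS = {
    "infrastructure": {
        "MySQL", "Memcache", "Amazon S3", "Kafka",
        "Monitoring", "Self-Healing Software",
    },
    "product": {
        "UI", "Billing", "Web Apps", "Desktop Apps", "Mobile Apps",
    },
}


def rotation_for(service):
    """Return the name of the rotation that owns `service`."""
    for rotation, services in ROTATIONS.items():
        if service in services:
            return rotation
    raise KeyError(f"No TLOC rotation owns service: {service!r}")


def overlapping_services(rotations):
    """Return services claimed by more than one rotation (ideally none)."""
    seen, overlaps = set(), set()
    for services in rotations.values():
        overlaps |= seen & services
        seen |= services
    return overlaps


print(rotation_for("Kafka"))
print(overlapping_services(ROTATIONS))
```

Checking for overlaps is worth the few extra lines: a service owned by two rotations tends to be a service that nobody answers for during a SEV.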
During any given week, each rotation designates a Primary and a Secondary TLOC. At any given moment, however, each rotation has only one acting TLOC. Letting a single engineer take charge keeps everything moving forward, which improves your mean time to diagnosis (MTTD) and mean time to resolution (MTTR).
Your two rotations might look like this:
Rotation | Week 1 | Week 2 | Week 3 | Week 4 | Week 5
---|---|---|---|---|---
Infra Primary | Prima | Sylvain | Atul | Diane | Eric
Infra Secondary | Sylvain | Atul | Diane | Eric | Prima
Product Primary | Christophe | Gillian | Hank | Isabel | Juan
Product Secondary | Gillian | Hank | Isabel | Juan | Christophe
Notice that each TLOC serves first as Secondary, then as Primary the week after. This lets a likely-rusty TLOC warm up as they return to duty. The TLOCs should meet weekly so the most recent on-calls can share lessons learned and hand off action items to the next on-calls.
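The Secondary-then-Primary pattern is easy to generate rather than maintain by hand. Here is a minimal Python sketch, assuming each rotation is just an ordered list of names; `build_schedule` is a hypothetical helper, not a feature of any paging tool.

```python
def build_schedule(tlocs, weeks):
    """Return a list of (primary, secondary) pairs, one per week.

    Each week's Secondary is the next person in the list, so whoever
    is Secondary this week becomes Primary the following week.
    """
    if len(tlocs) < 2:
        raise ValueError("A rotation needs at least two TLOCs")
    n = len(tlocs)
    return [(tlocs[w % n], tlocs[(w + 1) % n]) for w in range(weeks)]


infra = ["Prima", "Sylvain", "Atul", "Diane", "Eric"]
for week, (primary, secondary) in enumerate(build_schedule(infra, 5), start=1):
    print(f"Week {week}: Primary: {primary}, Secondary: {secondary}")
```

Run with the Infra names, this reproduces the Infra rows of the schedule: Prima is Primary in week 1 with Sylvain as Secondary, and Sylvain takes over as Primary in week 2.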
As your engineering team grows, you’ll add more 5-person rotations and redefine your service buckets in whatever way makes sense for your company. For example, if your Mobile App is more complex and less stable than other parts of your stack, it may deserve its own TLOC rotation as your team grows to 15. A different company may give their Web App its own rotation.
Since TLOCs are solely responsible for driving technical resolution of SEVs, new TLOCs must receive training before their first on-call rotation. One or more experienced TLOCs should hold a one-hour, face-to-face training session, covering:
After training, add each new TLOC to the pager rotation for their service area—and test that they actually receive pages. Also test that pages roll over to Secondary TLOCs when the Primary doesn't answer within one minute.
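The rollover rule is simple enough to write down and test on its own. The Python sketch below models only the decision, under the assumption that the page moves to the Secondary once the Primary has gone 60 seconds without acknowledging. The constant and function names are hypothetical, and your paging tool will have its own way to configure this.

```python
# One minute, per the rollover rule described above.
ESCALATION_TIMEOUT_SECONDS = 60


def current_page_owner(primary, secondary, seconds_since_page, primary_acked):
    """Return who should be holding the page right now.

    The Primary keeps the page if they acknowledged it, or if the
    timeout has not yet elapsed. Otherwise it rolls over to the Secondary.
    """
    if primary_acked or seconds_since_page < ESCALATION_TIMEOUT_SECONDS:
        return primary
    return secondary


# Primary has not acknowledged after 75 seconds, so the page rolls over.
print(current_page_owner("Prima", "Sylvain", 75, primary_acked=False))
```

A check like this does not replace a real end-to-end test page, but it gives you a written definition of the expected behavior to compare against what the pager actually does.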
Finally, give each TLOC full access to any monitoring, reliability, networking, and performance tools and dashboards.
The Technical Lead On-Call (TLOC) is a technical expert who diagnoses and resolves high severity incidents (SEVs) quickly but safely. This post has shown you how to think about the TLOC role and establish TLOC rotations at your company. If you want to become a TLOC at your company, just ask your Engineering Manager—it’s a fantastic opportunity for any engineer. If you’re already a TLOC, share your war stories with us in the comments!