Production is down.
Fix it.
Services are degraded. You have a terminal and a mission. IncidentLab puts you inside realistic production incidents — because real skill isn't studied, it's built one broken system at a time.
api-gateway pods crashing at startup
Auth and API services returning 503. No recent deployments.
{"level":"ERROR","msg":"DB connection failed"}
Error: lookup postgres on 10.96.0.10:53: no such host
panic: failed to init: database unreachable
exit status 2
The experience
The workflow of a real incident, without the real consequences.
Each lab runs on isolated infrastructure. No shared state. No leakage between sessions.
Incident catalog
Every lab is a real failure mode.
Sourced from actual production incidents. Each scenario has a specific root cause, a measurable recovery state, and automatic validation.
Failing Health Checks
Kubernetes marks pods as unhealthy and kills them before they can serve traffic.
Broken Nginx Rollout
A config change was pushed and now nginx won't start. The rollback attempt also failed.
Crashloop in Production
The api-gateway pods keep restarting. Logs show a missing dependency. No recent deployments.
Database Under Siege
Postgres CPU is at 100%. Write operations are queuing. The application is degraded.
The Haunted Load Balancer
Nginx returns 502 for 30% of requests, but only in us-east-1. Health checks are passing.
The Phantom Latency
P99 API latency spiked to 4s without any apparent cause. Database metrics look clean.
The practice
Learn to investigate.
Not memorize.
Most training gives you answers. IncidentLab gives you a broken system. The difference matters — when production fails, you won't have a textbook. You'll have logs, metrics, and a terminal.
Read the mission
Incident summary, severity, affected services, and last-known-good state.
Inspect the system
kubectl, curl, journalctl, ps — real commands against a live broken environment.
Form a hypothesis
Work out what the evidence points to. Missing service? Bad config? Race condition?
Apply and verify
Make your change. Watch the system respond. Run validation to confirm recovery.
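To make the loop above concrete, here is a minimal Python sketch of the "form a hypothesis" and "verify" steps, using the sample crash log from the api-gateway incident. It is an illustration only, not IncidentLab's actual validator: the helper names `find_unresolved_host` and `wait_for_recovery` are hypothetical, and the health check is assumed to be any callable that returns True once the service has recovered.

```python
import re
import time
from typing import Callable, Iterable, Optional

# Matches the resolver error format shown in the sample incident log,
# e.g. "lookup postgres on 10.96.0.10:53: no such host".
DNS_FAILURE = re.compile(r"lookup (?P<host>\S+) on (?P<resolver>\S+): no such host")


def find_unresolved_host(log_lines: Iterable[str]) -> Optional[str]:
    """Return the first hostname the logs report as unresolvable, if any."""
    for line in log_lines:
        match = DNS_FAILURE.search(line)
        if match:
            return match.group("host")
    return None


def wait_for_recovery(
    check: Callable[[], bool], attempts: int = 5, delay: float = 1.0
) -> bool:
    """Poll a health check (hypothetical) until it passes or attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        # Sleep between attempts, but not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    sample_log = [
        '{"level":"ERROR","msg":"DB connection failed"}',
        "Error: lookup postgres on 10.96.0.10:53: no such host",
        "panic: failed to init: database unreachable",
    ]
    # Hypothesis: the gateway cannot resolve the name of its database service.
    print(find_unresolved_host(sample_log))
```

In a lab you would gather the same evidence by hand with kubectl and curl. The point of the sketch is the shape of the loop: extract a concrete claim from the evidence, change one thing, then re-check until the system reports healthy.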
"Train your response under realistic conditions. Build the instincts that hold when systems actually fail."
Who it's for
Built for engineers at every stage.
You know the theory. Now practice it.
You've read the docs, watched the talks, and built side projects. But troubleshooting a real system under pressure is different. IncidentLab gives you controlled exposure to production-grade failure modes — before your pager goes off.
Instincts only come from reps.
You've handled incidents before. You know what it feels like when a system misbehaves. IncidentLab gives you more reps — unusual failure modes, rare edge cases, and scenarios specifically designed to challenge experienced engineers.
Train before production makes you.
Join engineers building real incident-response skills through hands-on practice in live terminal environments. Not theory. Not videos. The actual work.
No credit card. No commitments. Priority access for early signups.