Limited access · opening in waves

Production is down.
Fix it.

Services are degraded. You have a terminal and a mission. IncidentLab puts you inside realistic production incidents — because real skill isn't studied, it's built one broken system at a time.

Real terminals · Realistic failures · Automated validation · Under-pressure drills
2,400+ engineers requested access
api-gateway · CrashLoopBackOff
P1 CRITICAL · 47:23
Mission Briefing

api-gateway pods crashing at startup

Auth and API services returning 503. No recent deployments.

Objectives

Pods Running (0/2)
Health check passing
Latency < 500ms

kubernetes · networking · hard
bash · production-cluster
$ kubectl logs api-gateway-7d4b9f-xkp2m --previous

{"level":"ERROR","msg":"DB connection failed"}

Error: lookup postgres on 10.96.0.10:53: no such host

panic: failed to init: database unreachable

exit status 2

$ 
2 pods failing | 47:23 elapsed
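
The key line in the log above is the failed DNS lookup. As a minimal sketch of how you might pull the failing hostname and resolver out of such a line while triaging, assuming the error text follows the `lookup <host> on <resolver>: no such host` shape shown in the transcript (the names `DnsFailure` and `parse_dns_failure` are hypothetical and not part of IncidentLab):

```python
import re
from typing import NamedTuple, Optional


class DnsFailure(NamedTuple):
    """Hypothetical record for one failed DNS lookup found in a log line."""
    host: str      # the name the application tried to resolve
    resolver: str  # the DNS server (address:port) that answered "no such host"


# Assumed log shape, taken from the transcript above.
_LOOKUP_RE = re.compile(r"lookup (?P<host>\S+) on (?P<resolver>\S+): no such host")


def parse_dns_failure(line: str) -> Optional[DnsFailure]:
    """Return the host and resolver from a lookup failure, or None if absent."""
    match = _LOOKUP_RE.search(line)
    if match is None:
        return None
    return DnsFailure(match.group("host"), match.group("resolver"))
```

Applied to the transcript's error line, this would point at the name `postgres`, which suggests checking whether anything by that name is resolvable from the pod.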

The experience

The workflow of a real incident,
without the real consequences.

01

Enter a broken environment

No setup. A broken system is already live — investigate it.

02

Trace symptoms, inspect systems

Run real tools against a live system. Trace logs, inspect state, narrow down the root cause.

03

Fix the underlying cause

Make your change. It takes effect immediately — in a real, isolated environment.

04

Validate the recovery

Automated checks confirm recovery. See your time-to-resolution and the commands that got you there.
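
To make the idea of automated validation concrete, here is a rough sketch of how recovery could be scored against the three objectives from the mission briefing (pods running, health check passing, latency under 500 ms). It is an illustration under stated assumptions, not IncidentLab's actual validator: `Snapshot`, `evaluate_objectives`, and `is_recovered` are hypothetical names, and the measurements are assumed to have been collected already.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time measurements from a lab environment."""
    pods_running: int    # pods currently in the Running state
    pods_desired: int    # pods the deployment is supposed to have
    health_status: int   # HTTP status code returned by the health endpoint
    latency_ms: float    # observed request latency in milliseconds


def evaluate_objectives(snap: Snapshot, latency_budget_ms: float = 500.0) -> Dict[str, bool]:
    """Check each objective independently so partial progress stays visible."""
    return {
        "pods_running": snap.pods_desired > 0 and snap.pods_running == snap.pods_desired,
        "health_check_passing": 200 <= snap.health_status < 300,
        "latency_under_budget": snap.latency_ms < latency_budget_ms,
    }


def is_recovered(snap: Snapshot) -> bool:
    """The incident counts as resolved only when every objective passes."""
    return all(evaluate_objectives(snap).values())
```

Scoring each objective separately, rather than returning a single pass/fail, mirrors the checklist in the briefing: you can see which objectives have flipped while the others are still failing.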

Each lab runs on isolated infrastructure. No shared state. No leakage between sessions.

Incident catalog

Every lab is a real failure mode.

Sourced from actual production incidents. Each scenario has a specific root cause, a measurable recovery state, and automated validation.

Easy · Medium · Hard · Expert

Easy · #01

Failing Health Checks

Kubernetes marks pods as unhealthy and kills them before they can serve traffic.

kubernetes · networking

Medium · #02

Broken Nginx Rollout

A config change was pushed and now nginx won't start. The rollback attempt also failed.

nginx · linux · ops

Hard · #03

Crashloop in Production

The api-gateway pods keep restarting. Logs show a missing dependency. No recent deployments.

kubernetes · debugging · networking

Hard · #04

Database Under Siege

Postgres CPU is at 100%. Write operations are queuing. The application is degraded.

postgresql · linux · performance

Expert · #05

The Haunted Load Balancer

Nginx returns 502 for 30% of requests, but only in us-east-1. Health checks are passing.

nginx · networking · linux

Expert · #06

The Phantom Latency

P99 API latency spiked to 4s without any apparent cause. Database metrics look clean.

profiling · linux · debugging

+24 more labs in development · vote for the next scenario →

The practice

Learn to investigate.
Not memorize.

Most training gives you answers. IncidentLab gives you a broken system. The difference matters — when production fails, you won't have a textbook. You'll have logs, metrics, and a terminal.

01

Read the mission

Incident summary, severity, affected services, and last-known-good state.

02

Inspect the system

kubectl, curl, journalctl, ps — real commands against a live broken environment.

03

Form a hypothesis

Interpret what the evidence points to. Missing service? Bad config? Race condition?

04

Apply and verify

Make your change. Watch the system respond. Run validation to confirm recovery.

"Train your response under realistic conditions. Build the instincts that hold when systems actually fail."

nginx-rollout-502 · Hard · 12:47 elapsed
DEGRADED
$ curl -I http://api.internal/health
HTTP/1.1 502 Bad Gateway
Server: nginx/1.24.0
Date: Thu, 15 Jan 2026 14:51:02 GMT
$ nginx -T 2>&1 | grep -A3 'upstream'
upstream backend {
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;
}
$ ss -tlnp | grep ':300'
# no output — nothing listening on 3000 or 3001
$ systemctl status app-server
● app-server.service - Application Server
     Loaded: loaded (/etc/systemd/...)
     Active: failed (Result: exit-code)
    Process: ExecStart=/usr/bin/node server.js (code=exited, status=1)
$ journalctl -u app-server -n 5 --no-pager
Jan 15 14:48:11 node[4729]: Error: Cannot find module './config'
Jan 15 14:48:11 node[4729]: Require stack:
Jan 15 14:48:11 systemd[1]: app-server.service: Failed.
$ 
502 · upstream offline | nginx / linux
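
The session above boils down to one comparison: which ports nginx expects its upstreams on, versus which ports actually have a listener. A small sketch of that reasoning, assuming upstream entries of the form `server <host>:<port>;` as in the transcript (the helper names `upstream_ports` and `offline_upstreams` are hypothetical, and the set of listening ports is assumed to come from a tool such as `ss`):

```python
import re
from typing import List, Set

# Matches lines like "server 127.0.0.1:3000;" inside an upstream block.
_SERVER_RE = re.compile(r"^\s*server\s+[\w.\-]+:(\d+)\s*;", re.MULTILINE)


def upstream_ports(nginx_config: str) -> List[int]:
    """Extract the ports nginx is configured to proxy to."""
    return [int(port) for port in _SERVER_RE.findall(nginx_config)]


def offline_upstreams(nginx_config: str, listening_ports: Set[int]) -> List[int]:
    """Return configured upstream ports that nothing is listening on."""
    return [port for port in upstream_ports(nginx_config) if port not in listening_ports]
```

If every configured upstream comes back as offline, as it would here, the proxy is healthy and the failure sits behind it, which is what leads to checking the application service next.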

Who it's for

Built for engineers at every stage.

Getting production reps

You know the theory. Now practice it.

You've read the docs, watched the talks, and built side projects. But troubleshooting a real system under pressure is different. IncidentLab gives you controlled exposure to production-grade failure modes — before your pager goes off.

Junior engineers moving from tutorials to real ops
Developers taking on infrastructure responsibilities
Anyone studying for SRE or DevOps roles
Engineers who've never debugged a live incident

Sharpening operational instincts

Instincts only come from reps.

You've handled incidents before. You know what it feels like when a system misbehaves. IncidentLab gives you more reps — unusual failure modes, rare edge cases, and scenarios specifically designed to challenge experienced engineers.

Senior and staff engineers on infra/platform teams
SREs who want structured practice outside on-call
DevOps engineers exploring unfamiliar systems
Engineering leads keeping their technical edge sharp

Access opening in waves · strong early demand

Train before production
makes you.

Join engineers building real incident-response skills through hands-on practice in live terminal environments. Not theory. Not videos. The actual work.

No credit card. No commitments. Priority access for early signups.

2,400+ access signups
30+ labs in development
6 incident categories