Production is down.
Fix it.

Services are degraded. You have a terminal and a mission. IncidentLab puts you inside realistic production incidents — because real skill isn't studied, it's built one broken system at a time.

Real terminals · Realistic failures · Automated validation · Under-pressure drills

Request Access · Explore the labs

2,400+ engineers requested access

api-gateway · CrashLoopBackOff

P1 CRITICAL · 47:23

Mission Briefing

api-gateway pods crashing at startup

Auth and API services returning 503. No recent deployments.

Objectives

Pods Running (0/2)

Health check passing

Latency < 500ms

Hints

kubernetes · networking · hard

bash · production-cluster

~ 47:23 elapsed

$ kubectl logs api-gateway-7d4b9f-xkp2m --previous

{"level":"ERROR","msg":"DB connection failed"}

Error: lookup postgres on 10.96.0.10:53: no such host

panic: failed to init: database unreachable

exit status 2
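
Read closely, the log above points at name resolution rather than at the database itself: the cluster DNS server at 10.96.0.10 was asked for `postgres` and reported no such host. The following is a minimal Python sketch of pulling those facts out of the error line, assuming Go-style resolver error text like the one shown. `DnsFailure` and `parse_dns_failure` are hypothetical names for illustration, not part of IncidentLab.

```python
import re
from typing import NamedTuple, Optional


class DnsFailure(NamedTuple):
    hostname: str  # name the application tried to resolve
    resolver: str  # DNS server that was queried
    port: int      # port on the DNS server


# Assumed pattern: "lookup <name> on <ipv4>:<port>: no such host"
_LOOKUP_RE = re.compile(
    r"lookup (?P<host>\S+) on (?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+): no such host"
)


def parse_dns_failure(line: str) -> Optional[DnsFailure]:
    """Return the failed lookup described by a log line, or None if it has none."""
    match = _LOOKUP_RE.search(line)
    if match is None:
        return None
    return DnsFailure(match["host"], match["ip"], int(match["port"]))


failure = parse_dns_failure("Error: lookup postgres on 10.96.0.10:53: no such host")
print(failure)
```

A result like this narrows the next question to why the name `postgres` does not resolve inside the cluster, which is where the investigation in the lab would continue.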

2 pods failing


The experience

Operating in real production,
without the real consequences.

Enter a system you don't fully understand yet

No setup. Something is already broken — investigate it.


Trace symptoms, inspect systems

Run real tools against a live system. Trace logs, inspect state, narrow the root cause.


Fix the underlying cause

Make your change. It takes effect immediately — in a real, isolated system.


Validate the recovery

Automated checks confirm recovery. See your time-to-resolution and the commands that got you there.
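
As an illustration of what a measurable recovery state can look like, the three objectives from the mission briefing (pods running, health check passing, latency under 500 ms) could be evaluated along these lines. This is a hedged Python sketch only: the `Snapshot` fields and the `evaluate` and `recovered` functions are assumptions made for the example, and IncidentLab's actual validation logic is not described on this page.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time view of the system under repair."""
    pods_ready: int
    pods_desired: int
    health_check_ok: bool
    latency_ms: float


def evaluate(s: Snapshot, latency_budget_ms: float = 500.0) -> Dict[str, bool]:
    """Map each objective to whether it currently passes."""
    return {
        "pods_running": s.pods_desired > 0 and s.pods_ready == s.pods_desired,
        "health_check": s.health_check_ok,
        "latency": s.latency_ms < latency_budget_ms,  # strict: must be under budget
    }


def recovered(s: Snapshot) -> bool:
    """The incident counts as resolved only when every objective passes."""
    return all(evaluate(s).values())


during_incident = Snapshot(pods_ready=0, pods_desired=2, health_check_ok=False, latency_ms=1200.0)
after_fix = Snapshot(pods_ready=2, pods_desired=2, health_check_ok=True, latency_ms=180.0)
print(evaluate(during_incident))
print(recovered(after_fix))
```

Requiring every objective to pass, rather than any one of them, keeps a partial fix (pods up but still slow, for example) from being counted as a recovery.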

Each lab runs on isolated infrastructure. No shared state. No leakage between sessions.

Incident catalog

Every lab is a real failure mode.

Sourced from actual production incidents. Each scenario has a specific root cause, a measurable recovery state, and automatic validation.

Easy · Medium · Hard · Expert

Easy · #01

Failing Health Checks

Kubernetes marks pods as unhealthy and kills them before they can serve traffic.

kubernetes · networking

Medium · #02

Broken Nginx Rollout

A config change was pushed and now nginx won't start. The rollback attempt also failed.

nginx · linux · ops

Hard · #03

Crashloop in Production

The api-gateway pods keep restarting. Logs show a missing dependency. No recent deployments.

kubernetes · debugging · networking

Hard · #04

Database Under Siege

Postgres CPU is at 100%. Write operations are queuing. The application is degraded.

postgresql · linux · performance

Expert · #05

The Haunted Load Balancer

Nginx returns 502 for 30% of requests, but only in us-east-1. Health checks are passing.

nginx · networking · linux

Expert · #06

The Phantom Latency

P99 API latency spiked to 4s without any apparent cause. Database metrics look clean.

profiling · linux · debugging

+24 more labs in development · vote for the next scenario

Who it's for

Built for engineers at every stage.

Getting production reps

You know the theory. Now be ready for it.

You've read the docs, watched the talks, and built side projects. But troubleshooting a real system under pressure is different. IncidentLab gives you controlled exposure to production-grade failure modes — before your pager goes off.

Junior engineers moving from tutorials to real ops

Developers taking on infrastructure responsibilities

Anyone learning to operate in real production

Engineers who've never had to fix things breaking in production

Sharpening operational instincts

Instincts only come from reps.

You've handled incidents before. You know what it feels like when a system misbehaves. IncidentLab gives you more reps — unusual failure modes, rare edge cases, and scenarios specifically designed to challenge experienced engineers.

Senior engineers and staff engineers in infra/platform teams

SREs who want to handle more than what on-call throws at them

DevOps engineers exploring unfamiliar systems

Engineering leads keeping their technical edge sharp

Production is on you.
Are you ready?

Join engineers who learn to handle real production failures through hands-on labs.
Not theory. Not videos. The actual work.

No credit card. No commitments. Priority access for early signups.

2,400+ access signups

30+ labs in development

6 incident categories
