Production is broken. Alerts are firing. Someone important is asking what's going on.

This is not the time to panic. This is the time to be methodical.

Here's the approach I use when debugging production issues — the same process whether it's a 500 error at 2 AM or a subtle data corruption discovered during an audit.

The first 5 minutes

Before you touch anything, gather context.

What changed? Check recent deployments, config changes, infrastructure updates. Most production issues correlate with recent changes. git log --oneline -10 and your deployment history are your friends.

What's the blast radius? Is this affecting all users, specific regions, certain account types? The scope tells you where to look and how urgent this really is.

What do the metrics say? CPU, memory, error rates, latency. Don't guess — look at the data. A CPU spike tells a different story than a memory leak.
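The three questions above can often be answered from the shell in under a minute. A sketch, assuming you can dump deploy history to a timestamped text file (the format here is invented):

```shell
# Line up recent changes against the incident window.
# The deploy-history format below is hypothetical; substitute your own.
printf '%s\n' \
  '15:40 v2.3.0 api' \
  '16:00 v2.3.1 api' \
  '16:03 config-change redis' \
> deploys.txt

# Alert fired at 16:05: what shipped in the 30 minutes before it?
awk '$1 >= "15:35" && $1 <= "16:05"' deploys.txt
```

Even this crude filter narrows "what changed?" from everything to a handful of suspects.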

The debugging loop

Once you have context, enter the loop:

1. Form hypothesis
2. Gather evidence
3. Confirm or reject
4. Repeat until root cause

This sounds obvious, but under pressure, people skip steps. They jump to solutions before understanding the problem. Don't be that person.

Form hypothesis

Based on your context, what could be causing this? Be specific. "The database is slow" is not a hypothesis. "The new query on the orders endpoint is doing a full table scan" is a hypothesis.

Good hypotheses are:

  • Falsifiable — you can prove them wrong
  • Specific — they point to a particular component
  • Connected to evidence — based on what you observed, not random guessing

Gather evidence

Now prove or disprove your hypothesis. The tools depend on your stack, but the pattern is universal:

Logs — grep for error patterns, trace request IDs, look for the timestamp when things broke

grep -i error /var/log/app.log | tail -100   # the 100 most recent error lines

Metrics — correlate with the incident time window. What spiked? What flatlined?

Traces — follow a failing request through your system. Where does it slow down or error?

The database — running queries, connection pools, explain plans

SELECT * FROM pg_stat_activity WHERE state != 'idle';  -- what's running right now?
EXPLAIN ANALYZE SELECT ...;  -- actually runs the query and times each plan node

Confirm or reject

This is where discipline matters. If your hypothesis was wrong, don't cling to it. Form a new one based on what you learned.

If your hypothesis was right, dig deeper. "The query is slow" isn't enough. Why is it slow? Missing index? Lock contention? Bad plan because of stale statistics?

Common root causes

After debugging enough incidents, patterns emerge. Here's what I see most often:

1. The deploy that should have been fine

Something changed. It might be code, config, or dependencies. Rollback first, investigate second. You can always redeploy once you understand the issue.

2. Resource exhaustion

Connection pools, file descriptors, memory limits. Systems have boundaries, and hitting them causes cascading failures. Check your limits, check your current usage.
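A few one-liners cover the common cases on a Linux box; the `/proc` paths are Linux-specific and the output here is only illustrative:

```shell
# Compare resource limits against current usage (Linux; sketch)
ulimit -n                               # per-process open-file limit
cat /proc/sys/fs/file-nr 2>/dev/null    # system-wide open files: used, free, max
ls /proc/self/fd 2>/dev/null | wc -l    # descriptors held by this shell itself
```

The pattern generalizes: for every limit, find the matching "current usage" number and watch the gap between them.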

3. External dependencies

Third-party APIs, DNS, CDNs. Your code might be perfect, but if Stripe is having a bad day, so are you. Check status pages, check your outbound requests.

4. Data issues

Bad data in, bad behavior out. Null values where you expected objects. Missing records. Schema drift between services. Validate your assumptions about data.
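A quick grep can quantify how bad the data actually is. A sketch over JSON-lines records with a hypothetical `email` field:

```shell
# Count records where a field you rely on is null (field name is hypothetical)
printf '%s\n' \
  '{"id":1,"email":"a@example.com"}' \
  '{"id":2,"email":null}' \
  '{"id":3,"email":"c@example.com"}' \
> records.jsonl

grep -c '"email":null' records.jsonl   # prints 1
```

If that count is nonzero, you've confirmed a broken assumption before touching any application code.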

5. The thundering herd

Cache expires, everyone hits the database at once, the database falls over. Cron jobs that all fire at midnight. Retry storms after an outage. Coordination failures.
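The standard countermeasure is jitter: spread retries (and cron schedules) out so they can't synchronize. A minimal bash sketch; the numbers are illustrative:

```shell
# Exponential backoff with jitter, so retries spread out instead of stampeding.
backoff() {
  attempt=$1
  base=$(( 2 ** attempt ))        # 2, 4, 8, ... seconds
  jitter=$(( RANDOM % base ))     # random spread, up to one full base interval
  echo $(( base + jitter ))
}

backoff 3   # prints a value between 8 and 15
```

Each client waiting a slightly different amount of time is what turns a retry storm back into a trickle.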

Tools I reach for

General:

  • htop / top — what's using resources
  • netstat / ss — network connections
  • strace — what syscalls is this process making
  • lsof — what files/sockets are open

Logs:

  • journalctl — systemd logs with filtering
  • grep, awk, jq — parsing log output
  • Your log aggregator (Datadog, CloudWatch, etc.)
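When logs are JSON lines, `jq` beats eyeballing raw text. A sketch with a hypothetical log shape:

```shell
# Pull just the error lines out of JSON logs (log shape is hypothetical)
printf '%s\n' \
  '{"ts":"16:05:12","level":"error","msg":"connection refused to redis"}' \
  '{"ts":"16:05:13","level":"info","msg":"request ok"}' \
| jq -r 'select(.level == "error") | "\(.ts) \(.msg)"'
```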

Database (Postgres):

  • pg_stat_activity — current queries
  • pg_stat_statements — query performance history
  • EXPLAIN ANALYZE — understand query plans

Application:

  • Request tracing (your APM tool)
  • Feature flags (check what's enabled)
  • Error tracking (Sentry, Bugsnag, etc.)

What to write down

While debugging, keep a running log:

16:05 - Alert fired: 500 errors spike on /api/orders
16:06 - Checked recent deploys: v2.3.1 deployed at 16:00
16:08 - Error logs show: "connection refused to redis"
16:10 - Redis status: OOM killer triggered at 16:02
16:12 - Root cause: Redis memory limit too low for new caching feature
16:15 - Fix deployed: increased Redis memory, restarted
16:20 - Error rate back to normal, monitoring

This log is your incident report draft. It's also invaluable when someone asks "what happened?" while you're still fixing things.
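A tiny helper makes the habit effortless; the file name and format here are arbitrary:

```shell
# Timestamped note-taking with zero friction
note() { echo "$(date +%H:%M) - $*" >> incident-notes.txt; }

note "Alert fired: 500 errors spike on /api/orders"
note "Checked recent deploys: v2.3.1 deployed at 16:00"
cat incident-notes.txt
```

The timestamps come for free, which is exactly the part people forget to write down under pressure.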

After the fire is out

Production is stable. Now do the actual work:

Write the postmortem. What happened, why, and how to prevent it. Be blameless and specific. "Human error" is not a root cause. "Lack of validation on the memory config field allowed deployment of an invalid value" is a root cause.

Fix the systemic issue. If you can push a bad config to production, the config system needs guardrails. If the database can be overwhelmed by a single query, you need query timeouts. The incident revealed a gap — close it.
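As a toy example of such a guardrail, here's a pre-deploy check that would have refused the invalid memory value from the postmortem example above; the field name and limits are made up:

```shell
# Guardrail sketch: reject an invalid memory config before it ships.
validate_memory_mb() {
  case $1 in
    ''|*[!0-9]*) echo "invalid memory value: '$1'" >&2; return 1 ;;
  esac
  if [ "$1" -lt 256 ]; then
    echo "memory below 256 MB floor: $1" >&2; return 1
  fi
}

validate_memory_mb 512 && echo "config ok"            # passes
validate_memory_mb banana || echo "config rejected"   # fails loudly
```

Twenty lines of validation in the deploy pipeline is cheaper than one 2 AM incident.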

Update your runbooks. The next person who sees this alert should have better starting context than you did.

The mindset

Debugging production is stressful, but it's also where you level up fastest. Each incident teaches you something about your system that no amount of code review would reveal.

Stay calm. Be methodical. Take notes. Fix it for real, not just for now.

And when it's over, get some sleep. You earned it.


Need help debugging a production issue or building systems that are easier to debug? Let's talk.
