Sample Bug Triage Report

This is a sample incident investigation report for a fictional company. It demonstrates my approach to production debugging: systematic investigation, clear documentation, and actionable prevention measures.

Incident Summary

Severity: P1 (customer-facing outage)
Duration: 47 minutes (14:23 - 15:10 UTC)
Impact: ~2,400 failed API requests, 340 affected users
Root Cause: Connection pool exhaustion from unreleased database connections

Timeline

14:23 UTC — Alert: API error rate exceeds 5%

PagerDuty triggers. On-call engineer acknowledges within 2 minutes.

14:25 UTC — Initial triage

Error logs show ETIMEDOUT connecting to PostgreSQL. Database metrics show normal CPU/memory. Connection count at limit (100/100).

14:32 UTC — Hypothesis: connection leak

Connections had been accumulating over the past 6 hours. Each pod was holding 8-12 connections while queries per second stayed constant, so connections were being acquired but not returned to the pool.

14:41 UTC — Root cause identified

Deployment at 08:15 UTC introduced a new endpoint that queries the database but doesn't release the connection on error paths. Error handling returns early without calling connection.release().

14:48 UTC — Mitigation deployed

Rolled back to previous deployment. Connection count begins dropping.

15:10 UTC — Service recovered

All pods healthy. Error rate back to baseline (<0.1%). Incident resolved.

Root Cause Analysis

The Bug

A new /api/reports/generate endpoint was added in commit a3f7c2d. The endpoint queries multiple tables to generate a report. The code acquired a database connection from the pool but only released it in the success path:

// ❌ Buggy code
async function generateReport(userId) {
  const connection = await pool.getConnection();

  // If this query throws, the function exits without releasing the connection.
  const user = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
  if (!user) {
    return { error: 'User not found' }; // Connection leaked: early return skips release()
  }

  // Same problem here: a thrown error skips the release() call below.
  const data = await connection.query('SELECT * FROM reports WHERE user_id = ?', [userId]);
  connection.release(); // Only reached on success
  return { data };
}

Why It Wasn't Caught

The Fix

// ✅ Fixed code
async function generateReport(userId) {
  const connection = await pool.getConnection();

  try {
    const user = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
    // Assumes query() resolves to a falsy value when no row matches. If the driver
    // returns an array of rows, check for an empty array instead.
    if (!user) {
      return { error: 'User not found' };
    }

    const data = await connection.query('SELECT * FROM reports WHERE user_id = ?', [userId]);
    return { data };
  } finally {
    connection.release(); // Runs on success, early return, and thrown errors
  }
}
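
One way to make the try/finally pattern harder to forget is to centralize it in a helper, so endpoint code never calls release() directly. The sketch below is illustrative and not taken from the incident codebase: withConnection is a hypothetical name, the pool is passed in as a parameter for testability, and the only assumptions are that the pool exposes getConnection() and that connections expose query() and release(), as in the snippets above.

```javascript
// Hypothetical helper: acquires a connection, runs `work`, and always releases.
// Assumes pool.getConnection() and connection.release() as used in the report.
async function withConnection(pool, work) {
  const connection = await pool.getConnection();
  try {
    // `await` here so the finally block runs only after `work` has settled.
    return await work(connection);
  } finally {
    connection.release();
  }
}

// generateReport expressed on top of the helper (illustrative only).
async function generateReport(pool, userId) {
  return withConnection(pool, async (connection) => {
    const user = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
    if (!user) {
      return { error: 'User not found' };
    }
    const data = await connection.query('SELECT * FROM reports WHERE user_id = ?', [userId]);
    return { data };
  });
}
```

With this shape, the early return and any thrown error both pass through the single finally block in withConnection, so a new endpoint cannot skip the release by accident.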

Prevention Measures

1. Lint rule for connection handling

Added ESLint rule requiring try/finally pattern for all database operations. Fails CI if connection.release() is not in a finally block.
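
A rule of this kind could look roughly like the sketch below. It is a simplified illustration, not the rule that was deployed: the names releaseInFinallyRule and isInsideFinally are hypothetical, and it flags any .release() call outside a finally block without checking what object the call is made on. It relies on the standard ESLint custom-rule shape (a meta object plus a create(context) function returning AST visitors) and on the parent links ESLint attaches to AST nodes.

```javascript
// Walks up the AST and reports whether `node` sits inside the `finally`
// block of some enclosing try statement.
function isInsideFinally(node) {
  for (let current = node; current.parent; current = current.parent) {
    const parent = current.parent;
    if (parent.type === 'TryStatement' && parent.finalizer === current) {
      return true;
    }
  }
  return false;
}

// Hypothetical rule object; in a real plugin this would be exported from the
// plugin's rules map.
const releaseInFinallyRule = {
  meta: {
    type: 'problem',
    docs: { description: 'require .release() calls to be inside a finally block' },
    schema: [],
    messages: {
      notInFinally: 'Call release() inside a finally block so it runs on every code path.',
    },
  },
  create(context) {
    return {
      CallExpression(node) {
        const callee = node.callee;
        const isReleaseCall =
          callee.type === 'MemberExpression' &&
          !callee.computed &&
          callee.property.type === 'Identifier' &&
          callee.property.name === 'release';
        if (isReleaseCall && !isInsideFinally(node)) {
          context.report({ node, messageId: 'notInFinally' });
        }
      },
    };
  },
};
```

Note the limitation: a check like this catches a release() in the wrong place, but not a function that acquires a connection and never calls release() at all. Catching that case would need additional analysis.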

2. Connection pool monitoring

Added Datadog dashboard tracking active connections per pod. Alert if any pod holds >50% of max connections for >5 minutes.
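
To make the alert condition concrete, the sketch below expresses "above 50% of max connections for more than 5 minutes" as a plain function over time-ordered samples for one pod. It only illustrates the intended semantics and is not the Datadog monitor definition. The function name, the sample shape ({ timestampMs, activeConnections }), and the maxConnections parameter are all assumptions made for the example.

```javascript
// Returns true if the pod stayed above `ratio` of maxConnections for a
// continuous stretch longer than `windowMs`. Samples must be in time order.
function shouldAlert(samples, maxConnections, { ratio = 0.5, windowMs = 5 * 60 * 1000 } = {}) {
  const threshold = maxConnections * ratio;
  let breachStart = null; // timestamp of the first sample in the current breach

  for (const { timestampMs, activeConnections } of samples) {
    if (activeConnections > threshold) {
      if (breachStart === null) breachStart = timestampMs;
      if (timestampMs - breachStart > windowMs) return true;
    } else {
      breachStart = null; // dipped back under the threshold: restart the clock
    }
  }
  return false;
}
```

Requiring a sustained breach rather than a single high reading keeps short bursts of legitimate load from paging anyone, while a steady leak still crosses the window.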

3. Error path testing requirement

PR template now includes checkbox: "Error paths tested and connections verified released." Code review checklist updated.
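
The checkbox is easier to honor when there is a cheap way to verify it. One possible approach, sketched below, is a test double that counts connections currently checked out, so a test can assert that the count is zero after exercising an error path. The names createFakePool and leakedConnectionsAfter are hypothetical, and the fake mirrors only the getConnection(), query(), and release() calls used in this report.

```javascript
// Hypothetical test double: a pool that tracks how many connections are
// currently checked out. `queryImpl` decides what every query() call does.
function createFakePool(queryImpl) {
  const pool = {
    active: 0,
    async getConnection() {
      pool.active += 1;
      let released = false;
      return {
        query: queryImpl,
        release() {
          // Guard against double release skewing the count.
          if (!released) {
            released = true;
            pool.active -= 1;
          }
        },
      };
    },
  };
  return pool;
}

// Runs `fn(pool)` and returns how many connections are still checked out
// afterwards, whether `fn` resolved or rejected.
async function leakedConnectionsAfter(fn, queryImpl) {
  const pool = createFakePool(queryImpl);
  try {
    await fn(pool);
  } catch (err) {
    // Error paths are what the test exercises; ignore the error and inspect the pool.
  }
  return pool.active;
}
```

A test for an endpoint would then call leakedConnectionsAfter once with a query that finds no rows and once with a query that throws, and expect 0 in both cases. This assumes the handler accepts the pool as an argument rather than reading a module-level one.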

4. Connection timeout

Configured pool to automatically release connections held longer than 30 seconds. Provides safety net for future leaks.
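
The report does not say which pool setting was used, so the sketch below shows the idea at the application level instead: a wrapper that hands out a connection with a deadline and force-releases it if the caller has not released it in time. getConnectionWithDeadline and the onForcedRelease callback are hypothetical names, and the 30-second default mirrors the figure above.

```javascript
// Hypothetical wrapper: returns a connection handle that is released
// automatically if the caller holds it longer than maxHoldMs.
async function getConnectionWithDeadline(pool, maxHoldMs = 30_000, onForcedRelease = () => {}) {
  const connection = await pool.getConnection();
  let released = false;

  // Releases the underlying connection at most once; returns true if it did.
  const releaseOnce = () => {
    if (released) return false;
    released = true;
    clearTimeout(timer);
    connection.release();
    return true;
  };

  // Safety net: fires only if the caller never released the connection.
  const timer = setTimeout(() => {
    if (releaseOnce()) onForcedRelease();
  }, maxHoldMs);

  return {
    async query(...args) {
      if (released) throw new Error('Connection already released');
      return connection.query(...args);
    },
    release() {
      releaseOnce();
    },
  };
}
```

Wiring onForcedRelease to a log line or metric is worthwhile, because every forced release points at a leak that still needs a code fix. A production version would also need to decide what to do with a query still in flight when the deadline fires, for example destroying the connection instead of returning it to the pool.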

Lessons Learned


This report demonstrates my incident investigation methodology. I focus on systematic root cause analysis, clear timelines, and actionable prevention measures that reduce future incidents.