Sample Bug Triage Report
This is a sample incident investigation report for a fictional company. It demonstrates my approach to production debugging: systematic investigation, clear documentation, and actionable prevention measures.
Incident Summary
| Field | Details |
| --- | --- |
| Severity | P1 — Customer-facing outage |
| Duration | 47 minutes (14:23–15:10 UTC) |
| Impact | ~2,400 failed API requests, 340 affected users |
| Root Cause | Connection pool exhaustion from unreleased database connections |
Timeline
- PagerDuty triggers; the on-call engineer acknowledges within 2 minutes.
- Error logs show ETIMEDOUT errors connecting to PostgreSQL. Database CPU and memory look normal, but the connection count is at its limit (100/100).
- Connections have been accumulating over the past 6 hours: each pod holds 8-12 connections while queries per second remained constant, so connections are not being returned to the pool.
- A deployment at 08:15 UTC introduced a new endpoint that queries the database but does not release its connection on error paths; the error handling returns early without calling connection.release().
- Rolled back to the previous deployment; connection count begins dropping.
- All pods healthy, error rate back to baseline (<0.1%). Incident resolved.
Root Cause Analysis
The Bug
A new /api/reports/generate endpoint was added in commit a3f7c2d. The endpoint queries multiple tables to generate a report. The code acquired a database connection from the pool but released it only on the success path:
```javascript
// ❌ Buggy code
async function generateReport(userId) {
  const connection = await pool.connect();
  const { rows } = await connection.query('SELECT * FROM users WHERE id = $1', [userId]);
  if (rows.length === 0) {
    return { error: 'User not found' }; // Connection leaked!
  }
  const { rows: data } = await connection.query('SELECT * FROM reports WHERE user_id = $1', [userId]);
  connection.release(); // Only reached on success
  return { data };
}
```
Why It Wasn't Caught
- Happy path testing: Unit tests only covered successful report generation
- Low error rate in staging: Test users always existed, so error path never triggered
- Gradual onset: Connection leak was slow (1-2 connections per minute), took 6 hours to exhaust pool
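A single error-path unit test would have surfaced this leak before it shipped. Below is a minimal sketch of what that test could look like, using an in-memory stand-in for the pool; FakePool and this inlined copy of the buggy handler are hypothetical illustrations, not the production code or test suite:

```javascript
// Hypothetical in-memory pool that counts checked-out connections,
// so a test can assert that everything acquired was released.
class FakePool {
  constructor() { this.checkedOut = 0; }
  async connect() {
    this.checkedOut += 1;
    const pool = this;
    return {
      async query(_sql, _params) { return { rows: [] }; }, // simulate "user not found"
      release() { pool.checkedOut -= 1; },
    };
  }
}

// Buggy handler under test: releases the connection only on the success path.
async function generateReport(pool, userId) {
  const connection = await pool.connect();
  const { rows } = await connection.query('SELECT * FROM users WHERE id = $1', [userId]);
  if (rows.length === 0) {
    return { error: 'User not found' }; // connection never released
  }
  connection.release();
  return { data: rows };
}

async function errorPathTest() {
  const pool = new FakePool();
  const result = await generateReport(pool, 42);
  console.log(result.error);    // "User not found"
  console.log(pool.checkedOut); // 1 -- the leak the happy-path tests missed
}
errorPathTest();
```

Asserting `pool.checkedOut === 0` after every handler test makes any new leak fail fast instead of surfacing six hours later in production.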
The Fix
```javascript
// ✅ Fixed code
async function generateReport(userId) {
  const connection = await pool.connect();
  try {
    const { rows } = await connection.query('SELECT * FROM users WHERE id = $1', [userId]);
    if (rows.length === 0) {
      return { error: 'User not found' };
    }
    const { rows: data } = await connection.query('SELECT * FROM reports WHERE user_id = $1', [userId]);
    return { data };
  } finally {
    connection.release(); // Always releases, even on error
  }
}
```
Prevention Measures
Added ESLint rule requiring try/finally pattern for all database operations. Fails CI if connection.release() is not in a finally block.
Added Datadog dashboard tracking active connections per pod. Alert if any pod holds >50% of max connections for >5 minutes.
PR template now includes checkbox: "Error paths tested and connections verified released." Code review checklist updated.
Configured the pool to automatically reclaim connections held longer than 30 seconds, providing a safety net for future leaks.
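Automatic hold-time reclamation is not a built-in feature of typical Node.js pool clients, so one way to implement that safety net is a thin wrapper around the pool. This is an illustrative sketch, not our production code; withLeakGuard and maxHoldMs are assumed names:

```javascript
// Sketch of the 30-second safety net: wrap a pool so any connection
// held past maxHoldMs is force-released back to the pool.
function withLeakGuard(pool, maxHoldMs = 30_000) {
  return {
    async connect() {
      const connection = await pool.connect();
      const originalRelease = connection.release.bind(connection);
      let released = false;
      const timer = setTimeout(() => {
        if (!released) {
          released = true;
          originalRelease(); // safety net: reclaim a leaked connection
        }
      }, maxHoldMs);
      connection.release = () => {
        if (released) return; // already reclaimed by the guard
        released = true;
        clearTimeout(timer);
        originalRelease();
      };
      return connection;
    },
  };
}
```

The wrapper also makes a late release() after the guard has fired a no-op, so a reclaimed connection is never released twice.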
Lessons Learned
- Gradual failures are harder to catch than sudden ones — invest in trend alerts, not just threshold alerts
- Resource exhaustion often manifests as unrelated symptoms (timeouts vs. "out of connections")
- The fix is often simpler than the investigation — finding the bug is the hard part
- Prevention > detection > response — lint rules catch bugs before they ship
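The trend-alert lesson can be sketched as a simple projection: fit a slope to recent per-minute connection counts and alert when it predicts pool exhaustion. This is an illustration only; trendAlert, the 100-connection max, and the lookahead window are assumptions, not our actual monitor config:

```javascript
// Illustrative trend check: least-squares slope over recent samples
// (one sample per minute), projected forward to predict exhaustion.
function trendAlert(samples, { max = 100, lookaheadMin = 120 } = {}) {
  const n = samples.length;
  const xs = samples.map((_, i) => i);
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (samples[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  const slopePerMin = num / den; // connections gained per minute
  const projected = samples[n - 1] + slopePerMin * lookaheadMin;
  return projected >= max;
}

// A slow, steady climb trips the alert long before the hard limit:
const rising = Array.from({ length: 10 }, (_, i) => 40 + 1.5 * i);
console.log(trendAlert(rising)); // true

// A flat series well under the limit does not:
const flat = Array.from({ length: 10 }, () => 40);
console.log(trendAlert(flat)); // false
```

A pure threshold alert on 40-53 connections would have stayed quiet; the projected slope is what flags a leak like this one hours before exhaustion.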
This report demonstrates my incident investigation methodology. I focus on systematic root cause analysis, clear timelines, and actionable prevention measures that reduce future incidents.