This is a sample deliverable. It demonstrates the format and depth of an architecture review. The system and findings are illustrative, based on patterns I commonly encounter.
Architecture Review Report
System: Acme Commerce Platform • Reviewed: March 2026 • Reviewer: Owen Devereaux
System Overview
Acme Commerce is a B2C e-commerce platform handling ~50,000 daily active users and processing approximately 2,000 orders per day. The system consists of:
- 5 microservices: User, Product, Order, Inventory, Notification
- Infrastructure: AWS (EC2, RDS PostgreSQL, S3, SQS)
- External integrations: Stripe (payments), SendGrid (email), Twilio (SMS)
- Traffic pattern: 3x peaks during business hours, 10x during sales events
The team requested this review ahead of an upcoming marketing push expected to drive roughly 5x normal traffic, and following a series of minor outages over the past quarter.
Review Methodology
This review examined the system across five dimensions: scalability, security, reliability, cost, and maintainability.
Executive Summary
Overall assessment: The platform has a solid foundation but has outgrown its current architecture. Several critical issues must be addressed before the planned traffic increase—without intervention, a 5x traffic spike would likely cause a complete outage.
The good news: Most high-impact fixes are relatively low effort. Implementing the top 5 recommendations would significantly improve reliability and could reduce infrastructure costs by ~30%.
Priority Actions (Pre-Launch)
These are the five highest-impact fixes. They should be completed before, or immediately after, the marketing push; the prioritized roadmap below gives the exact sequencing:
- Add database read replicas (SCALE-001) — 1-2 weeks. Eliminates primary scaling bottleneck.
- Migrate secrets to Secrets Manager (SEC-001) — 1 week. Critical security hygiene.
- Implement circuit breakers (REL-001) — 1 week. Prevents cascade failures.
- Move sessions to Redis (SCALE-002) — 2-3 days. Enables horizontal scaling.
- Add rate limiting (SEC-002) — 2-3 days. Protects against abuse.
Findings Summary
| ID | Severity | Category | Title | Effort |
|---|---|---|---|---|
| SCALE-001 | critical | 📈 Scalability | Database is a single point of failure with no read replicas | Medium |
| SEC-001 | critical | 🔒 Security | API keys stored in plaintext in environment variables shared across services | Medium |
| REL-001 | high | ⚡ Reliability | No circuit breakers on external service calls | Medium |
| SCALE-002 | high | 📈 Scalability | Session state stored in application memory | Low |
| COST-001 | high | 💰 Cost | Over-provisioned compute with no autoscaling | Low |
| SEC-002 | high | 🔒 Security | No rate limiting on public API endpoints | Low |
| REL-002 | medium | ⚡ Reliability | No health checks on background job processors | Low |
| MAINT-001 | medium | 🔧 Maintainability | Shared database across services with no schema ownership | High |
| SCALE-003 | medium | 📈 Scalability | Synchronous image processing in request path | Medium |
| COST-002 | medium | 💰 Cost | No CDN for static assets and API responses | Low |
| SEC-003 | low | 🔒 Security | Verbose error messages expose internal details | Low |
| MAINT-002 | low | 🔧 Maintainability | No API versioning strategy | Medium |
Detailed Findings
SCALE-001: Database is a single point of failure with no read replicas
All read and write traffic hits a single PostgreSQL instance. The database is running at 78% CPU during peak hours with no horizontal scaling path.
During traffic spikes (Black Friday, marketing campaigns), the database becomes the bottleneck. Response times degrade from 200ms to 3-4 seconds. At 2x current traffic, you risk complete service outage.
Deploy read replicas and route read-heavy queries (product listings, order history) to replicas. Use connection pooling (PgBouncer) to reduce connection overhead.
Read replicas introduce replication lag (typically 10-100ms). Some queries may show stale data briefly. Need to audit which queries can tolerate eventual consistency.
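One way to implement the read routing is a small dispatcher that sends read-only statements to a replica and everything else to the primary. This is an illustrative sketch: the returned pool names stand in for two node-postgres `Pool` instances pointing at the primary and replica endpoints.

```javascript
// Route read-only queries to a replica, everything else to the primary.
// 'replica' / 'primary' are placeholders for real connection pools.

const READ_ONLY = /^\s*(select|with)\b/i;

// Statements that must see their own writes go to the primary even though
// they start with SELECT (e.g. SELECT ... FOR UPDATE).
const FORCE_PRIMARY = /\bfor\s+(update|share)\b/i;

function routeQuery(sql) {
  if (READ_ONLY.test(sql) && !FORCE_PRIMARY.test(sql)) {
    return 'replica'; // tolerates replication lag (typically 10-100 ms)
  }
  return 'primary';
}
```

The locking-clause exception matters: a `SELECT ... FOR UPDATE` routed to a replica would fail, since replicas reject writes and row locks.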
SEC-001: API keys stored in plaintext in environment variables shared across services
Stripe API keys, SendGrid credentials, and internal service tokens are stored as plaintext environment variables. These are visible in CI logs, container inspect output, and shared across all microservices.
A compromise of any single service exposes credentials to all external services. CI logs have already leaked the SendGrid API key (found in a 6-month-old build log).
Migrate to a secrets manager (AWS Secrets Manager, HashiCorp Vault). Implement service-specific credentials with least-privilege access. Enable secret rotation.
Adds operational complexity and a new dependency. Secret manager downtime could impact deployments. Consider caching secrets with TTL to reduce dependency.
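The TTL caching suggested above can be sketched as follows. `fetchSecret` is a placeholder for the real SDK call (e.g. Secrets Manager `GetSecretValue`), and the five-minute TTL is an assumption to tune against your rotation schedule.

```javascript
// Cache secrets fetched from a secrets manager with a TTL, so a brief
// manager outage doesn't block requests or deployments.

class SecretCache {
  constructor(fetchSecret, ttlMs = 5 * 60 * 1000) {
    this.fetchSecret = fetchSecret; // async (name) => secret value
    this.ttlMs = ttlMs;
    this.entries = new Map();       // name -> { value, expiresAt }
  }

  async get(name) {
    const hit = this.entries.get(name);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    try {
      const value = await this.fetchSecret(name);
      this.entries.set(name, { value, expiresAt: Date.now() + this.ttlMs });
      return value;
    } catch (err) {
      // Serve the stale value if the manager is down and we have one.
      if (hit) return hit.value;
      throw err;
    }
  }
}
```

Serving a stale secret past its TTL during a manager outage is a deliberate trade-off: it keeps services running, at the cost of a delay before rotated credentials take effect.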
REL-001: No circuit breakers on external service calls
The order service makes synchronous calls to payment processor, inventory service, and notification service. If any downstream service is slow or unavailable, requests queue up and exhaust the thread pool.
A 30-second Stripe outage last month cascaded into a 15-minute full outage. The order service ran out of threads waiting for Stripe, causing health checks to fail and triggering a restart loop.
Implement circuit breakers (e.g., resilience4j, Polly) with appropriate thresholds. Add timeouts on all external calls. Consider async processing for non-critical operations (notifications).
Circuit breakers add complexity and require tuning. Half-open states need careful handling. May need to implement fallback behaviors (e.g., queue orders for later payment processing).
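For illustration, a minimal circuit breaker looks roughly like this; libraries such as resilience4j or Polly provide hardened versions with richer half-open handling and metrics. The threshold and cooldown values are placeholders.

```javascript
// Minimal circuit breaker: opens after `threshold` consecutive failures,
// fails fast for `cooldownMs`, then allows one probe call (half-open).

class CircuitBreaker {
  constructor(threshold = 5, cooldownMs = 30000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = 0;
  }

  get state() {
    if (this.failures < this.threshold) return 'closed';
    return Date.now() - this.openedAt < this.cooldownMs ? 'open' : 'half-open';
  }

  async call(fn) {
    if (this.state === 'open') {
      throw new Error('circuit open: failing fast'); // no thread/socket held
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

The key property for the Stripe incident described above: once open, calls are rejected immediately instead of holding a thread for the full timeout, so the thread pool never drains.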
SCALE-002: Session state stored in application memory
User sessions and shopping carts are stored in Node.js process memory. This prevents horizontal scaling—users get errors when load balancer routes them to a different instance.
Cannot scale beyond a single instance per service. Deployments cause all users to lose their sessions. Currently working around this with sticky sessions, which creates uneven load distribution.
Move session storage to Redis. This enables stateless instances and proper horizontal scaling. Redis Cluster for HA if session loss is unacceptable.
Adds Redis as a dependency. Session serialization overhead is minimal (<1ms). Consider Redis persistence settings based on session importance.
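The externalized store boils down to a keyed get/set with a TTL. In the sketch below a `Map` stands in for Redis so the example is self-contained; with Redis the same shape maps onto `SETEX`/`GET`. The 30-minute TTL is an assumption.

```javascript
// Externalized session store: any instance can serve any request because
// session state lives outside process memory.

class SessionStore {
  constructor(ttlMs = 30 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.store = new Map(); // sessionId -> { data, expiresAt }
  }

  set(id, data, now = Date.now()) {
    this.store.set(id, { data, expiresAt: now + this.ttlMs });
  }

  get(id, now = Date.now()) {
    const s = this.store.get(id);
    if (!s || s.expiresAt <= now) return null; // expired or unknown session
    return s.data;
  }
}
```

With this in place, sticky sessions can be dropped and the load balancer can route freely; deployments also stop wiping carts, since state survives process restarts.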
COST-001: Over-provisioned compute with no autoscaling
All services run on fixed-size EC2 instances (m5.xlarge) 24/7. Analysis shows average CPU utilization of 12% with peaks to 60%. No autoscaling configured.
Estimated $4,200/month in wasted compute. During off-peak hours (midnight-6am), resources are 95% idle. During peak, you're occasionally under-provisioned.
Implement autoscaling based on CPU/request metrics. Right-size base instances to m5.large. Consider Spot instances for non-critical workloads.
Autoscaling adds 1-2 minutes of scaling lag. Cold starts may impact first requests to new instances. Set appropriate minimum capacity for predictable traffic patterns.
SEC-002: No rate limiting on public API endpoints
Public endpoints (login, registration, password reset, product search) have no rate limiting. The search endpoint in particular is expensive (triggers full-text search).
Vulnerable to brute force attacks on authentication. A recent bot scraped the entire product catalog, costing $340 in excess database and CDN charges. Search endpoint can be weaponized for DoS.
Implement rate limiting at API gateway level. Suggested limits: 5 login attempts/minute/IP, 100 search requests/minute/IP, 1000 general requests/minute/user.
Legitimate users may hit limits during normal usage (e.g., fast typing triggers multiple searches). Implement exponential backoff responses rather than hard blocks. Consider authenticated vs. unauthenticated limits.
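A fixed-window limiter keyed by IP illustrates the suggested limits. In production the counters would live in Redis or the API gateway so all instances share state; the `Map` here keeps the sketch self-contained, and the window size is the one-minute window proposed above.

```javascript
// Fixed-window rate limiter: allows up to `limit` requests per key per
// window, then rejects until the window rolls over.

class RateLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;       // e.g. 5 login attempts
    this.windowMs = windowMs; // e.g. 60000 (one minute)
    this.windows = new Map(); // key -> { start, count }
  }

  allow(key, now = Date.now()) {
    const w = this.windows.get(key);
    if (!w || now - w.start >= this.windowMs) {
      this.windows.set(key, { start: now, count: 1 }); // new window
      return true;
    }
    w.count++;
    return w.count <= this.limit;
  }
}
```

Usage would be one limiter per endpoint class, e.g. `new RateLimiter(5, 60000)` for login and `new RateLimiter(100, 60000)` for search, checked before the handler runs.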
REL-002: No health checks on background job processors
The Sidekiq workers processing order fulfillment, email sends, and report generation have no health monitoring. When a worker dies, jobs silently queue up until someone notices.
Last month, the email worker was down for 4 hours before detection. 2,400 order confirmation emails were delayed. No alerting triggered.
Add health check endpoints to workers. Monitor queue depth with alerts on thresholds. Implement dead letter queues for failed jobs.
Health checks add slight overhead. Need to define "healthy" carefully for workers (e.g., connected to Redis, processing jobs, not memory-leaking).
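One way to frame that definition of "healthy" is a check that combines heartbeat age with queue depth. The thresholds below are illustrative assumptions; real values depend on each queue's job SLA.

```javascript
// Worker health from two signals: a recent heartbeat (the process is alive
// and processing) and a bounded queue depth (it is keeping up).

function workerHealth({ lastHeartbeatMs, queueDepth }, now = Date.now(),
                      { maxHeartbeatAgeMs = 60000, maxQueueDepth = 1000 } = {}) {
  if (now - lastHeartbeatMs > maxHeartbeatAgeMs) return 'dead';
  if (queueDepth > maxQueueDepth) return 'backlogged';
  return 'healthy';
}
```

Wiring either non-`healthy` state to an alert would have caught the 4-hour email-worker outage within a minute of the heartbeat going stale.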
MAINT-001: Shared database across services with no schema ownership
Five microservices read and write to the same PostgreSQL database. No clear ownership of tables—the order service writes to user tables, the user service writes to order tables. 47 cross-service joins in the codebase.
Schema changes are dangerous—unclear which services will break. A recent column rename caused a 2-hour outage. Teams avoid schema changes, leading to technical debt.
Define clear table ownership per service. Create API boundaries for cross-service data access. Long-term: migrate to database-per-service pattern.
Database-per-service adds operational complexity and eventual consistency challenges. Start with ownership boundaries while sharing the database, then gradually separate.
SCALE-003: Synchronous image processing in request path
Product image uploads are processed synchronously—resized to 5 formats and uploaded to S3 during the HTTP request. This takes 8-12 seconds per image.
Product upload endpoint times out under load. Users experience hanging UI. During bulk uploads, API servers become unresponsive.
Move image processing to async queue. Return immediately after upload to temp storage. Process in background, notify via webhook when complete.
Users won't see processed images immediately. Need to handle UI state for "processing" images. Consider showing unprocessed preview while processing.
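The request-path change amounts to enqueueing a job and returning `202 Accepted` instead of resizing inline. In this sketch `queue` stands in for SQS, and the job-id scheme and format list are illustrative assumptions.

```javascript
// Accept an upload, enqueue processing, respond immediately. The original
// image is assumed already written to temp storage under `imageKey`.

function acceptUpload(imageKey, queue) {
  const job = {
    jobId: `img-${imageKey}`,
    imageKey,
    formats: ['thumbnail', 'small', 'medium', 'large', 'zoom'],
  };
  queue.push(job); // SQS SendMessage in production
  return { status: 202, body: { jobId: job.jobId, state: 'processing' } };
}
```

The handler now returns in milliseconds regardless of image size, and a bulk upload simply deepens the queue instead of tying up API server threads for 8-12 seconds per image.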
COST-002: No CDN for static assets and API responses
All traffic, including static assets and cacheable API responses, goes directly to origin servers. Product images are served from S3 via the API server.
Estimated $800/month in unnecessary data transfer. Origin servers handle 10x more traffic than necessary. Latency for international users is 400-600ms.
Deploy CloudFront CDN for static assets. Add cache headers to cacheable API responses (product listings, category data). Cache product images at edge.
CDN adds caching complexity—need cache invalidation strategy. Some dynamic content can't be cached. Consider cache key design carefully.
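Part of that cache key design is deciding a `Cache-Control` policy per route class. The sketch below shows one possible mapping; the route prefixes and TTLs are assumptions to adapt to the actual API surface.

```javascript
// Per-route-class Cache-Control policy for the CDN to honor.

function cacheControlFor(path) {
  // Fingerprinted static assets never change: cache them "forever".
  if (/^\/static\//.test(path)) return 'public, max-age=31536000, immutable';
  // Product listings tolerate brief staleness; serve stale while refreshing.
  if (/^\/api\/products/.test(path)) return 'public, max-age=60, stale-while-revalidate=300';
  // Personalized/dynamic responses must not be cached at the edge.
  return 'no-store';
}
```

The defensive default matters most: it is far cheaper to miss a cacheable route than to accidentally serve one user's cart or order history to another from the edge.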
SEC-003: Verbose error messages expose internal details
API error responses include stack traces, internal service names, and database query details in production. Example: "Error: relation \"users_v2\" does not exist at PostgresConnection.query"
Attackers can map internal architecture. Database table names and query patterns are exposed. Stack traces reveal library versions with known vulnerabilities.
Return generic error messages to clients. Log detailed errors server-side. Implement correlation IDs for debugging without exposing internals.
Debugging becomes slightly harder without stack traces in responses. Ensure correlation IDs are included so support can trace issues.
MAINT-002: No API versioning strategy
The public API has no versioning. Breaking changes require coordinating with all clients. Mobile app releases are tied to API deployments.
Recent API change broke the iOS app for 6 hours until an emergency app update was approved. Development velocity limited by backward compatibility concerns.
Implement URL-based versioning (/v1/, /v2/) or header-based versioning. Maintain backward compatibility for at least 6 months per version.
Maintaining multiple versions means maintaining multiple code paths. Consider feature flags over versions for minor changes, and define a clear deprecation policy.
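With URL-based versioning, the routing layer needs to extract the version before dispatching to a handler. A sketch, with one assumption called out in the comment: unversioned paths default to v1 so existing clients keep working during the transition.

```javascript
// Extract the API version from a URL path for per-version dispatch.
// Assumption: unversioned paths are treated as v1 (the current behavior)
// so existing mobile clients are not broken on rollout.

function parseVersion(path) {
  const match = path.match(/^\/v(\d+)\//);
  return match ? Number(match[1]) : 1;
}
```

A breaking change then ships as `/v2/` handlers while `/v1/` keeps its old behavior, decoupling API deployments from mobile release approval cycles.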
Prioritized Roadmap
Based on impact, effort, and dependencies, here's the recommended implementation order:
Phase 1: Immediate (Before Traffic Increase) — 2-3 weeks
- SCALE-002: Move sessions to Redis (2-3 days) — Unblocks horizontal scaling
- SEC-002: Implement rate limiting (2-3 days) — Quick win, low risk
- COST-001: Enable autoscaling (1-2 days) — Can be done in parallel
- SCALE-001: Deploy read replicas (1-2 weeks) — Critical for traffic spike
Phase 2: Short-term (Next 4-6 weeks)
- SEC-001: Migrate to Secrets Manager (1 week) — Security critical
- REL-001: Add circuit breakers (1 week) — Prevents cascade failures
- REL-002: Health checks for workers (2-3 days) — Improves observability
- COST-002: Deploy CDN (3-5 days) — Performance and cost improvement
Phase 3: Medium-term (Next Quarter)
- SCALE-003: Async image processing (1-2 weeks) — Improves UX
- SEC-003: Fix verbose error messages (2-3 days) — Security hygiene
- MAINT-002: Implement API versioning (2 weeks) — Enables faster iteration
- MAINT-001: Define schema ownership (ongoing) — Foundation for future separation
Estimated Cost Impact
Primary savings come from autoscaling (COST-001, ~$4,200/month in over-provisioned compute) and the CDN (COST-002, ~$800/month in data transfer): roughly $5,000/month combined. This figure does not include incident cost avoidance from the reliability improvements.