This is a sample deliverable. It demonstrates the format and depth of an architecture review. The system and findings are illustrative, based on patterns I commonly encounter.
Architecture Review Report
System: Acme Commerce Platform • Reviewed: March 2026 • Reviewer: Owen Devereaux
System Overview
Acme Commerce is a B2C e-commerce platform handling ~50,000 daily active users and processing approximately 2,000 orders per day. The system consists of:
- 5 microservices: User, Product, Order, Inventory, Notification
- Infrastructure: AWS (EC2, RDS PostgreSQL, S3, SQS)
- External integrations: Stripe (payments), SendGrid (email), Twilio (SMS)
- Traffic pattern: 3x peaks during business hours, 10x during sales events
The team requested this review ahead of an upcoming marketing push expected to drive roughly 5x normal traffic, and following a series of minor outages over the past quarter.
Review Methodology
This review examined the system across five dimensions: scalability, security, reliability, cost, and maintainability.
Executive Summary
Overall assessment: The platform has a solid foundation but has outgrown its current architecture. Several critical issues must be addressed before the planned traffic increase—without intervention, a 5x traffic spike would likely cause a complete outage.
The good news: Most high-impact fixes are relatively low effort. Implementing the top 5 recommendations would significantly improve reliability and could reduce infrastructure costs by ~30%.
Priority Actions (Pre-Launch)
These are the five highest-impact fixes. They should be completed before, or immediately after, the marketing push; the prioritized roadmap below gives the exact sequencing:
- Add database read replicas (SCALE-001) — 1-2 weeks. Eliminates primary scaling bottleneck.
- Migrate secrets to Secrets Manager (SEC-001) — 1 week. Critical security hygiene.
- Implement circuit breakers (REL-001) — 1 week. Prevents cascade failures.
- Move sessions to Redis (SCALE-002) — 2-3 days. Enables horizontal scaling.
- Add rate limiting (SEC-002) — 2-3 days. Protects against abuse.
Findings Summary
| ID | Severity | Category | Title | Effort |
|---|---|---|---|---|
| SCALE-001 | critical | 📈 Scalability | Database is a single point of failure with no read replicas | Medium |
| SEC-001 | critical | 🔒 Security | API keys stored in plaintext in environment variables shared across services | Medium |
| REL-001 | high | ⚡ Reliability | No circuit breakers on external service calls | Medium |
| SCALE-002 | high | 📈 Scalability | Session state stored in application memory | Low |
| COST-001 | high | 💰 Cost | Over-provisioned compute with no autoscaling | Low |
| SEC-002 | high | 🔒 Security | No rate limiting on public API endpoints | Low |
| REL-002 | medium | ⚡ Reliability | No health checks on background job processors | Low |
| MAINT-001 | medium | 🔧 Maintainability | Shared database across services with no schema ownership | High |
| SCALE-003 | medium | 📈 Scalability | Synchronous image processing in request path | Medium |
| COST-002 | medium | 💰 Cost | No CDN for static assets and API responses | Low |
| SEC-003 | low | 🔒 Security | Verbose error messages expose internal details | Low |
| MAINT-002 | low | 🔧 Maintainability | No API versioning strategy | Medium |
Detailed Findings
SCALE-001: Database is a single point of failure with no read replicas
All read and write traffic hits a single PostgreSQL instance. The database is running at 78% CPU during peak hours with no horizontal scaling path.
During traffic spikes (Black Friday, marketing campaigns), the database becomes the bottleneck. Response times degrade from 200ms to 3-4 seconds. At 2x current traffic, you risk complete service outage.
Deploy read replicas and route read-heavy queries (product listings, order history) to replicas. Use connection pooling (PgBouncer) to reduce connection overhead.
Read replicas introduce replication lag (typically 10-100ms). Some queries may show stale data briefly. Need to audit which queries can tolerate eventual consistency.
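One way to implement the read routing is a small dispatcher that sends read-only statements to a replica and everything else to the primary. This is an illustrative sketch: the returned pool names stand in for two node-postgres `Pool` instances pointing at the primary and replica endpoints.

```javascript
// Route read-only queries to a replica, everything else to the primary.
// 'replica' / 'primary' are placeholders for real connection pools.

const READ_ONLY = /^\s*(select|with)\b/i;

// Statements that must see their own writes go to the primary even though
// they start with SELECT (e.g. SELECT ... FOR UPDATE).
const FORCE_PRIMARY = /\bfor\s+(update|share)\b/i;

function routeQuery(sql) {
  if (READ_ONLY.test(sql) && !FORCE_PRIMARY.test(sql)) {
    return 'replica'; // tolerates replication lag (typically 10-100 ms)
  }
  return 'primary';
}
```

The locking-clause exception matters: a `SELECT ... FOR UPDATE` routed to a replica would fail, since replicas reject writes and row locks.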
SEC-001: API keys stored in plaintext in environment variables shared across services
Stripe API keys, SendGrid credentials, and internal service tokens are stored as plaintext environment variables. These are visible in CI logs, container inspect output, and shared across all microservices.
A compromise of any single service exposes credentials to all external services. CI logs have already leaked the SendGrid API key (found in a 6-month-old build log).
Migrate to a secrets manager (AWS Secrets Manager, HashiCorp Vault). Implement service-specific credentials with least-privilege access. Enable secret rotation.
Adds operational complexity and a new dependency. Secret manager downtime could impact deployments. Consider caching secrets with TTL to reduce dependency.
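The TTL caching suggested above can be sketched as follows. `fetchSecret` is a placeholder for the real SDK call (e.g. Secrets Manager `GetSecretValue`), and the five-minute TTL is an assumption to tune against your rotation schedule.

```javascript
// Cache secrets fetched from a secrets manager with a TTL, so a brief
// manager outage doesn't block requests or deployments.

class SecretCache {
  constructor(fetchSecret, ttlMs = 5 * 60 * 1000) {
    this.fetchSecret = fetchSecret; // async (name) => secret value
    this.ttlMs = ttlMs;
    this.entries = new Map();       // name -> { value, expiresAt }
  }

  async get(name) {
    const hit = this.entries.get(name);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    try {
      const value = await this.fetchSecret(name);
      this.entries.set(name, { value, expiresAt: Date.now() + this.ttlMs });
      return value;
    } catch (err) {
      // Serve the stale value if the manager is down and we have one.
      if (hit) return hit.value;
      throw err;
    }
  }
}
```

Serving a stale secret past its TTL during a manager outage is a deliberate trade-off: it keeps services running, at the cost of a delay before rotated credentials take effect.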
REL-001: No circuit breakers on external service calls
The order service makes synchronous calls to payment processor, inventory service, and notification service. If any downstream service is slow or unavailable, requests queue up and exhaust the thread pool.
A 30-second Stripe outage last month cascaded into a 15-minute full outage. The order service ran out of threads waiting for Stripe, causing health checks to fail and triggering a restart loop.
Implement circuit breakers (e.g., resilience4j, Polly) with appropriate thresholds. Add timeouts on all external calls. Consider async processing for non-critical operations (notifications).
Circuit breakers add complexity and require tuning. Half-open states need careful handling. May need to implement fallback behaviors (e.g., queue orders for later payment processing).
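For illustration, a minimal circuit breaker looks roughly like this; libraries such as resilience4j or Polly provide hardened versions with richer half-open handling and metrics. The threshold and cooldown values are placeholders.

```javascript
// Minimal circuit breaker: opens after `threshold` consecutive failures,
// fails fast for `cooldownMs`, then allows one probe call (half-open).

class CircuitBreaker {
  constructor(threshold = 5, cooldownMs = 30000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = 0;
  }

  get state() {
    if (this.failures < this.threshold) return 'closed';
    return Date.now() - this.openedAt < this.cooldownMs ? 'open' : 'half-open';
  }

  async call(fn) {
    if (this.state === 'open') {
      throw new Error('circuit open: failing fast'); // no thread/socket held
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

The key property for the Stripe incident described above: once open, calls are rejected immediately instead of holding a thread for the full timeout, so the thread pool never drains.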
SCALE-002: Session state stored in application memory
User sessions and shopping carts are stored in Node.js process memory. This prevents horizontal scaling—users get errors when load balancer routes them to a different instance.
Cannot scale beyond a single instance per service. Deployments cause all users to lose their sessions. Currently working around this with sticky sessions, which creates uneven load distribution.
Move session storage to Redis. This enables stateless instances and proper horizontal scaling. Redis Cluster for HA if session loss is unacceptable.
Adds Redis as a dependency. Session serialization overhead is minimal (<1ms). Consider Redis persistence settings based on session importance.
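The externalized store boils down to a keyed get/set with a TTL. In the sketch below a `Map` stands in for Redis so the example is self-contained; with Redis the same shape maps onto `SETEX`/`GET`. The 30-minute TTL is an assumption.

```javascript
// Externalized session store: any instance can serve any request because
// session state lives outside process memory.

class SessionStore {
  constructor(ttlMs = 30 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.store = new Map(); // sessionId -> { data, expiresAt }
  }

  set(id, data, now = Date.now()) {
    this.store.set(id, { data, expiresAt: now + this.ttlMs });
  }

  get(id, now = Date.now()) {
    const s = this.store.get(id);
    if (!s || s.expiresAt <= now) return null; // expired or unknown session
    return s.data;
  }
}
```

With this in place, sticky sessions can be dropped and the load balancer can route freely; deployments also stop wiping carts, since state survives process restarts.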
COST-001: Over-provisioned compute with no autoscaling
All services run on fixed-size EC2 instances (m5.xlarge) 24/7. Analysis shows average CPU utilization of 12% with peaks to 60%. No autoscaling configured.
Estimated $4,200/month in wasted compute. During off-peak hours (midnight-6am), resources are 95% idle. During peak, you're occasionally under-provisioned.
Implement autoscaling based on CPU/request metrics. Right-size base instances to m5.large. Consider Spot instances for non-critical workloads.
Autoscaling adds 1-2 minutes of scaling lag. Cold starts may impact first requests to new instances. Set appropriate minimum capacity for predictable traffic patterns.
SEC-002: No rate limiting on public API endpoints
Public endpoints (login, registration, password reset, product search) have no rate limiting. The search endpoint in particular is expensive (triggers full-text search).
Vulnerable to brute force attacks on authentication. A recent bot scraped the entire product catalog, costing $340 in excess database and CDN charges. Search endpoint can be weaponized for DoS.
Implement rate limiting at API gateway level. Suggested limits: 5 login attempts/minute/IP, 100 search requests/minute/IP, 1000 general requests/minute/user.
Legitimate users may hit limits during normal usage (e.g., fast typing triggers multiple searches). Implement exponential backoff responses rather than hard blocks. Consider authenticated vs. unauthenticated limits.
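A fixed-window limiter keyed by IP illustrates the suggested limits. In production the counters would live in Redis or the API gateway so all instances share state; the `Map` here keeps the sketch self-contained, and the window size is the one-minute window proposed above.

```javascript
// Fixed-window rate limiter: allows up to `limit` requests per key per
// window, then rejects until the window rolls over.

class RateLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;       // e.g. 5 login attempts
    this.windowMs = windowMs; // e.g. 60000 (one minute)
    this.windows = new Map(); // key -> { start, count }
  }

  allow(key, now = Date.now()) {
    const w = this.windows.get(key);
    if (!w || now - w.start >= this.windowMs) {
      this.windows.set(key, { start: now, count: 1 }); // new window
      return true;
    }
    w.count++;
    return w.count <= this.limit;
  }
}
```

Usage would be one limiter per endpoint class, e.g. `new RateLimiter(5, 60000)` for login and `new RateLimiter(100, 60000)` for search, checked before the handler runs.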
REL-002: No health checks on background job processors
The Sidekiq workers processing order fulfillment, email sends, and report generation have no health monitoring. When a worker dies, jobs silently queue up until someone notices.
Last month, the email worker was down for 4 hours before detection. 2,400 order confirmation emails were delayed. No alerting triggered.
Add health check endpoints to workers. Monitor queue depth with alerts on thresholds. Implement dead letter queues for failed jobs.
Health checks add slight overhead. Need to define "healthy" carefully for workers (e.g., connected to Redis, processing jobs, not memory-leaking).
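One way to frame that definition of "healthy" is a check that combines heartbeat age with queue depth. The thresholds below are illustrative assumptions; real values depend on each queue's job SLA.

```javascript
// Worker health from two signals: a recent heartbeat (the process is alive
// and processing) and a bounded queue depth (it is keeping up).

function workerHealth({ lastHeartbeatMs, queueDepth }, now = Date.now(),
                      { maxHeartbeatAgeMs = 60000, maxQueueDepth = 1000 } = {}) {
  if (now - lastHeartbeatMs > maxHeartbeatAgeMs) return 'dead';
  if (queueDepth > maxQueueDepth) return 'backlogged';
  return 'healthy';
}
```

Wiring either non-`healthy` state to an alert would have caught the 4-hour email-worker outage within a minute of the heartbeat going stale.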
MAINT-001: Shared database across services with no schema ownership
Five microservices read and write to the same PostgreSQL database. No clear ownership of tables—the order service writes to user tables, the user service writes to order tables. 47 cross-service joins in the codebase.
Schema changes are dangerous—unclear which services will break. A recent column rename caused a 2-hour outage. Teams avoid schema changes, leading to technical debt.
Define clear table ownership per service. Create API boundaries for cross-service data access. Long-term: migrate to database-per-service pattern.
Database-per-service adds operational complexity and eventual consistency challenges. Start with ownership boundaries while sharing the database, then gradually separate.
SCALE-003: Synchronous image processing in request path
Product image uploads are processed synchronously—resized to 5 formats and uploaded to S3 during the HTTP request. This takes 8-12 seconds per image.
Product upload endpoint times out under load. Users experience hanging UI. During bulk uploads, API servers become unresponsive.
Move image processing to async queue. Return immediately after upload to temp storage. Process in background, notify via webhook when complete.
Users won't see processed images immediately. Need to handle UI state for "processing" images. Consider showing unprocessed preview while processing.
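The request-path change amounts to enqueueing a job and returning `202 Accepted` instead of resizing inline. In this sketch `queue` stands in for SQS, and the job-id scheme and format list are illustrative assumptions.

```javascript
// Accept an upload, enqueue processing, respond immediately. The original
// image is assumed already written to temp storage under `imageKey`.

function acceptUpload(imageKey, queue) {
  const job = {
    jobId: `img-${imageKey}`,
    imageKey,
    formats: ['thumbnail', 'small', 'medium', 'large', 'zoom'],
  };
  queue.push(job); // SQS SendMessage in production
  return { status: 202, body: { jobId: job.jobId, state: 'processing' } };
}
```

The handler now returns in milliseconds regardless of image size, and a bulk upload simply deepens the queue instead of tying up API server threads for 8-12 seconds per image.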
COST-002: No CDN for static assets and API responses
All traffic, including static assets and cacheable API responses, goes directly to origin servers. Product images are served from S3 via the API server.
Estimated $800/month in unnecessary data transfer. Origin servers handle 10x more traffic than necessary. Latency for international users is 400-600ms.
Deploy CloudFront CDN for static assets. Add cache headers to cacheable API responses (product listings, category data). Cache product images at edge.
CDN adds caching complexity—need cache invalidation strategy. Some dynamic content can't be cached. Consider cache key design carefully.
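Part of that cache key design is deciding a `Cache-Control` policy per route class. The sketch below shows one possible mapping; the route prefixes and TTLs are assumptions to adapt to the actual API surface.

```javascript
// Per-route-class Cache-Control policy for the CDN to honor.

function cacheControlFor(path) {
  // Fingerprinted static assets never change: cache them "forever".
  if (/^\/static\//.test(path)) return 'public, max-age=31536000, immutable';
  // Product listings tolerate brief staleness; serve stale while refreshing.
  if (/^\/api\/products/.test(path)) return 'public, max-age=60, stale-while-revalidate=300';
  // Personalized/dynamic responses must not be cached at the edge.
  return 'no-store';
}
```

The defensive default matters most: it is far cheaper to miss a cacheable route than to accidentally serve one user's cart or order history to another from the edge.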
SEC-003: Verbose error messages expose internal details
API error responses include stack traces, internal service names, and database query details in production. Example: "Error: relation \"users_v2\" does not exist at PostgresConnection.query"
Attackers can map internal architecture. Database table names and query patterns are exposed. Stack traces reveal library versions with known vulnerabilities.
Return generic error messages to clients. Log detailed errors server-side. Implement correlation IDs for debugging without exposing internals.
Debugging becomes slightly harder without stack traces in responses. Ensure correlation IDs are included so support can trace issues.
MAINT-002: No API versioning strategy
The public API has no versioning. Breaking changes require coordinating with all clients. Mobile app releases are tied to API deployments.
Recent API change broke the iOS app for 6 hours until an emergency app update was approved. Development velocity limited by backward compatibility concerns.
Implement URL-based versioning (/v1/, /v2/) or header-based versioning. Maintain backward compatibility for at least 6 months per version.
Maintaining multiple versions means maintaining multiple code paths. Consider feature flags over versions for minor changes, and define a clear deprecation policy.
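With URL-based versioning, the routing layer needs to extract the version before dispatching to a handler. A sketch, with one assumption called out in the comment: unversioned paths default to v1 so existing clients keep working during the transition.

```javascript
// Extract the API version from a URL path for per-version dispatch.
// Assumption: unversioned paths are treated as v1 (the current behavior)
// so existing mobile clients are not broken on rollout.

function parseVersion(path) {
  const match = path.match(/^\/v(\d+)\//);
  return match ? Number(match[1]) : 1;
}
```

A breaking change then ships as `/v2/` handlers while `/v1/` keeps its old behavior, decoupling API deployments from mobile release approval cycles.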
Prioritized Roadmap
Based on impact, effort, and dependencies, here's the recommended implementation order:
Phase 1: Immediate (Before Traffic Increase) — 2-3 weeks
- SCALE-002: Move sessions to Redis (2-3 days) — Unblocks horizontal scaling
- SEC-002: Implement rate limiting (2-3 days) — Quick win, low risk
- COST-001: Enable autoscaling (1-2 days) — Can be done in parallel
- SCALE-001: Deploy read replicas (1-2 weeks) — Critical for traffic spike
Phase 2: Short-term (Next 4-6 weeks)
- SEC-001: Migrate to Secrets Manager (1 week) — Security critical
- REL-001: Add circuit breakers (1 week) — Prevents cascade failures
- REL-002: Health checks for workers (2-3 days) — Improves observability
- COST-002: Deploy CDN (3-5 days) — Performance and cost improvement
Phase 3: Medium-term (Next Quarter)
- SCALE-003: Async image processing (1-2 weeks) — Improves UX
- SEC-003: Fix verbose error messages (2-3 days) — Security hygiene
- MAINT-002: Implement API versioning (2 weeks) — Enables faster iteration
- MAINT-001: Define schema ownership (ongoing) — Foundation for future separation
Estimated Cost Impact
Primary savings come from autoscaling (COST-001, ~$4,200/month in over-provisioned compute) and the CDN (COST-002, ~$800/month in data transfer): roughly $5,000/month combined. This figure does not include incident cost avoidance from the reliability improvements.