
API Performance Optimization: From Simple Fixes to High-Scale Solutions
A hands-on guide to faster and more reliable APIs.
Part 1: Why Performance is a Feature (Not Just Engineering Vanity)
Performance optimization often gets dismissed as engineering for engineering’s sake. Developers quote Knuth’s “premature optimization is the root of all evil” as an excuse to ignore performance entirely. Product wants features. The CEO wants revenue. Investors want growth. Nobody’s openly asking for millisecond improvements in API response times, so teams continue shipping slow code, confident they’re making the right trade-off.
Often, they’re wrong.
Performance directly impacts every business metric teams claim to care about - customer satisfaction, conversion rates, revenue, infrastructure costs, even developer productivity. Slow systems aren’t just a technical inconvenience; they’re a business liability.
Code should be written in a reasonably optimized manner, proportionate to current and foreseeable needs. Whether optimization is “premature” or well-planned is a cost/benefit judgment call - and one that’s hard to make without understanding what good performance looks like. This article is about building that understanding.
The Business Case for Speed
Amazon famously reported that every 100ms of latency costs them 1% in sales. Google found that an extra 0.5 seconds in search page generation time dropped traffic by 20%. These aren’t abstract engineering metrics - they’re revenue numbers.
For consumer-facing APIs, the relationship is straightforward: slower responses mean lower conversion rates. Users abandon checkout flows, close apps, and switch to competitors. Mobile users are particularly sensitive - on 4G connections, every extra second of load time matters.
For B2B APIs, performance is often contractual. Your SLA might specify p95 latency under 200ms. Miss that consistently and you’re paying penalties or losing customers.
Then there’s infrastructure cost. Faster APIs need fewer servers. If you can reduce response time by 50%, you might be able to handle the same load with half the instances. At scale, that’s significant. One team I worked with cut their AWS bill by 30-40% - not by changing infrastructure, but by refactoring code so that CPU cycles went into work that was actually necessary instead of, in effect, generating heat.
What Performance Actually Means
Performance isn’t one number - it’s at least three different things that often conflict with each other.
Latency is how long a single request takes. This is what users feel directly. A 50ms API call feels instant. A 500ms call feels sluggish. A 5-second call feels broken.
Throughput is how many requests you can handle per second. You might have 50ms latency under normal load, but if your system can only handle 100 requests/second and you’re getting 1,000, those request times are about to explode.
Reliability is what percentage of requests actually succeed within acceptable time. A system that’s fast 95% of the time but times out 5% of the time is worse than a system that’s consistently slightly slower. You will remember that as soon as you end up on the 5% side of things.
Where it gets tricky: optimizing for one often hurts the others. Caching improves latency but can hurt reliability (stale data). Connection pooling improves throughput but can increase latency (queuing). You’re constantly making trade-offs.
Why Averages Lie: The Tyranny of Tail Latency
Your monitoring dashboard shows average response time: 100ms. Looks great, right? Ship it.
Except average is usually a useless metric. If 95% of your requests complete in 50ms but 5% take 2 seconds, your average might be 150ms. That looks fine. But 5% of your users are having a terrible experience.
This is why we talk about percentiles: p50 (median), p95 (95th percentile), p99 (99th percentile). The p95 means “95% of requests are faster than this.” If your p95 is 500ms, that means 1 in 20 requests takes at least half a second.
For most APIs, you should optimize for p95 or p99, not average. That’s where your actual user experience lives. Sure, most requests are fast - but the slow ones are the ones users complain about.
The challenge with tail latency is that it’s often caused by different problems than average latency. Slow averages usually mean your code is inefficient or your database is overloaded. Slow tail latency often means: garbage collection pauses, connection pool exhaustion, slow external APIs, cache misses, database query variance, or network issues.
Perceived Performance vs. Measured Performance
Users don’t experience your API latency directly - they experience your UI responsiveness. A 100ms API call can feel instant if your UI is well-designed (optimistic updates, skeleton screens, progressive loading). A 50ms API call can feel slow if your UI blocks on it.
This matters when prioritizing optimization work. Sometimes the right answer isn’t making your API faster - it’s making your client smarter about when and how it calls your API.
Setting Realistic SLOs
Service Level Objectives (SLOs) are your performance targets. Not aspirational goals - actual commitments you design and operate to.
Good SLOs are specific: “95% of /api/search requests will complete in under 200ms.” Bad SLOs are vague: “The API should be fast.”
A reasonable starting point for different API types (yes, it’s a very broad generalization):
Read-heavy endpoints (product listings, user profiles): p95 < 100ms, p99 < 200ms
Write endpoints (create order, update profile): p95 < 200ms, p99 < 500ms
Search and complex queries: p95 < 500ms, p99 < 1s
Batch operations: p95 < 5s, p99 < 10s
Real-time/streaming: p95 < 50ms for initial connection
These are starting points, not rules. Your SLOs depend on your use case. A payment API might need p95 < 100ms because users are waiting. An analytics dashboard API might be fine with p95 < 2s because users expect some delay.
The key is: define them, measure against them, and alert when you’re not meeting them. An SLO you don’t measure is just wishful thinking.
When to Optimize (and When Not To)
Premature optimization is real. I’ve seen teams spend weeks optimizing code that runs once per day while their main API is timing out under load.
But that doesn’t mean you should ignore performance until it becomes a problem. Some decisions are expensive to change later: database schema design, API contracts, choice of synchronous vs. asynchronous patterns.
Some rules of thumb:
Optimize now if:
- You’re designing a core API that will be called thousands of times per request
- You’re choosing between architectural patterns (sync vs. async, monolith vs. microservices)
- You’re at risk of missing contractual SLAs
- Your infrastructure costs are growing faster than your revenue
- You’re hitting hard limits (database connections, file descriptors)
Optimize later if:
- You’re building a new feature with unknown adoption
- You’re in a low-traffic part of the system
- The optimization would require significant refactoring
- You don’t have monitoring to validate the improvement
Don’t optimize if:
- You’re guessing about where the problem is (measure first)
- The code runs infrequently (once per day is not a hot path)
- The optimization results in a small improvement yet makes the code significantly harder to understand
The “we’ll fix it later” trap is real, though. I’ve seen too many teams ship something slow, planning to optimize it post-launch, and then never finding the time. Technical debt accumulates. If you know something will be slow and high-traffic, fix it before launch - refactoring under production load is painful.
Part 2: Measure Before You Move (Observability Fundamentals)
You can’t optimize what you can’t measure. This sounds obvious, but I’ve seen engineers spend days optimizing the wrong thing because they “had a good hunch” and didn’t want to “waste” a few extra hours on proper investigation and measurement.
The Golden Signals: What to Actually Monitor
There are four metrics that matter for every service: latency, traffic, errors, and saturation. These are the “golden signals” from Google’s SRE book, and they’re golden for a reason - they tell you what’s wrong and where to look.
Latency: How long requests take. Track this as a distribution (p50, p95, p99), not an average. Break it down by endpoint, by customer, by region. Latency tells you when something is slow, but not why.
Traffic: How many requests you’re handling. Requests per second, broken down by endpoint. This tells you if your load is increasing, if a particular endpoint is getting hammered, or if traffic patterns have changed.
Errors: What percentage of requests are failing. Track by error type (4xx vs. 5xx), by endpoint, over time. A spike in errors often predicts a spike in latency - failing requests are usually faster than slow successful ones.
Saturation: How “full” your system is. CPU usage, memory usage, database connections, queue depth. Saturation tells you when you’re about to fall over. If you’re at 80% CPU and traffic doubles, you’re in trouble.
For services, use RED metrics (Rate, Errors, Duration). For resources (databases, caches), use USE metrics (Utilization, Saturation, Errors). Between these two frameworks, you’ve got most of what you need.
Structured Logging and Correlation IDs
Every request should get a correlation ID (trace ID) at the entry point. This ID flows through every service, every database query, every external API call. When something goes wrong, you can grep for that ID and see the entire request path.
Log in structured JSON, not free-text. This makes logs searchable and analyzable:
{
"timestamp": "2025-10-12T14:32:11.234Z",
"level": "INFO",
"correlation_id": "01HF4F2X...",
"service": "api-gateway",
"endpoint": "/api/orders",
"method": "POST",
"duration_ms": 234,
"status": 200,
"user_id": "12345"
}
Now you can query: “Show me all requests for user 12345 that took longer than 1 second” or “Show me all 500 errors from the orders endpoint in the last hour.”
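A minimal sketch of how the ID gets attached: net/http middleware that reuses or generates the ID at the entry point. The header name, context key, and uuid dependency are assumptions, not a standard - adapt them to your stack:
package middleware

import (
	"context"
	"net/http"

	"github.com/google/uuid" // assumed ID generator; any unique-ID source works
)

type ctxKey string

const correlationIDKey ctxKey = "correlation_id"

// CorrelationID reuses an incoming X-Correlation-ID header or generates a new ID,
// stores it in the request context, and echoes it back in the response.
func CorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = uuid.NewString()
		}
		ctx := context.WithValue(r.Context(), correlationIDKey, id)
		w.Header().Set("X-Correlation-ID", id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
Every log line and downstream request then pulls the ID from the context, which is what makes the JSON logs above greppable end to end.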
Distributed Tracing: Following Requests End-to-End
Logs tell you what happened. Traces tell you where time was spent.
A distributed trace shows the entire request path: API gateway (5ms) → Auth service (10ms) → Orders service (50ms) → Database query (40ms) → Payment provider (120ms) → Response. You can immediately see that the payment provider is your bottleneck.
Tools like Jaeger, Zipkin, or Tempo (open source) or DataDog, New Relic (commercial) provide distributed tracing. The key is instrumenting your code to create spans for each operation and link them via context propagation.
The cost of tracing is non-zero - creating spans and shipping them to a collector adds overhead. Use sampling: trace 1% of requests, or trace all slow requests (tail-based sampling). You don’t need 100% coverage to find bottlenecks.
Flame Graphs: Visualizing Where Time Goes
A flame graph shows your code execution as a stack of horizontal bars. The wider the bar, the more time spent in that function. It makes performance problems obvious: you can literally see that 60% of your API time is spent in one database query.
Most profilers (Go’s pprof, Java’s async-profiler, Python’s py-spy) can generate flame graphs. Make sure you can enable them on demand in production - you’ll need them when things go wrong.
APM Solutions vs. Open Source: The Trade-offs
Commercial APM tools (DataDog, New Relic, Dynatrace) are expensive but turnkey. You install an agent, and you get metrics, tracing, profiling, alerting out of the box. The value is speed - you’re paying to not build and maintain observability infrastructure.
Open source (Prometheus + Grafana + Jaeger/Tempo) is often called “free” - a convenient omission of the infrastructure and people costs required to run it. The upside: full control, customization, and at scale, significant savings.
My recommendation: start with commercial if you can afford it. The faster you get observability, the faster you find problems. Switch to open source when the bill becomes painful or you need customization that commercial tools don’t provide.
Finding Bottlenecks: The Detective Work
Once you have observability, finding bottlenecks is detective work. The usual process:
Start with the slowest endpoint. Look at your latency dashboard. Sort by p95. Pick the slowest endpoint that gets meaningful traffic.
Look at a distributed trace. Find a slow request (p99). Follow it through the system. Where did time go? Common culprits: database queries, external API calls, lock contention, serialization/deserialization.
Check database slow query logs. Most databases log queries that take longer than a threshold (e.g., 100ms). Look for queries that are running frequently and slowly. These are your low-hanging fruit.
Profile CPU and memory. If database queries are fast but your API is slow, the problem is in your code. Run a CPU profiler. Look for functions that consume disproportionate CPU. Common problems: unnecessary loops, inefficient algorithms, expensive serialization.
Check external dependencies. If you’re calling other APIs, how long do they take? Are timeouts set appropriately? Is one provider consistently slow?
Look for saturation. Is CPU at 100%? Database connections at max? Queue depth growing? Saturation means you’re at capacity and need to scale or optimize.
Load Testing That Actually Reveals Problems
Load testing in staging rarely matches production. The data is smaller, the traffic patterns are artificial, and nobody’s surprised when staging falls over.
But you still need load testing. To make it useful:
Use realistic traffic mixes. Don’t just hammer GET /health. Use production traffic patterns: 60% reads, 30% writes, 10% searches. Vary the request sizes, the user types, the data accessed.
Ramp up gradually. Start at your normal load. Increase by 10% every minute. Watch your latency and error rate. When do they start degrading? That’s your current capacity.
Run soak tests. Sustained load for hours or days. This finds memory leaks, connection leaks, and slow degradation that won’t show up in 10-minute tests.
Run spike tests. Go from normal load to 5x load instantly. See how your system handles sudden traffic spikes. Does it degrade gracefully or fall over?
Load test in production. The only realistic load test is production. Use canary deployments and traffic shifting (Istio VirtualService, Linkerd, or your ingress controller’s weighted routing) to send 1% of traffic to the new version. Monitor latency and errors. If it looks good, increase to 5%, then 25%, then 100%. If it degrades, roll back instantly.
Part 3: Quick Wins
These optimizations take minimal effort and provide immediate results.
Connection Reuse: The 10-Second Fix
A common performance disaster: opening a new database connection for every request. Each connection requires a TCP handshake, TLS negotiation, and authentication. That’s 20-50ms of overhead before your query even runs.
The fix: connection pooling. Most database clients support this out of the box. You configure a pool of persistent connections that get reused across requests.
// Bad: new connection per request
func HandleRequest() {
db, _ := sql.Open("postgres", connString)
defer db.Close()
db.Query("SELECT ...")
}
// Good: connection pool, configured once
type OrderService struct {
db *sql.DB
}
func NewOrderService(connString string) (*OrderService, error) {
db, err := sql.Open("postgres", connString)
if err != nil {
return nil, err
}
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(25)
db.SetConnMaxLifetime(5 * time.Minute)
return &OrderService{db: db}, nil
}
func (s *OrderService) GetOrder(ctx context.Context, id string) (*Order, error) {
var order Order
// QueryRowContext checks a connection out of the pool and returns it after Scan
err := s.db.QueryRowContext(ctx, "SELECT ...", id).Scan( /* ... */ )
return &order, err
}
Same goes for HTTP clients calling external APIs. Don’t create a new http.Client for each request. Create one client with connection pooling and reuse it.
HTTP/1.1 persistent connections (keep-alive) are usually enabled by default in most clients - the catch is you need to reuse the same client instance and fully read/close response bodies, or connections won’t be returned to the pool.
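A hedged sketch of that gotcha - the helper name and JSON decoding are incidental, the point is the deferred drain-and-close that lets the connection go back into the pool:
import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func fetchJSON(ctx context.Context, client *http.Client, url string, out any) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	// Drain any unread bytes and close the body; otherwise the underlying
	// connection is not reused and keep-alive buys you nothing.
	defer func() {
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return json.NewDecoder(resp.Body).Decode(out)
}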
HTTP/2 takes this further with connection multiplexing - multiple requests share a single TCP connection. If you’re making many requests to the same service, HTTP/2 can dramatically reduce latency.
For gRPC, configure keepalive to prevent idle connections from being dropped by load balancers:
conn, err := grpc.Dial(addr,
grpc.WithKeepaliveParams(keepalive.ClientParameters{
Time: 5 * time.Minute, // ping if no activity
Timeout: 20 * time.Second, // wait for ping ack
PermitWithoutStream: true, // ping even without active RPCs
}),
)
For HTTP clients, configure the underlying transport’s dialer for TCP keepalive:
transport := &http.Transport{
DialContext: (&net.Dialer{
Timeout: 30 * time.Second,
KeepAlive: 30 * time.Second,
}).DialContext,
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 90 * time.Second,
}
client := &http.Client{Transport: transport}
If you’re running Istio, configure TCP keepalive at the mesh level via DestinationRule:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: upstream-service
spec:
  host: upstream-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
        tcpKeepalive:
          time: 300s      # idle time before probes start
          interval: 75s   # probe interval
          probes: 9       # probes before connection is considered dead
DNS caching and TLS session resumption are similar wins. Cache DNS lookups for a few minutes instead of resolving on every request. Resume TLS sessions instead of renegotiating.
Timeouts: Don’t Wait Forever
Default timeouts are usually wrong. HTTP clients often default to no timeout or 30 seconds. That means a stuck upstream service can make your API hang for 30 seconds before failing.
Set aggressive timeouts based on your SLOs. If your SLO is 200ms, set a 150ms timeout on downstream services. This sounds scary - what if the downstream service needs 200ms sometimes? Then fail fast and retry, or degrade gracefully. A fast failure is better than a slow failure.
const upstreamTimeout = 150 * time.Millisecond
func callUpstream(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
ctx, cancel := context.WithTimeout(ctx, upstreamTimeout)
defer cancel()
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return nil, err
}
return client.Do(req)
}
This propagates timeouts through your entire request chain - if the parent context already has a tighter deadline, the child inherits it automatically.
Retries with Exponential Backoff and Jitter
Retries are necessary - networks fail, services restart, databases hiccup. But naive retries cause problems. If 1,000 clients all timeout at the same moment and immediately retry, you’ve just sent 2,000 requests to an already-struggling service. Congratulations, you’ve created a retry storm.
The fix: exponential backoff with jitter. First retry after 100ms, second after 200ms, third after 400ms. Add random jitter so retries don’t synchronize.
import (
	"context"
	"time"

	"github.com/cenkalti/backoff/v4"
)
func callWithRetry(ctx context.Context) error {
b := backoff.NewExponentialBackOff()
b.InitialInterval = 100 * time.Millisecond
b.MaxElapsedTime = 30 * time.Second
return backoff.Retry(func() error {
return doSomething(ctx)
}, backoff.WithContext(b, ctx))
}
Naturally - make sure that the operations you are retrying are idempotent.
Circuit Breakers: Fail Fast
When a downstream service is down, don’t keep hammering it. After N consecutive failures, open a circuit breaker - stop sending requests and return errors immediately. After a timeout period, try one request (half-open state). If it succeeds, close the circuit and resume normal operation.
This prevents cascading failures. If your payment provider is down, fail payment requests immediately instead of timing out for 30 seconds on every request.
Libraries like gobreaker (Go) or resilience4j (Java/Kotlin) implement circuit breakers at the application level.
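A minimal application-level sketch with gobreaker - the thresholds are illustrative, and chargeCard is a hypothetical wrapper around the payment call:
import (
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

var paymentsBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:    "payments",
	Timeout: 30 * time.Second, // how long the circuit stays open before a half-open probe
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.ConsecutiveFailures >= 5 // open after 5 consecutive failures
	},
})

func chargeCard(client *http.Client, req *http.Request) (*http.Response, error) {
	resp, err := paymentsBreaker.Execute(func() (interface{}, error) {
		return client.Do(req)
	})
	if err != nil {
		// Includes gobreaker.ErrOpenState: fail immediately while the circuit is open.
		return nil, err
	}
	return resp.(*http.Response), nil
}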
At the infrastructure level, service meshes handle this for you. Istio’s outlierDetection ejects unhealthy pods automatically:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: upstream-service
spec:
  host: upstream-service.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
This ejects pods that return 5 consecutive 5xx errors, keeping them out for 30 seconds before retrying.
Response Compression: The Bandwidth Win
Enabling gzip or brotli compression on your API responses is often a single config change that reduces payload sizes by 60-80%.
# nginx
gzip on;
gzip_types application/json text/plain text/css application/javascript;
gzip_min_length 1000;
Brotli compresses better than gzip but takes more CPU. Use brotli for static assets (compress once, serve many times). Use gzip for dynamic API responses (compress per request, but it’s fast enough).
The trade-off: compression uses CPU. On a CPU-bound service, compression might hurt more than it helps. Measure before and after.
Pagination: Never Return Unbounded Lists
APIs that return unbounded lists (GET /users returns all users) are ticking time bombs. They work fine with 100 users. They’re slow with 10,000 users. They crash with 1,000,000 users.
Always paginate. Limit default page size to something reasonable (20-100 items). Make clients opt into larger pages.
GET /users?page=1&limit=50
We’ll cover cursor-based vs. offset-based pagination later. For now, just paginate.
The N+1 Query Problem: The Hidden Performance Killer
This is the most common performance bug in web applications:
// Fetch all orders
orders := db.Query("SELECT * FROM orders WHERE user_id = ?", userID)
// For each order, fetch the customer (N+1 queries!)
for _, order := range orders {
customer := db.Query("SELECT * FROM customers WHERE id = ?", order.CustomerID)
// ...
}
If there are 100 orders, this runs 101 queries: one for orders, then one per order for customers. This is slow.
The fix: eager loading. Fetch all the data you need in one or two queries.
// Fetch all orders and customers in one query with a JOIN
rows := db.Query(`
SELECT orders.*, customers.*
FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE orders.user_id = ?
`, userID)
Or fetch in two queries and join in memory:
orders := db.Query("SELECT * FROM orders WHERE user_id = ?", userID)
customerIDs := extractCustomerIDs(orders)
customers := db.Query("SELECT * FROM customers WHERE id IN (?)", customerIDs)
// Join orders and customers in application code
ORMs make N+1 queries easy to write accidentally - lazy loading inside a loop is the classic culprit.
Remove Unnecessary Logging in Hot Paths
Logging is I/O. I/O is slow. Logging inside a tight loop can destroy performance.
// Bad: logs on every iteration
for _, item := range items {
log.Printf("Processing item %s", item.ID)
process(item)
}
// Good: log once before and after
log.Printf("Processing %d items", len(items))
for _, item := range items {
process(item)
}
log.Printf("Finished processing items")
Use log levels appropriately. DEBUG logs should be disabled in production. INFO logs should be sparse. ERROR logs are fine - errors should be rare.
Filter and Aggregate at the Database
Don’t fetch all rows and filter in application code. Push filtering to the database where it’s optimized.
// Bad: fetch everything, filter in code
allUsers := db.Query("SELECT * FROM users")
activeUsers := filterActive(allUsers)
// Good: filter in database
activeUsers := db.Query("SELECT * FROM users WHERE status = 'active'")
Same for aggregations. Use SQL’s SUM, COUNT, AVG instead of fetching rows and computing in code.
Part 4: API Design for Performance
Sometimes the problem isn’t your implementation - it’s your API design. A poorly designed API forces clients to make multiple round-trips, fetch unnecessary data, or poll when they should be notified.
Batching: Reduce Round-Trips
If clients need to fetch 50 users, they shouldn’t make 50 separate requests. Offer a batch endpoint:
POST /users/batch
{
"user_ids": ["user1", "user2", ..., "user50"]
}
Returns all 50 users in one request. The cost: more complex endpoint logic (handle partial failures, maintain backwards compatibility). The benefit: 50x fewer requests, much lower latency for clients.
Be careful with batch sizes. Don’t allow unlimited batch sizes - clients will abuse it. Limit to 100-1000 items per batch.
Batch mutations are trickier because of partial failures. If you’re updating 50 records and #23 fails, what do you do? All-or-nothing (transaction) or best-effort (partial success)? Document clearly.
Pagination Strategies: Offset vs. Cursor
Offset-based pagination is simple:
GET /users?limit=50&offset=100
Returns users 101-150. Easy to implement, easy to understand, works with SQL LIMIT and OFFSET.
The problem: it’s slow on deep pages and inconsistent with concurrent writes. Fetching page 1000 requires the database to skip 50,000 rows. If a user is deleted between page 1 and page 2, results shift and clients see duplicates or miss items.
Cursor-based pagination is better for large datasets:
GET /users?limit=50&cursor=eyJpZCI6MTIzNH0
The cursor is an opaque token (usually base64-encoded) that encodes the position. The server decodes it and fetches the next page relative to that position:
SELECT * FROM users
WHERE id > ?
ORDER BY id
LIMIT 50
This is fast (indexed scan) and consistent (no shifting results). The trade-off: can’t jump to arbitrary pages, can’t show “page 10 of 50” in UI.
For most APIs, cursor-based pagination is the right choice once you have more than a few thousand records.
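One common way to build the opaque cursor is to base64-encode a tiny JSON payload holding the last ID of the previous page - a sketch, with the field name as an assumption:
import (
	"encoding/base64"
	"encoding/json"
)

type cursor struct {
	LastID int64 `json:"id"` // last ID returned on the previous page
}

func encodeCursor(lastID int64) string {
	b, _ := json.Marshal(cursor{LastID: lastID})
	return base64.URLEncoding.EncodeToString(b)
}

func decodeCursor(s string) (int64, error) {
	b, err := base64.URLEncoding.DecodeString(s)
	if err != nil {
		return 0, err
	}
	var c cursor
	if err := json.Unmarshal(b, &c); err != nil {
		return 0, err
	}
	return c.LastID, nil
}
The decoded ID feeds straight into the WHERE id > ? query above; because the token is opaque, you can later change the sort key without breaking clients.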
Conditional Requests: ETags and Client-Side Caching
HTTP has built-in support for client-side caching via ETags. The server sends an ETag header (typically a hash of the response):
GET /users/123
ETag: "abc123"
Client caches this. On the next request, it sends:
GET /users/123
If-None-Match: "abc123"
If the resource hasn’t changed, the server returns 304 Not Modified with no body. The client uses its cached version. If the resource changed, the server returns 200 with the new data and a new ETag.
This is free bandwidth and free latency for unchanged resources. The cost: you need to compute ETags (usually cheap: hash the response) and track which version each client has (usually free: HTTP headers).
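A sketch of the server side in Go, hashing the serialized response to produce the ETag - a truncated SHA-256 is one reasonable choice, not the only one:
import (
	"crypto/sha256"
	"encoding/hex"
	"net/http"
)

func writeWithETag(w http.ResponseWriter, r *http.Request, body []byte) {
	sum := sha256.Sum256(body)
	etag := `"` + hex.EncodeToString(sum[:8]) + `"` // ETags are quoted strings

	w.Header().Set("ETag", etag)
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified) // 304: the client's cached copy is still valid
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}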
GraphQL: Solving Over-Fetching, Creating New Problems
REST APIs often return more data than clients need (GET /users/123 returns 50 fields, client uses 5) or less data than clients need (client makes 10 requests to assemble one page).
GraphQL lets clients request exactly the fields they need:
query {
user(id: "123") {
name
email
orders {
id
total
}
}
}
This solves over-fetching and under-fetching. The problems GraphQL introduces:
N+1 queries are easy to write accidentally. The dataloader pattern solves this: batch and cache within a single request.
Expensive queries are easy to write. Clients can request deeply nested data that requires many database joins. Use query cost limits and depth limits.
Caching is harder. REST endpoints are cacheable by URL. GraphQL has one endpoint with variable queries. Use persisted queries (pre-register allowed queries, reference by ID) to enable caching.
@defer and @stream let you return partial results immediately and stream the rest. This improves perceived performance for slow fields.
My take: GraphQL is great for mobile/frontend teams that need flexibility. It’s overkill for service-to-service APIs where REST or gRPC is simpler. If you adopt GraphQL, invest heavily in dataloaders, query cost limits, and monitoring.
gRPC: When Binary Protocols Win
gRPC uses Protocol Buffers (binary format) over HTTP/2. Compared to JSON over HTTP/1.1:
- Smaller payloads (30-50% reduction typical)
- Faster serialization/deserialization
- Built-in support for streaming
- Strong typing and code generation
The trade-offs: harder to use from browsers (needs grpc-web proxy), requires code generation pipeline.
When to use gRPC: service-to-service communication, high-throughput APIs, streaming use cases. When to stick with REST: public APIs, browser clients.
The performance difference is real but not dramatic for most use cases. I’ve seen 20-30% latency improvements from switching to gRPC, mainly from smaller payloads and HTTP/2 multiplexing. Worth it for high-traffic internal APIs. Probably not worth it for low-traffic public APIs.
Async Operations: Move Work Out of the Request Path
Some operations are inherently slow: generating reports, processing videos, sending bulk emails. Don’t make users wait.
Instead, return immediately with a job ID:
POST /reports
201 Created
Location: /reports/job123
Client polls for status:
GET /reports/job123
{
"status": "processing",
"progress": 45
}
When complete:
GET /reports/job123
{
"status": "complete",
"download_url": "/reports/job123/download"
}
For real-time updates, use WebSockets or Server-Sent Events instead of polling. More complex, but better user experience.
Part 5: Caching (The Stratified Approach)
Caching is the most powerful performance optimization and the most error-prone. Done well, it makes your API 10-100x faster. Done poorly, it serves stale data, creates cache stampedes, and makes debugging impossible.
Caching Layers
Modern systems have multiple caching layers, each with different characteristics:
Client-side (browser, mobile app): Zero network latency, but you can’t invalidate it reliably once cached.
CDN/Edge (CloudFront, Fastly, Cloudflare): Great for static content and cacheable API responses. Limited cache size per edge location.
API gateway (nginx): In-process memory, sub-millisecond. Good for hot endpoints, limited capacity.
Application cache (Redis, Memcached): Network hop but large capacity, flexible invalidation. Most common application-level cache.
Database query cache: Database-managed. Often less useful than you’d hope because query plans change.
The strategy: cache at the highest layer possible. If a response is the same for all users, cache it at the CDN. If it varies by user but not by request, cache it in Redis. If it’s unique per request, don’t cache it.
Cache Key Design
Your cache key determines what gets cached together. Get it wrong and you’ll serve the wrong data or miss the cache constantly.
Simple key: user:{user_id}. Fine for user profile lookups.
Composite key: orders:{user_id}:{status}:{page}. Needed when the same resource varies by multiple parameters.
Be careful not to include too much in the cache key. If your cache key includes timestamp, every request is a cache miss. Include only the dimensions that actually affect the response.
Versioned URLs for static assets: /assets/app.js?v=123. Change the version number when the file changes. Now you can cache forever with no invalidation needed.
Cache Invalidation: The Hard Problem
Phil Karlton famously said “There are only two hard things in Computer Science: cache invalidation and naming things”. The difficulty is keeping cache consistent with source data when that data changes unpredictably. But you don’t need perfect invalidation - you need invalidation that’s better than hitting the database every time. An 80% cache hit rate with occasional stale data is usually better than 0% cache hit rate with perfect freshness.
Time-based (TTL): Set an expiration time. Simple, but serves stale data until TTL expires. Works well for data that changes infrequently or where staleness is acceptable.
SETEX user:123 3600 "{...}" # Expires in 1 hour
Event-based: Invalidate when data changes. Accurate, but requires publishing events and ensuring invalidation happens before responses are served.
func UpdateUser(userID string, updates UserUpdates) error {
// Update database
db.Update("users", userID, updates)
// Invalidate cache
cache.Delete("user:" + userID)
return nil
}
Write-through caching: Update cache and database simultaneously. Ensures cache is always fresh, but adds complexity.
Write-behind caching: Update cache immediately, database eventually. Highest performance, highest risk (cache and database can diverge).
My recommendation: start with TTL-based caching with short expiration times (5-60 minutes depending on how fresh data needs to be). Move to event-based invalidation only for data that must be immediately consistent.
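A cache-aside sketch with a 15-minute TTL, assuming go-redis v9; User is your domain type and loadUserFromDB is a hypothetical database lookup:
import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

func getUser(ctx context.Context, rdb *redis.Client, id string) (*User, error) {
	key := "user:" + id

	// 1. Try the cache.
	if data, err := rdb.Get(ctx, key).Bytes(); err == nil {
		var u User
		if json.Unmarshal(data, &u) == nil {
			return &u, nil
		}
	}

	// 2. Miss: load from the source of truth.
	u, err := loadUserFromDB(ctx, id) // hypothetical database lookup
	if err != nil {
		return nil, err
	}

	// 3. Populate the cache; a failed Set is not fatal, so the error is ignored.
	if data, err := json.Marshal(u); err == nil {
		rdb.Set(ctx, key, data, 15*time.Minute)
	}
	return u, nil
}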
Cache Stampede Prevention
A cache stampede happens when a popular cache entry expires and many requests simultaneously try to regenerate it. All requests hit the slow path (database), overwhelming it.
Potential solutions:
Request coalescing - If 100 requests arrive for the same uncached item, only one request fetches from the database. The other 99 wait for the first to complete, then all get the result.
Probabilistic early expiration - Before TTL expires, probabilistically regenerate the cache entry. This spreads regeneration over time instead of all at once.
Lock-based regeneration - The first request to detect a missing cache entry acquires a lock, regenerates it, then releases. Other requests wait for the lock or serve slightly stale data.
Libraries like groupcache or singleflight (Go) handle request coalescing automatically.
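In Go, golang.org/x/sync/singleflight gives you coalescing in a few lines - a sketch, where Product and loadProductFromDB are hypothetical stand-ins for your cache-miss path:
import (
	"context"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

func getProduct(ctx context.Context, id string) (*Product, error) {
	// Concurrent callers with the same key share one loadProductFromDB call.
	v, err, _ := group.Do("product:"+id, func() (interface{}, error) {
		return loadProductFromDB(ctx, id) // the slow path runs at most once per key at a time
	})
	if err != nil {
		return nil, err
	}
	return v.(*Product), nil
}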
When Caching Makes Things Worse
Caching isn’t always the answer:
Low hit rate - If you’re only hitting the cache 20% of the time, you’re adding latency and complexity for minimal benefit. Caching works best for frequently accessed data.
Memory pressure - Caching uses memory. If your cache is thrashing (constantly evicting entries), you’re wasting memory and CPU on cache management.
Debugging nightmares - Stale data in caches makes bugs hard to reproduce. “Works on my machine” becomes “works when I bypass the cache.”
Premature caching - Don’t cache until you’ve proven the underlying operation is slow. I’ve seen teams cache API responses that take 10ms, where 50ms latency would be acceptable - the cache overhead is greater than the time saved.
Measure your cache hit rate. If it’s below 50%, reconsider whether caching is helping.
Part 6: Database Optimization
Databases are often the bottleneck. Not because databases are slow - modern databases are incredibly fast. But because we use them poorly.
Query Plans and Indexes: Read EXPLAIN Output
Every slow query starts with understanding the query plan. In PostgreSQL:
EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 10;
Compare the examples:
Bad - Sequential scan on a large table:
Seq Scan on orders  (cost=0.00..45892.00 rows=1523 width=124) (actual time=0.015..312.401 rows=1247 loops=1)
  Filter: (user_id = 123)
  Rows Removed by Filter: 1847293
Planning Time: 0.089 ms
Execution Time: 312.558 ms
The database read 1.8M rows to find 1,247 matches. 312ms for a simple lookup.
Good - Index scan:
Index Scan using idx_orders_user_created on orders  (cost=0.43..52.41 rows=10 width=124) (actual time=0.021..0.089 rows=10 loops=1)
  Index Cond: (user_id = 123)
Planning Time: 0.112 ms
Execution Time: 0.107 ms
The index jumped straight to the matching rows. 0.1ms.
Look for:
- Seq Scan (sequential scan): Bad for large tables. Means no index is being used.
- Index Scan: Good. Using an index.
- Index Only Scan: Even better. All data comes from the index, no table access needed.
- Nested Loop vs. Hash Join: Different join strategies with different performance characteristics.
The most common problem: missing indexes. If you’re filtering by user_id, you need an index on user_id.
CREATE INDEX idx_orders_user_id ON orders(user_id);
Compound indexes for queries with multiple filters:
-- Query: WHERE user_id = ? AND status = ?
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
Column order matters. The index can be used for queries that filter on just user_id, but not for queries that filter on just status.
Covering indexes include all columns needed by the query, eliminating the need to access the table:
-- Query: SELECT id, total FROM orders WHERE user_id = ?
CREATE INDEX idx_orders_user_covering ON orders(user_id) INCLUDE (id, total);
Now the database can satisfy the query entirely from the index.
Type mismatch gotcha: If orders.user_id is UUID but users.id is TEXT (or vice versa), your JOIN will be slow even with indexes on both columns. PostgreSQL can’t use an index when it needs to cast types - it falls back to a sequential scan. Check that JOIN columns have identical types, not just compatible ones.
Remember the cost: indexes slow down writes (every INSERT/UPDATE must update all indexes) and use disk space. Don’t index every column - index the columns used in the WHERE, JOIN, and ORDER BY clauses of frequent queries.
Read Replicas: Scaling Reads
If you’re read-heavy (most applications are), read replicas let you scale horizontally. The primary database handles writes, replicas handle reads.
The challenge: replication lag. Replicas are eventually consistent - they might be seconds or minutes behind the primary. This creates Read-Your-Writes problems: user updates their profile, immediately fetches it, sees old data because they hit a replica.
Solutions:
- Route critical reads to primary - After writes, read from primary for N seconds.
- Session affinity - Stick users to the same replica after writes.
- Explicit versioning - Client sends “read version ≥ X” parameter.
For most use cases, a few seconds of replication lag is acceptable. Analytics dashboards, search, reporting - these can tolerate staleness. User-facing features after writes need careful handling.
Connection Pooling: Size It Right
Database connections are expensive to create (TCP handshake, authentication, memory allocation on the database server). Connection pooling reuses connections across requests instead of creating new ones.
The counterintuitive part: more connections isn’t better. Each connection consumes memory on the database server, and when you have more active connections than CPU cores, the database spends time context-switching between them instead of doing useful work. A database with 100 active connections will often be slower than one with 20.
The PostgreSQL wiki suggests: connections = (core_count × 2) + effective_spindle_count. The “spindle count” is a legacy term from spinning disks - for SSDs, treat it as zero. So for an 8-core database: roughly 16 connections total, not per application instance. If you have 4 application instances, that’s 4 connections each.
In practice, start with 10-20 connections per application instance for most cloud databases. Monitor connection wait time - if requests are queuing for connections, you need more. But if database CPU is maxed, adding connections makes things worse; you need to optimize queries or scale the database.
Use a connection pooler: HikariCP (Java), pgBouncer (PostgreSQL), or your framework’s built-in pool. Configure max connections, idle timeout (close idle connections after N seconds), and max lifetime (recycle connections periodically to prevent leaks).
Denormalization: When to Break the Rules
Database normalization reduces redundancy but increases joins. Sometimes it’s faster to denormalize - store redundant data to avoid joins.
Example: storing customer_name on the orders table instead of joining to customers table. This makes order queries faster at the cost of duplicating customer names.
When to denormalize:
- High-frequency queries that require multiple joins
- Data that rarely changes
- Read-heavy workloads where write cost doesn’t matter
- Point-in-time snapshots - store the shipping address on the order, not a reference to the customer’s current address. You want the address as it was when they ordered, not whatever it is today.
The trade-off: data can become inconsistent if you’re not careful about which data should update and which shouldn’t. You need a strategy for keeping denormalized data in sync.
Materialized views are a database-native form of denormalization: pre-computed query results stored as a table.
CREATE MATERIALIZED VIEW user_order_summary AS
SELECT user_id, COUNT(*) as order_count, SUM(total) as total_spent
FROM orders
GROUP BY user_id;
CREATE INDEX ON user_order_summary(user_id);
Now querying user order summaries is fast - it’s just an index scan on the materialized view. Refresh the view periodically:
REFRESH MATERIALIZED VIEW user_order_summary;
The cost: materialized views are eventually consistent (stale until refreshed) and add storage.
Partitioning: Breaking Up Large Tables
When a table grows to millions or billions of rows, queries slow down even with good indexes. Partitioning splits one logical table into multiple physical tables.
Range partitioning by date:
CREATE TABLE orders (
id BIGSERIAL,
user_id INT,
created_at TIMESTAMP,
...
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2024 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
Queries that filter by created_at only scan relevant partitions. Old partitions can be archived or dropped.
Hash partitioning by user_id:
CREATE TABLE orders (...)
PARTITION BY HASH (user_id);
CREATE TABLE orders_p0 PARTITION OF orders
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
...
This spreads load across partitions, reducing hot spots.
The complexity: partition pruning (database needs to determine which partitions to scan) and maintenance (creating new partitions, managing old ones).
Part 7: Concurrency, Parallelism, and Asynchrony
Synchronous, serial processing is simple but slow. Making operations concurrent, parallel, or asynchronous can dramatically improve throughput and latency.
Moving Work Off the Critical Path
The critical path is the sequence of operations that must complete before returning a response. Everything else should happen asynchronously.
Example: user signs up. Critical path: validate email, check username availability, create account, return success. Non-critical: send welcome email, update analytics, trigger onboarding workflow.
Move non-critical work to background jobs. Use queues (SQS, Kafka) to decouple request handling from background work.
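The same decoupling can be sketched in-process with a buffered channel and a worker goroutine - a toy version of what SQS or Kafka gives you durably. createAccount, sendWelcomeEmail, and trackSignup are hypothetical helpers:
import "net/http"

// signupEvents is the queue that decouples the request from non-critical work.
var signupEvents = make(chan string, 1024)

func startSignupWorker() {
	go func() {
		for userID := range signupEvents {
			sendWelcomeEmail(userID) // failures here never slow down the request path
			trackSignup(userID)
		}
	}()
}

func handleSignup(w http.ResponseWriter, r *http.Request) {
	userID, err := createAccount(r) // critical path: validate and persist
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	select {
	case signupEvents <- userID: // enqueue and return immediately
	default: // queue full: degrade (drop, log, or fall back) rather than block the response
	}
	w.WriteHeader(http.StatusCreated)
}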
Fan-Out/Fan-In: Parallel Requests
When you need data from multiple sources, don’t fetch them serially - fetch in parallel.
// Serial: 300ms total (100ms × 3)
user := fetchUser(userID) // 100ms
orders := fetchOrders(userID) // 100ms
preferences := fetchPreferences(userID) // 100ms
// Parallel: 100ms total (all requests in parallel)
var wg sync.WaitGroup
var user User
var orders []Order
var preferences Preferences
wg.Add(3)
go func() {
user = fetchUser(userID)
wg.Done()
}()
go func() {
orders = fetchOrders(userID)
wg.Done()
}()
go func() {
preferences = fetchPreferences(userID)
wg.Done()
}()
wg.Wait()
Concurrency vs parallelism: Concurrency is handling multiple tasks (potentially interleaved). Parallelism is executing multiple tasks simultaneously. You can have 1000 concurrent goroutines, but if you have 8 CPU cores, at most 8 are executing at any instant.
For CPU-bound work, more concurrency than cores doesn’t help - it adds scheduler overhead as the runtime constantly switches between goroutines. For I/O-bound work (waiting on network, disk), high concurrency helps because goroutines yield while waiting.
But even for I/O-bound work, unlimited concurrency creates problems. Spawning 1000 goroutines to fetch 1000 items will overwhelm downstream services (they have connection limits too), exhaust file descriptors, and thrash memory. Use bounded concurrency - a worker pool with 10-50 workers processing items from a queue.
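A bounded fan-out sketch with a counting semaphore - 20 workers is an arbitrary cap, tune it against the downstream service’s limits; fetchItem and Result are placeholders:
import "sync"

func fetchAll(ids []string) []Result {
	const maxWorkers = 20
	sem := make(chan struct{}, maxWorkers) // counting semaphore
	results := make([]Result, len(ids))

	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		sem <- struct{}{} // blocks once maxWorkers fetches are in flight
		go func(i int, id string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = fetchItem(id) // I/O-bound call per item
		}(i, id)
	}
	wg.Wait()
	return results
}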
Handle partial failures gracefully. If 2 of 3 requests succeed, can you return partial data? Or must you fail the entire request?
Background Job Processing: Best Practices
Idempotent job handlers: Jobs may be retried (worker crash, timeout, network error). Ensure jobs can be run multiple times safely.
Job timeouts: Don’t let jobs run forever. Set a timeout. If the job hasn’t completed, kill it and retry.
Dead letter queues: After N retries, move jobs to a DLQ for manual investigation. Don’t retry forever.
Priority queues: Critical jobs should jump ahead of low-priority jobs (analytics updates).
Monitoring: Track queue depth, job processing time, failure rate. Alert when queues back up or failure rates spike.
Exactly-Once Effects with Outbox Pattern
When you need to both update state and publish an event (e.g., create order and send confirmation email), you have a consistency problem. If you update the database successfully but publishing the event fails, the order exists but no email is sent.
The outbox pattern solves this: write the state change and the event to the database in one transaction. A separate worker reads from the outbox and publishes events.
-- In a transaction:
INSERT INTO orders (user_id, total, status) VALUES (?, ?, 'pending');
INSERT INTO outbox (event_type, payload) VALUES ('order.created', '{"order_id": ...}');
COMMIT;
-- Separate worker:
SELECT * FROM outbox WHERE published_at IS NULL ORDER BY created_at;
-- Publish each event
-- Mark as published
UPDATE outbox SET published_at = NOW() WHERE id = ?;
This guarantees that if the order is created, the event will eventually be published. If the transaction fails, neither happens.
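The relay worker is a small loop - a sketch, where publishEvent (the broker call) and the polling interval are assumptions; production versions usually add batching, a retry counter, and LISTEN/NOTIFY or CDC instead of polling:
import (
	"context"
	"database/sql"
	"time"
)

func relayOutbox(ctx context.Context, db *sql.DB) {
	for {
		rows, err := db.QueryContext(ctx,
			`SELECT id, event_type, payload FROM outbox WHERE published_at IS NULL ORDER BY created_at LIMIT 100`)
		if err == nil {
			type event struct {
				id                 int64
				eventType, payload string
			}
			var batch []event
			for rows.Next() {
				var e event
				if rows.Scan(&e.id, &e.eventType, &e.payload) == nil {
					batch = append(batch, e)
				}
			}
			rows.Close()

			for _, e := range batch {
				if publishEvent(ctx, e.eventType, e.payload) == nil { // hypothetical broker publish
					// Mark as published only after the broker accepted the event.
					db.ExecContext(ctx, `UPDATE outbox SET published_at = NOW() WHERE id = ?`, e.id)
				}
			}
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second): // poll interval
		}
	}
}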
Part 8: Rate Limiting, Backpressure, and Load Shedding
Systems fail under excessive load. Rate limiting, backpressure, and load shedding are mechanisms for degrading gracefully instead of falling over.
Rate Limiting: Protecting Your API
Rate limiting restricts how many requests a client can make in a time window. This protects against abuse (malicious or accidental) and ensures fair resource allocation.
Per-user rate limits: 1000 requests/hour per API key. Prevents one user from monopolizing resources.
Per-IP rate limits: 100 requests/minute per IP. Prevents DDoS and brute force attacks.
Global rate limits: 100,000 requests/second across all users. Prevents total system overload.
Token bucket algorithm: Each client has a bucket of tokens. Each request consumes a token. Tokens regenerate at a fixed rate. When the bucket is empty, requests are rate limited.
Leaky bucket algorithm: Similar but smoother. Requests drain from the bucket at a constant rate.
Make sure to return 429 Too Many Requests with a Retry-After header telling clients when they can retry.
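A per-key token bucket sketch with golang.org/x/time/rate - the limits and the in-memory map are simplifications; multi-instance deployments usually keep counters in Redis or at the gateway so the limit holds globally:
import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{} // one token bucket per API key
)

func limiterFor(apiKey string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[apiKey]
	if !ok {
		l = rate.NewLimiter(rate.Limit(10), 20) // refill ~10 tokens/s, burst of 20
		limiters[apiKey] = l
	}
	return l
}

func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiterFor(r.Header.Get("X-API-Key")).Allow() {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}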
Backpressure: Pushing Back on Clients
When your system is overloaded, don’t accept more work - push back. This is backpressure.
Techniques:
- Return 503 Service Unavailable: Tells clients “I’m overloaded, try again later.”
- Increase response time: Slow down processing to reduce load (not recommended - better to reject).
- Queue depth limits: If your job queue hits 10,000 items, stop accepting new jobs.
The key: fail fast and explicitly rather than accepting work you can’t handle.
Load Shedding: Dropping Low-Priority Work
Under extreme load, shed low-priority requests to protect high-priority ones.
Example: Under normal load, serve both product pages and recommendations. Under high load, serve product pages but return cached/stale recommendations or skip them entirely.
Implement this with feature flags and priority queues. Critical requests bypass rate limits. Non-critical requests are shed first.
Circuit Breakers: Preventing Cascading Failures
When a dependency fails, don’t keep calling it indefinitely. Without circuit breakers, requests pile up waiting for timeouts from the failing service. Your connection pool fills up, threads block, and suddenly your entire API is slow - even endpoints that don’t use the failing dependency.
Circuit breakers detect failures and stop sending requests.
States:
- Closed: Normal operation. Requests flow through.
- Open: Too many failures. All requests fail immediately without calling the dependency.
- Half-open: After a timeout, try one request. If it succeeds, close the circuit. If it fails, re-open.
This prevents cascading failures. If your payment provider is down, your entire API doesn’t need to be down - return “payment temporarily unavailable” instead of timing out.
Examples of libraries: Hystrix (Java), gobreaker (Go), resilience4j (JVM).
Part 9: Horizontal Scaling and Statelessness
The easiest way to handle more load is to run more servers. Plan your architecture from the start to support this - retrofitting statelessness into a stateful system is painful. The goal: any request can be handled by any server instance, and you can add or remove instances without coordination.
Stateless Service Design
A stateless service stores no session state on the server. Every request contains all the information needed to process it (auth token, user ID, etc.).
This means any server can handle any request. Load balancers can distribute requests evenly. Servers can be added or removed without affecting users.
Where to put session state:
- Client-side: Cookies, JWTs. Best for small amounts of data (user ID, permissions).
- External store: Redis, Memcached. For larger session data (shopping cart, user preferences).
- Database: For persistent state that must survive server restarts.
Avoid server-side memory for session state. As soon as you do that, you need sticky sessions (route users to the same server), which limits scaling and complicates deployments.
Load Balancing Strategies
Round-robin: Distribute requests evenly across servers. Simple, works well if all servers are equal and all requests are similar.
Least connections: Route to the server with fewest active connections. Better than round-robin if requests have variable duration.
Consistent hashing: Hash user ID to determine which server handles their requests. Provides cache affinity (same user hits same server, so caching is more effective). Trade-off: uneven load distribution.
Weighted routing: Assign weights to servers based on capacity. New servers get 10% traffic initially, gradually increase to 100% as they warm up.
For most applications, round-robin or least-connections is sufficient. Use consistent hashing only if cache affinity matters significantly.
Auto-Scaling: Adding Capacity Automatically
Auto-scaling adds/removes servers based on load. Trigger on metrics like:
- CPU utilization: Scale up if CPU >70% for 5 minutes.
- Request rate: Scale up if requests/second >10,000.
- Queue depth: Scale up if job queue depth >1000.
Set cooldown periods to prevent thrashing (scaling up and down rapidly). Scaling up should be faster than scaling down (better to over-provision briefly than under-provision).
Predictive scaling: Pre-scale before known traffic spikes (product launches, marketing campaigns). Don’t wait for autoscaling to react.
The cost of cold starts: new servers take time to warm up (load code, establish connections, fill caches). Keep a warm pool of spare instances to reduce cold start time.
Part 10: External Dependencies and Failure Handling
Your API is only as reliable as its slowest, least reliable dependency. External APIs fail, timeout, rate limit, or return errors. Without proper handling, a single failing dependency ties up threads waiting for timeouts, exhausts connection pools, and degrades performance for all requests - even those that don’t touch the failing service.
Circuit Breakers for Dependencies
We covered circuit breakers earlier. Consider them for critical external dependencies, especially those with a history of instability. When a dependency fails repeatedly, stop calling it and return a fallback response.
Timeouts: Be Aggressive
Set timeouts on all external calls. Default timeouts (30s or infinite) are almost always wrong. If your SLO is 200ms, you can’t wait 30 seconds for a dependency.
Set timeouts based on your latency budget. If you have 200ms to respond and you call 3 services, each service gets ~60ms timeout (allowing 20ms for your own processing).
Propagate timeouts with context deadlines (Go) or similar mechanisms in other languages. If the user’s request has 200ms remaining, all downstream calls inherit that deadline.
Retries: Be Careful
Retries can cause retry storms - if a service is struggling, retries amplify the load and push it over the edge. Every client retrying simultaneously turns a partial outage into a complete one.
Retry only idempotent operations. GET requests are safe to retry. POST/PUT/DELETE require idempotency keys.
Use exponential backoff with jitter. Don’t retry immediately. Wait 100ms, then 200ms, then 400ms - depending on the latency budget available. Add jitter so retries don’t synchronize.
Limit total retry attempts. Don’t retry forever. After a few attempts, fail and let the circuit breaker handle it.
Retry budgets: Limit the percentage of requests that can be retries. If more than 10% of your traffic is retries, stop retrying - the system is degraded.
Fallback Strategies
When a dependency fails, what do you return?
Stale data: Serve cached data even if it’s expired. Better to show slightly stale product prices than an error.
Default values: If the recommendations service is down, show popular items instead of personalized recommendations.
Degraded experience: If payment processing is down, allow users to place orders but warn “payment processing delayed.”
Partial responses: If fetching 10 items and 1 fails, return the 9 that succeeded.
The key: design for degraded operation from the start. Don’t add error handling after launch - by then your API contracts are inflexible.
Hedged Requests
For reads where latency matters more than resource usage, send duplicate requests to multiple backends and use whichever responds first.
Example: Query 3 database replicas simultaneously. Most of the time, they all respond quickly. Occasionally, one is slow (GC pause, disk contention). By querying all 3, you get the fastest response.
The trade-off: 3x resource usage for a modest latency improvement (usually 5-20% improvement in p99). Worth it for critical read paths where latency is paramount.
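A sketch of a hedged read against two replicas: fire the primary, and if it hasn’t answered within a small delay, fire a backup and take whichever result arrives first. Replica, Value, and the 20ms hedge delay (roughly the primary’s p95) are assumptions:
import (
	"context"
	"time"
)

func hedgedGet(ctx context.Context, key string, primary, backup Replica) (Value, error) {
	type result struct {
		v   Value
		err error
	}
	results := make(chan result, 2) // buffered so the slower response doesn't leak a goroutine

	go func() {
		v, err := primary.Get(ctx, key)
		results <- result{v, err}
	}()

	select {
	case res := <-results:
		return res.v, res.err // primary answered within the hedge delay
	case <-time.After(20 * time.Millisecond):
		// Primary is slow: hedge to the backup and take whichever answers first.
		go func() {
			v, err := backup.Get(ctx, key)
			results <- result{v, err}
		}()
		select {
		case res := <-results:
			return res.v, res.err
		case <-ctx.Done():
			return Value{}, ctx.Err()
		}
	case <-ctx.Done():
		return Value{}, ctx.Err()
	}
}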
Part 11: High-Scale Patterns
At web scale (millions of requests/second, terabytes of data), the patterns that worked at smaller scale stop working. You need different architectures.
CQRS: Separate Read and Write Models
Reads and writes have different requirements - consistency guarantees, scaling characteristics, latency tolerances, data shapes. Forcing both through one model means optimizing for neither.
CQRS separates them. Your write path can prioritize consistency and validation. Your read path can prioritize speed - denormalized, cached, indexed for your specific query patterns, maybe in a completely different store. An async process keeps them in sync.
The cost: operational complexity and consistency lag. Users might not see their write reflected in search results for a few seconds. Worth it when you’re read-heavy (10:1+) and query complexity is killing performance. Overkill for simple CRUD where a few indexes solve the problem.
Event Sourcing
Append-only writes are faster than updates - no locking, no constraint checks against existing rows. Event sourcing leans into this: store state changes as immutable events, derive current state from projections built off those events.
The performance win is twofold. Writes are cheap appends. Reads come from projections you control - build multiple projections optimized for different query patterns, rebuild them when requirements change. Audit trails and time travel come as a bonus.
The costs: complexity, storage growth (events are never deleted), and you need snapshotting to avoid replaying millions of events. Worth it when write throughput matters and you need flexible read optimization. Overkill for most CRUD apps.
Sharding: Splitting Data Across Databases
When one database can’t handle the load or data size, shard: split data across multiple databases.
Range-based sharding: Users 0-1M on DB1, 1M-2M on DB2, etc. Simple but creates hot spots (most writes go to the latest range).
Hash-based sharding: Hash user ID, route to DB based on hash. Even distribution but can’t do range queries across shards.
Tenant-based sharding: Each customer’s data on separate DB. Perfect isolation but hard to rebalance.
Sharding adds complexity: cross-shard queries are expensive, transactions across shards are hard, rebalancing shards is operational pain.
When to use: When you’ve exhausted vertical scaling (bigger database server) and read replicas aren’t enough.
Multi-Region Deployments
For global applications, serving users from multiple regions reduces latency. But multi-region adds complexity.
Active-passive: One region serves all traffic. Other regions are standby for disaster recovery. Simple, but cross-region latency for most users.
Active-active: All regions serve traffic. Route users to nearest region. Better latency, but requires data replication and conflict resolution.
Data locality: EU users’ data stays in EU (GDPR). US users’ data stays in US. Requires partitioning data by geography and careful routing.
The hard part: consistency. If a user writes in US and reads in EU, how do you ensure they see their write? Options: synchronous replication (slow), eventual consistency with conflict resolution (complex), or sticky routing (route users to one region for writes).
Part 12: Edge Computing and Geographic Performance
The speed of light is your ultimate constraint. A round trip from US to Europe takes 80ms minimum. Caching at the edge can eliminate these trips.
CDN Strategy
Content Delivery Networks cache content at edge locations close to users. This works great for static assets (images, JS, CSS) and cacheable API responses.
What to cache: Static content (forever), semi-static content (product catalogs with 5-minute TTL), personalized content (with user-specific cache keys).
Cache keys: Include all factors that affect the response. If response varies by user, include user ID in cache key. If it varies by region, include region.
Signed URLs: For private content, generate time-limited signed URLs. The CDN caches content but only serves it to users with valid signatures.
Cache hit optimization: Normalize cache keys (lowercase, sorted query params) to maximize hits. Use consistent URL schemes. Enable query string whitelisting (only cache on specific params).
Routing: Use latency-based routing over geolocation routing - closest region isn’t always fastest due to network topology. Route53, Cloudflare, and most CDNs support this.
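To illustrate key normalization and query-string whitelisting, here's a small Go helper - hypothetical and not tied to any particular CDN (most CDNs let you configure this declaratively) - that lowercases host and path, drops non-whitelisted params, and sorts what remains so equivalent URLs collapse to a single cache entry.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// normalizeCacheKey builds a stable cache key from a request URL:
// lowercase host and path, whitelisted query params only, sorted order.
func normalizeCacheKey(rawURL string, allowedParams []string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	allowed := map[string]bool{}
	for _, p := range allowedParams {
		allowed[p] = true
	}
	q := u.Query()
	keys := make([]string, 0, len(q))
	for k := range q {
		if allowed[k] { // drop params that don't affect the response (e.g. tracking)
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+q.Get(k))
	}
	return strings.ToLower(u.Host+u.Path) + "?" + strings.Join(parts, "&"), nil
}

func main() {
	// Both URLs map to the same cache key despite param order, casing, and tracking noise.
	k1, _ := normalizeCacheKey("https://API.example.com/Products?b=2&a=1&utm_source=x", []string{"a", "b"})
	k2, _ := normalizeCacheKey("https://api.example.com/products?a=1&b=2", []string{"a", "b"})
	fmt.Println(k1 == k2, k1)
}
```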
Edge Compute
Modern CDNs (Cloudflare Workers, Lambda@Edge, Fastly Compute@Edge) let you run code at the edge. This enables:
Personalization: Customize responses based on user’s location, device, preferences without a full backend round trip.
A/B testing: Route users to different variants at the edge.
Auth checks: Validate JWTs at the edge, reject unauthorized requests without hitting your backend.
Request routing: Route requests to different backends based on user, region, or feature flags.
The limitations: cold starts (though improving), execution time limits (typically 50-500ms), limited memory, limited dependencies.
When it works: Simple logic (auth, routing, personalization). When it doesn’t: Complex business logic, database queries, long-running operations.
Part 13: Runtime and Infrastructure Tuning
Sometimes the problem isn’t your code - it’s how it’s running.
Language-Specific Tuning
Go: GOMAXPROCS defaults to the host's core count, so in CPU-limited containers set it to match the limit (uber's automaxprocs does this automatically). Watch for goroutine leaks. Monitor GC pauses. Use pprof for CPU and memory profiling - a minimal setup is sketched below.
Node.js: Avoid blocking the event loop. Use worker threads for CPU-intensive tasks. Cluster mode to use multiple CPU cores. Watch for memory leaks in callbacks.
Python: The GIL limits multi-threading for CPU-bound tasks. Use multiprocessing instead. Async/await for I/O-bound tasks. Profile with py-spy or cProfile.
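Taking Go as the example, exposing the standard net/http/pprof handlers on an internal-only port is usually the first step:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve profiling endpoints on a separate, internal-only port so
	// production traffic never reaches them.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the real application server would start here ...
	select {} // block forever, just for the sake of the sketch
}
```

From there, `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` captures a CPU profile, and `/debug/pprof/heap` covers memory.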
Container and Kubernetes Tuning
CPU and memory limits: Set both requests (guaranteed resources) and limits (maximum resources). Too low: throttling and OOMKills. Too high: wasted resources.
Resource quotas: In Kubernetes, set CPU and memory at the namespace level to prevent resource exhaustion.
HPA (Horizontal Pod Autoscaler): Scale pods based on CPU, memory, or custom metrics. Set appropriate thresholds (70-80% CPU) and cooldown periods.
Service Mesh Trade-offs
Service meshes (Istio, Linkerd) add a sidecar proxy to every pod. This provides:
- mTLS between services
- Observability (automatic tracing and metrics)
- Traffic management (retries, timeouts, circuit breakers)
The cost: added latency (every request goes through proxy), resource overhead (sidecars use CPU/memory), operational complexity.
Sidecar per pod: Traditional service mesh. Adds 1-5ms latency per hop.
Ambient mesh: Newer approach with shared proxies instead of per-pod sidecars. Reduces resource overhead.
Mesh-less: Use libraries for observability and resilience instead of proxies. Zero latency overhead but less flexibility.
Service meshes are worth it if you need what they provide (mTLS, observability, traffic management). But budget for the latency overhead and learn to tune it - DestinationRules, connection pooling, outlier detection. A misconfigured mesh is slower than no mesh.
Part 14: Security Without Slowing Down
Security and performance are often at odds. Every security check adds latency. The goal is to make security cheap enough to not hurt performance.
TLS Optimization
TLS handshakes are expensive (a full TLS 1.2 handshake takes 2 RTTs; TLS 1.3 cuts that to 1). Optimizations:
Session resumption: Reuse TLS sessions across connections. This skips the expensive handshake. Enable session tickets on your servers.
OCSP stapling: Instead of clients checking certificate revocation online (adds RTT), the server includes the OCSP response. Reduces latency by 100-200ms.
Certificate chain optimization: Keep certificate chains short. Each additional certificate in the chain adds payload size and validation time.
TLS termination at the edge: Terminate TLS at the load balancer or CDN, use plain HTTP internally. This reduces per-request TLS overhead on application servers.
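On the client side of service-to-service calls, much of this is a one-time transport configuration. A Go sketch - values are illustrative, not tuned recommendations:

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"
)

func main() {
	// Session resumption plus connection reuse: repeat connections to the
	// same host skip the full handshake, and pooled idle connections skip
	// the handshake entirely.
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			ClientSessionCache: tls.NewLRUClientSessionCache(256),
			MinVersion:         tls.VersionTLS12,
		},
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	}
	client := &http.Client{Transport: transport, Timeout: 5 * time.Second}
	_ = client // use this client for outbound calls instead of http.DefaultClient
}
```

Connection reuse matters at least as much as resumption: a warm idle pool means most requests never handshake at all.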
Authentication and Authorization
JWTs are self-validating - that’s the point. Cache the signer’s public key in memory and validate signatures locally. No network call per request, no auth service hot spot. Refresh the key periodically (e.g. hourly) or when validation fails.
Avoid calling an external JWKS endpoint or internal auth service on every request. This creates a bottleneck that becomes increasingly painful as your traffic grows - exactly what you don’t want.
Token lifetimes: Balance security vs. refresh overhead. Shorter tokens (15-30 min) mean more refresh traffic but lower risk if compromised. Longer tokens (1-2 hours) mean less overhead but bigger blast radius.
Authorization: If permissions live in an external service, cache decisions. A few minutes of staleness is usually acceptable. The alternative - calling authZ on every request - creates the same hot spot problem.
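A sketch of local validation with an in-memory key cache, using github.com/golang-jwt/jwt/v5 as one example library. The `fetch` callback is a hypothetical stand-in for the JWKS download; in this sketch it just returns a locally generated key so the example runs standalone.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"fmt"
	"sync"
	"time"

	"github.com/golang-jwt/jwt/v5"
)

// keyCache holds the identity provider's public key in memory so token
// validation never needs a network call on the request path.
type keyCache struct {
	mu        sync.RWMutex
	key       *rsa.PublicKey
	fetchedAt time.Time
}

func (c *keyCache) get() *rsa.PublicKey {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.key
}

// refresh runs in the background (e.g. hourly, or on validation failure),
// never per request. The fetch callback is hypothetical wiring.
func (c *keyCache) refresh(fetch func() (*rsa.PublicKey, error)) error {
	k, err := fetch()
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.key, c.fetchedAt = k, time.Now()
	c.mu.Unlock()
	return nil
}

// validate checks the token signature locally against the cached key.
func validate(cache *keyCache, tokenString string) (*jwt.Token, error) {
	return cache.getToken(tokenString)
}

func (c *keyCache) getToken(tokenString string) (*jwt.Token, error) {
	return jwt.Parse(tokenString, func(t *jwt.Token) (interface{}, error) {
		if _, ok := t.Method.(*jwt.SigningMethodRSA); !ok {
			return nil, fmt.Errorf("unexpected signing method %v", t.Header["alg"])
		}
		return c.get(), nil
	})
}

func main() {
	cache := &keyCache{}
	// Stand-in for fetching the JWKS from the identity provider.
	priv, _ := rsa.GenerateKey(rand.Reader, 2048)
	_ = cache.refresh(func() (*rsa.PublicKey, error) { return &priv.PublicKey, nil })

	_, err := validate(cache, "not-a-real-token")
	fmt.Println("expected failure on a bogus token:", err)
}
```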
WAF and Bot Mitigation
Web Application Firewalls (WAFs) inspect every request for attacks. This adds latency (1-5ms typically) but prevents attacks.
Run the WAF at the edge (Cloudflare, CloudFront) instead of on your application servers. This reduces latency and offloads CPU.
Bot mitigation (CAPTCHAs, fingerprinting) should happen at the edge too. Most bot traffic never reaches your application servers.
Final Thoughts
Performance optimization is a craft. You get better with practice, pattern recognition, and learning from mistakes. Build with performance in mind from the start - connection pooling, timeouts, cacheable APIs. These decisions are cheap early and expensive later. But don’t optimize without data. Measure first. The bottleneck is rarely where you think it is.
Work with your team. Performance isn’t just the backend team’s job. Product decides which features are worth the performance cost. Frontend decides how many API calls to make. Mobile decides when to fetch data. DevOps decides how to scale infrastructure. Everybody’s decisions affect performance.
Be pragmatic. An API that responds in 100ms with occasionally stale data beats one that responds in 500ms with perfect freshness. Know your trade-offs.
Users don’t care about your p50 or your architecture. They care whether it feels fast. Sometimes that’s a faster API. Sometimes it’s a smarter UI. Performance is a feature - treat it like one.