
The Cloud Cost Optimization Playbook
A practical guide to cutting cloud costs without breaking things.
Part 1: Foundation & Strategy
1. Introduction: Why Cloud Costs Spiral (And How to Fix It)
Cloud costs spiral because creation and destruction are asymmetric. Provisioning takes one CLI command. Deleting requires archaeology: which team owns this? Is anything using it? Who left and took the context with them? Engineers are rewarded for shipping, not cleaning up. So resources accumulate, ownership decays, and the bill compounds.
The lifecycle is predictable: engineer provisions infrastructure, project ships or dies, engineer moves teams or leaves. Resources remain. After two reorgs, provenance is lost. Multiply by headcount and years, and you’ve got a graveyard nobody’s authorized to clean.
Storage has no natural predator. Compute has feedback loops - utilization metrics, rightsizing recommendations, capacity planning. Storage just sits there, accruing silently, until someone discovers you’re paying $30k/month for logs from a service decommissioned in 2021.
The Two-Phase Framework
When it comes to cost cutting, I recommend a two-phase approach.
Phase 1: Low-Hanging Fruit. Optimize what you have without changing how systems work. Commitment-based pricing (30-70% off for stable workloads), rightsizing overprovisioned instances, lifecycle policies for storage, cleaning up orphaned volumes and snapshots. One company I worked with found $15k/month in zombie resources just by running cleanup scripts.
Phase 2: Rethink & Refactor. Challenge assumptions. Define retention policies instead of keeping everything forever. Move always-on workloads to serverless. Consolidate five Redis clusters into one. Ask whether dev really needs to run 24/7. This phase requires stakeholder buy-in - product, legal, data governance - because you’re changing what you provide, not just how you provide it.
In my experience, 80% of savings come from 20% of the effort - Phase 1 work you can pick up between other tasks. Phase 2 matters, but it requires buy-in, and buy-in comes after you’ve already proven the wins.
2. Step 0: Get Your House in Order
Before you can optimize anything, you need to know what you’re paying for. This starts with tagging. You can’t optimize what you can’t attribute. Once you know who owns something, you can investigate whether it’s still needed and to what extent.
Why Tagging Matters
Without proper tags, your cost reports are useless. You’ll see “$47,000 spent on EC2” but have no idea which team, product, or environment is responsible. You can’t build accountability, you can’t do chargebacks, and you can’t prioritize optimization efforts. Tagging turns your billing data from a single intimidating number into actionable intelligence.
Every resource should carry metadata that answers these questions:
- Environment: Is this prod, staging, dev, or someone’s sandbox?
- Team/Owner: Who’s responsible when this goes wrong or costs too much?
- Service/Application: Which product or microservice does this support?
- Cost Center: Which business unit should be charged?
- Geography/Region: Any location or compliance requirements?
Enforcing Standards (Because Engineers Will Forget)
Tagging policies only work if they’re enforced programmatically.
In AWS, use Tag Policies within AWS Organizations to standardize tag keys and values, and pair them with Service Control Policies (SCPs) that deny resource creation when required tags are missing. If someone tries to launch an EC2 instance without a Team tag, the API call fails. It’s harsh, but it works.
In GCP, use Organization Policy constraints: constraints/gcp.resourceLocations restricts where resources can be created, and custom organization policy constraints can require labels such as env and team on supported resource types before they can be created.
You can soften the blow by starting with a grace period - e.g. 30 days where tagging is required but unenforced - then flip the switch. Announce it loudly, update your IaC templates, and prepare for the Slack complaints.
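During the grace period, an audit script keeps the pressure on. A minimal boto3 sketch - assuming Team and Environment are your required keys - that lists EC2 instances missing them:
import boto3

REQUIRED_TAGS = {'Team', 'Environment'}  # whatever your 3-4 essential tags are

ec2 = boto3.client('ec2')
for page in ec2.get_paginator('describe_instances').paginate():
    for reservation in page['Reservations']:
        for instance in reservation['Instances']:
            tags = {t['Key'] for t in instance.get('Tags', [])}
            missing = REQUIRED_TAGS - tags
            if missing:
                print(f"{instance['InstanceId']} is missing: {', '.join(sorted(missing))}")
Post the output in the offending teams' Slack channels each week of the grace period; by the time enforcement flips on, most of the gaps are already closed.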
Common Tagging Anti-Patterns
Things to avoid:
- Too many required tags: If you require 15 tags, nobody will comply. Start with 3-4 essential ones.
- Inconsistent values: Is it production, prod, or prd? Define allowed values and enforce them.
- No automation: Make validated tags part of your shared modules - don’t expect people to remember to manually tag resources after the fact.
3. Discovery: Finding Where the Money Goes
Both AWS and GCP have native cost tools. The basics:
- AWS: Cost Explorer for interactive analysis, Budgets for alerts and anomaly detection, Cost and Usage Report (CUR) for raw data you can query with Athena, Compute Optimizer for rightsizing recommendations.
- GCP: Cloud Billing Reports, Recommender for optimization suggestions, Active Assist for proactive alerts.
Start with Cost Explorer or Billing Reports grouped by service over the last 3 months. That alone usually reveals the biggest opportunities.
Building Useful Reports
Most cost dashboards are vanity projects. Someone builds a beautiful visualization in Looker, everyone admires it in the team meeting, nobody looks at it again.
Useful reports answer specific questions:
- Which team’s spending grew the most last month?
- What are my top 10 most expensive individual resources?
- How much am I spending on dev environments that should be shut down at night?
- Which services have the worst cost-per-customer ratio?
Build reports that map to decisions. If your report shows “Compute costs: $45k,” that’s useless. If it shows “Team Alpha’s staging environment is costing $8k/month and runs 24/7,” that’s actionable.
The “Top Offenders” Exercise
Run this exercise quarterly: Generate lists of your top offenders across multiple dimensions.
By total cost:
- Top 10 AWS accounts or GCP projects
- Top 10 services (S3, EC2, RDS, etc.)
- Top 10 individual resources (specific buckets, instances, databases)
By growth rate:
- Services where month-over-month spending grew >20%
- Teams whose costs increased significantly
- New resource types you weren’t spending on 6 months ago
The growth rate analysis is often more revealing than absolute cost. A $100k/month RDS bill might be acceptable if it’s been stable for a year and supports your core product. A $5k/month service that was $500 three months ago is a red flag.
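If your Team tag is activated as a cost-allocation tag, you can script this kind of growth report instead of clicking through the console. A rough boto3 sketch - the dates are placeholders for two full months:
import boto3

ce = boto3.client('ce')  # Cost Explorer

resp = ce.get_cost_and_usage(
    TimePeriod={'Start': '2024-08-01', 'End': '2024-10-01'},  # two full months
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'TAG', 'Key': 'Team'}],
)

def by_team(result):
    return {g['Keys'][0]: float(g['Metrics']['UnblendedCost']['Amount'])
            for g in result['Groups']}

previous, current = by_team(resp['ResultsByTime'][0]), by_team(resp['ResultsByTime'][1])
for team, spend in sorted(current.items(), key=lambda kv: -kv[1]):
    delta = spend - previous.get(team, 0.0)
    print(f"{team:40s} ${spend:>10,.2f} ({delta:+,.2f} vs last month)")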
Prioritization: Effort vs. Savings Matrix
You can’t optimize everything at once, so build a prioritization matrix. For each optimization opportunity, estimate:
- Savings potential: How much money will this save per month/year?
- Implementation effort: How many engineering hours will this take? (Use t-shirt sizes: S/M/L)
- Risk level: What could break if this goes wrong? (Low/Medium/High)
Plot these on a simple 2x2 matrix: Effort (x-axis) vs. Savings (y-axis). Your priorities are the high-savings, low-effort quadrant. These are your quick wins.
Common examples that fall into this quadrant:
- Enabling lifecycle policies on S3 buckets (5 minutes of work, potentially thousands in savings)
- Deleting unattached EBS volumes (one-line AWS CLI command, immediate savings)
- Stopping dev/staging instances at night (simple Lambda function, 60-70% cost reduction)
Save the high-effort, high-risk optimizations for later, after you’ve built credibility with the quick wins.
Part 2: Service-Specific Playbooks
The following sections apply the two-phase framework to each major cost center. Phase 1 for each service covers quick wins; Phase 2 covers the harder architectural questions.
1. Object Storage (S3 / Cloud Storage)
Object storage is deceptive. It feels cheap - a few cents per gigabyte - until you realize you’re storing 500TB of data you haven’t touched in 2 years. S3 and Cloud Storage costs sneak up on teams because storage accumulates by default and deletion requires active effort.
Phase 1: Lifecycle Policies and Intelligent Tiering
The fastest win with object storage is enabling lifecycle policies. These automatically transition objects to cheaper storage classes based on age or access patterns, or delete them entirely after a retention period.
Start by identifying your most expensive buckets. In AWS, use S3 Storage Lens to see bucket-level costs and access patterns:
aws s3api list-buckets --query 'Buckets[].Name' | xargs -I {} \
aws s3api get-bucket-location --bucket {}
Then check CloudWatch metrics for each bucket to see size and request patterns. You’re looking for large buckets with infrequent access - those are prime candidates for tiering.
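If you don’t have Storage Lens enabled yet, a rough boto3 sketch like this pulls bucket sizes from CloudWatch. It assumes the buckets live in your default region and only counts the Standard storage class; other classes report under different StorageType values:
import boto3
from datetime import datetime, timedelta

s3 = boto3.client('s3')
cloudwatch = boto3.client('cloudwatch')

for bucket in s3.list_buckets()['Buckets']:
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName='BucketSizeBytes',  # published once per day
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket['Name']},
            {'Name': 'StorageType', 'Value': 'StandardStorage'},
        ],
        StartTime=datetime.utcnow() - timedelta(days=2),
        EndTime=datetime.utcnow(),
        Period=86400,
        Statistics=['Average'],
    )
    datapoints = sorted(stats['Datapoints'], key=lambda p: p['Timestamp'])
    if datapoints:
        size_gb = datapoints[-1]['Average'] / 1024**3
        print(f"{bucket['Name']}: {size_gb:,.1f} GB in Standard storage")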
A typical lifecycle policy for application logs might look like this:
- Keep in Standard storage for 30 days (frequent access for debugging recent issues)
- Transition to Infrequent Access (IA) after 30 days (occasional access for historical analysis)
- Transition to Glacier after 90 days (rare access, compliance-driven retention)
- Delete after 1 year (unless legal/compliance says otherwise)
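Expressed as a boto3 call, that policy might look like the sketch below - the bucket name and exact day thresholds are illustrative:
import boto3

s3 = boto3.client('s3')

# Sketch: lifecycle rule matching the tiers above (bucket name is a placeholder)
s3.put_bucket_lifecycle_configuration(
    Bucket='your-log-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'log-retention',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},  # apply to the whole bucket
            'Transitions': [
                {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                {'Days': 90, 'StorageClass': 'GLACIER'},
            ],
            'Expiration': {'Days': 365},
        }]
    },
)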
Most teams set up lifecycle policies once and never revisit them. They’ll have a policy that says “transition to Glacier after 30 days” even though nobody has ever retrieved anything from Glacier because the restore time is too slow for their use case. You end up paying for Glacier storage AND paying retrieval fees when someone inevitably needs that data.
S3 Intelligent-Tiering solves this by monitoring access patterns and automatically moving objects between tiers. It costs a bit extra (a monitoring fee per 1000 objects), but it’s worth it for buckets where access patterns are unpredictable:
resource "aws_s3_bucket_intelligent_tiering_configuration" "entire_bucket" {
bucket = aws_s3_bucket.logs.id
name = "EntireBucket"
tiering {
access_tier = "ARCHIVE_ACCESS"
days = 90
}
tiering {
access_tier = "DEEP_ARCHIVE_ACCESS"
days = 180
}
}
In GCP, the equivalent is Autoclass for Cloud Storage buckets, which automatically adjusts storage classes based on access. Simpler than S3’s tiering because GCP has fewer storage classes:
resource "google_storage_bucket" "logs" {
name = "your-bucket-name"
location = "US"
autoclass {
enabled = true
terminal_storage_class = "ARCHIVE"
}
}
Common mistake: Applying lifecycle policies without checking access patterns first. I’ve seen teams transition objects to Glacier, then pay 10x in retrieval costs because their application actually accesses that data regularly. Use S3 Storage Lens or GCS Storage Insights to understand access patterns before you tier.
Another gotcha: incomplete multipart uploads. When a multipart upload fails halfway through, S3 keeps the incomplete parts around indefinitely, charging you storage. Set up a lifecycle rule to delete them:
resource "aws_s3_bucket_lifecycle_configuration" "cleanup" {
bucket = aws_s3_bucket.logs.id
rule {
id = "delete-incomplete-uploads"
status = "Enabled"
abort_incomplete_multipart_upload {
days_after_initiation = 7
}
}
}
Phase 2: Data Retention and Pruning Strategies
Phase 2 is about asking the hard question: do we need to store this at all?
Most engineering teams are data hoarders. “We might need it for analysis someday” or “what if a customer asks for historical data?” So you keep everything. Three years of application logs. Every version of every ML model. Debug snapshots from that incident in 2021. And you’re paying for all of it.
Building a retention policy:
First, inventory what you have. Group your storage by purpose: application logs, database backups, ML datasets, user uploads, build artifacts, etc. For each category, find the data owner - the team or person who can make retention decisions.
Then, ask three questions:
Legal/compliance requirements: What does the law or your industry regulations require you to keep? Healthcare might be 7 years, financial services might be 6. Most tech companies have no mandated retention periods - just obligations to honor deletion requests (GDPR, CCPA).
Operational necessity: How far back do you actually look when debugging issues or analyzing trends? Most teams say “we need a year of logs” but in reality they only look at the last 30 days.
Storage cost vs. retrieval likelihood: If it costs $5k/month to store something you’ve never accessed, delete it. If it costs $5k/month and you access it weekly, keep it (but maybe optimize the access pattern).
A reasonable retention policy for most SaaS applications:
- Application logs: 30 days in hot storage (Standard), 90 days in warm storage (IA/Nearline), delete after 6 months unless compliance requires longer
- Database backups: 30 daily backups, 8 weekly backups, 12 monthly backups, then delete
- ML training data: Keep the last 5 versions of datasets, archive anything older than 6 months
- Build artifacts: Keep the last 10 successful builds per branch, delete everything else after 90 days
Implementing this requires both automation and communication. Use lifecycle policies for the automated deletions, but send notifications before you delete anything significant:
# Example: Find S3 objects older than 5 years and notify owners
import boto3
from datetime import datetime, timedelta
s3 = boto3.client('s3')
cutoff = datetime.now() - timedelta(days=365 * 5)
for bucket in ['logs-bucket', 'backups-bucket']:
    old_objects = []
    # Paginate: list_objects_v2 returns at most 1,000 keys per call
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
        old_objects += [
            obj for obj in page.get('Contents', [])
            if obj['LastModified'].replace(tzinfo=None) < cutoff
        ]
    if old_objects:
        print(f"Bucket {bucket}: {len(old_objects)} objects older than 5 years")
        # Send notification to bucket owner
        # Schedule deletion in 30 days if no objection
The hardest part isn’t the technical implementation - it’s getting teams to agree that deletion is okay. You’ll encounter “but what if…” objections constantly. Document the retention policy, get leadership sign-off, and be prepared to point to it when people push back.
One more thing: deduplicate before you delete. If you’re storing multiple copies of the same data (common with backup systems or ML pipelines), dedupe first and you might cut storage by 30-50% without deleting anything. Tools like rclone dedupe or S3 Batch Operations with checksums can help.
2. Compute (EC2, ECS, EKS, Lambda / GCE, GKE, Cloud Run, Functions)
Compute is usually your biggest cost center. It’s also where optimization has the most impact - cutting compute by 30% can save hundreds of thousands annually for mid-sized teams.
Phase 1: Rightsizing, Scheduling, and Commitments
Rightsizing means matching instance size to actual usage.
Run AWS Compute Optimizer to get rightsizing recommendations:
aws compute-optimizer get-ec2-instance-recommendations \
--max-results 100 \
--query 'instanceRecommendations[?finding==`Overprovisioned`].[instanceArn,currentInstanceType,recommendationOptions[0].instanceType,recommendationOptions[0].estimatedMonthlySavings.value]' \
--output table
The output will look like this:
----------------------------------------------------------------
| GetEC2InstanceRecommendations |
+------------------+----------+--------------+------------------+
| Instance ARN | Current | Recommended | Monthly Savings |
+------------------+----------+--------------+------------------+
| arn:aws:ec2:... | m5.2xl | m5.large | $187.32 |
| arn:aws:ec2:... | c5.4xl | c5.xlarge | $312.50 |
+------------------+----------+--------------+------------------+
What you’re looking at: Compute Optimizer analyzed CPU, memory, network, and disk metrics over the last 14 days and determined these instances are way overprovisioned. That m5.2xlarge could be an m5.large (1/4 the size, 1/4 the price) without performance degradation.
It’s easy to disregard the recommendations on the basis that the overprovisioning accounts for potential traffic spikes. But that’s what autoscaling is for. Use metric-based scaling (CPU, memory, or RPS - whichever is your actual bottleneck) for reactive scaling, or scheduled scaling if you have predictable traffic patterns and need capacity ready before the spike hits. Either way, smaller instances + autoscaling beats permanently overprovisioned instances.
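As a sketch of the scheduled-scaling option - the Auto Scaling group name, cron schedules, and sizes are illustrative:
import boto3

autoscaling = boto3.client('autoscaling')

# Add capacity ahead of a predictable weekday morning spike, scale back in the evening
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='web-asg',
    ScheduledActionName='weekday-morning-scale-up',
    Recurrence='0 8 * * 1-5',  # cron, UTC
    MinSize=4, MaxSize=12, DesiredCapacity=6,
)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='web-asg',
    ScheduledActionName='weekday-evening-scale-down',
    Recurrence='0 20 * * 1-5',
    MinSize=2, MaxSize=12, DesiredCapacity=2,
)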
For GCP, use the Recommender:
gcloud recommender recommendations list \
--project=your-project \
--location=us-central1 \
--recommender=google.compute.instance.MachineTypeRecommender \
--format="table(name,primaryImpact.costProjection.cost.units,recommenderSubtype)"
Commitment-based discounts are free money if you have predictable workloads. Cloud providers give you massive discounts (30-70%) in exchange for committing to a certain level of usage for 1-3 years.
AWS offers two mechanisms:
- Savings Plans: Commit to spending $X/hour on compute. More flexible than RIs because they apply across instance families, regions, and even to Lambda/Fargate. If you commit to $100/hour and use $120, you pay on-demand for the extra $20.
- Reserved Instances: Capacity reservations for specific instance types in specific regions. Less flexible but slightly higher discounts in some cases.
The trick is finding the right commitment level. Go too high and you’re locked in; go too low and you miss savings. Look at your minimum baseline usage over the last 6 months:
aws ce get-cost-and-usage \
--time-period Start=2024-04-01,End=2024-10-01 \
--granularity DAILY \
--metrics UnblendedCost \
--filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Compute Cloud - Compute"]}}'
Find the 10th percentile - that’s your safe commitment level. If your 10th percentile hourly spend is $80, buy a Savings Plan for $70-75/hour. You’ll cover 85-90% of usage with discounts and pay on-demand for spikes.
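A rough boto3 sketch of that calculation - it pulls the same daily costs as the CLI above and takes the 10th percentile:
import boto3
import statistics

ce = boto3.client('ce')

results, token = [], None
while True:
    kwargs = dict(
        TimePeriod={'Start': '2024-04-01', 'End': '2024-10-01'},
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        Filter={'Dimensions': {'Key': 'SERVICE',
                               'Values': ['Amazon Elastic Compute Cloud - Compute']}},
    )
    if token:
        kwargs['NextPageToken'] = token
    resp = ce.get_cost_and_usage(**kwargs)
    results.extend(resp['ResultsByTime'])
    token = resp.get('NextPageToken')
    if not token:
        break

daily = sorted(float(r['Total']['UnblendedCost']['Amount']) for r in results)
p10 = statistics.quantiles(daily, n=10)[0]  # 10th percentile of daily spend
print(f"10th percentile: ${p10:,.2f}/day (~${p10 / 24:,.2f}/hour baseline)")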
GCP’s equivalent is Committed Use Discounts (CUDs). These work a bit differently - you commit to a specific amount of vCPU and memory for 1 or 3 years:
gcloud compute commitments create my-commitment \
--region=us-central1 \
--resources=vcpu=100,memory=400GB \
--plan=12-month
Scheduling non-production environments is an easy optimization win. Not every environment needs to run 24/7. If you shut them down from 10pm to 6am weekdays and all weekend, you’re running 80 hours/week instead of 168 - that’s a 50% cost reduction.
Set up a simple Lambda (AWS) or Cloud Function (GCP) that runs on a schedule:
import boto3
from datetime import datetime
ec2 = boto3.client('ec2')
def lambda_handler(event, context):
# Stop instances tagged with Environment=dev at 7pm
if datetime.now().hour == 19:
instances = ec2.describe_instances(
Filters=[{'Name': 'tag:Environment', 'Values': ['dev', 'staging']}]
)
instance_ids = [i['InstanceId'] for r in instances['Reservations'] for i in r['Instances']]
if instance_ids:
ec2.stop_instances(InstanceIds=instance_ids)
print(f"Stopped {len(instance_ids)} instances")
    # Start them back up at 7am
    elif datetime.now().hour == 7:
        instances = ec2.describe_instances(
            Filters=[
                {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
                {'Name': 'instance-state-name', 'Values': ['stopped']}
            ]
        )
        instance_ids = [i['InstanceId'] for r in instances['Reservations'] for i in r['Instances']]
        if instance_ids:
            ec2.start_instances(InstanceIds=instance_ids)
            print(f"Started {len(instance_ids)} instances")
Trigger this with EventBridge (AWS) or Cloud Scheduler (GCP) on a cron schedule.
Attached Resources: When shutting down applications, don’t forget about the related resources. That RDS database costs the same whether the application is running or not. You need to stop the database too, or better yet, use Aurora Serverless which can auto-pause.
Phase 2: Serverless Migrations and Spot Instances
Phase 2 is about rethinking your compute architecture for cost efficiency.
Choosing the right compute model matters more than optimizing within the wrong one:
- Serverless (Lambda, Cloud Functions, Cloud Run): Best for spiky traffic, ad-hoc tasks, and workloads that can scale to zero. You pay per invocation, so low-volume or bursty patterns win. High-volume steady traffic loses - per compute unit, serverless costs more than EC2, and cold starts add latency.
- Containers (ECS, EKS, Cloud Run, GKE): Good middle ground. Better bin-packing than VMs, faster scaling than EC2, cheaper than serverless for sustained load. Requires more operational overhead than serverless.
- EC2/VMs: Most cost-effective for steady, predictable workloads - especially with Reserved Instances or Savings Plans. Worst option for variable traffic unless you’ve nailed autoscaling.
The break-even depends on your traffic pattern. A t3.small costs ~$15/month running 24/7. Lambda at 100k invocations/month with 100ms runtime costs $0.17. But at 10M invocations, Lambda costs $17 - and at 100M, you’re at $170/month for what a single EC2 instance could handle.
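A back-of-the-envelope sketch of that break-even, assuming 1 GB of Lambda memory and illustrative on-demand prices (plug in your own); it lands in the same ballpark as the figures above:
# Back-of-the-envelope Lambda vs. EC2 comparison.
# Prices are illustrative (1 GB memory, on-demand, no free tier) - plug in your own.
LAMBDA_GB_SECOND = 0.0000166667    # compute price per GB-second
LAMBDA_PER_MILLION_REQUESTS = 0.20
EC2_MONTHLY = 15.0                 # t3.small running 24/7

def lambda_monthly_cost(invocations, duration_ms, memory_gb=1.0):
    compute = invocations * (duration_ms / 1000) * memory_gb * LAMBDA_GB_SECOND
    requests = invocations / 1_000_000 * LAMBDA_PER_MILLION_REQUESTS
    return compute + requests

for n in (100_000, 10_000_000, 100_000_000):
    print(f"{n:>11,} invocations/month: ${lambda_monthly_cost(n, 100):8.2f} "
          f"(vs ~${EC2_MONTHLY:.2f} for the always-on instance)")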
Spot instances (AWS) and Preemptible VMs (GCP) offer 60-90% discounts in exchange for accepting that your instance can be terminated with 2 minutes notice. They’re perfect for fault-tolerant workloads: batch processing, CI/CD workers, data pipelines, rendering farms.
Just make sure to design for interruption. Your workload needs to handle being killed mid-execution - by checkpointing progress, using idempotent operations, or being stateless enough to restart cleanly.
Kubernetes cost optimization deserves special mention because EKS/GKE costs are notoriously hard to control.
The most common issue is over-requesting. Engineers copy-paste resource requests from examples or other services, and nobody revisits them. A pod requesting 2GB RAM but using 200MB wastes 1.8GB of schedulable capacity. Multiply across hundreds of pods, and you’re paying for nodes you don’t need.
Use the Vertical Pod Autoscaler (VPA) in recommendation mode to get sizing suggestions without automatic changes. Review the recommendations periodically and adjust requests manually - this avoids surprise restarts while still right-sizing over time. For workloads with variable load, Horizontal Pod Autoscaler (HPA) scales replicas based on CPU, memory, or custom metrics.
Node autoscaling is essential. Without it, you’re either over-provisioned (paying for idle nodes) or under-provisioned (pods stuck in Pending). Use Cluster Autoscaler or Karpenter (AWS).
Finally, consider deployment fragmentation. Every deployment carries overhead - sidecars, base memory, scheduling constraints. Ten microservices consume more resources than five services doing the same work. Sometimes consolidating makes sense, accepting the coupling trade-off for lower infrastructure cost.
3. Relational Databases (RDS, Aurora / Cloud SQL)
Databases are expensive and scary to optimize because nobody wants to be the person who caused downtime or data loss. But they’re also one of the biggest cost centers, so you can’t ignore them.
Phase 1: Instance Sizing and Storage Optimization
Database rightsizing is trickier than compute rightsizing because databases have multiple performance dimensions: CPU, memory, IOPS, network throughput, and storage. You need to look at all of them.
Start with CloudWatch metrics (AWS) or Cloud Monitoring (GCP) for your RDS or Cloud SQL instances. You’re looking for:
- CPU utilization <30% consistently → downsize instance type
- FreeableMemory >50% of total → downsize
- ReadIOPS and WriteIOPS way below provisioned → reduce IOPS tier
- Storage utilization <50% → shrink volume
Watch out for provisioned IOPS. Teams often over-provision because the AWS console defaults are high or because someone picked io1/io2 years ago and never revisited it. Check your actual ReadIOPS and WriteIOPS in CloudWatch - if you’re consistently using a fraction of what you’re paying for, switch to gp3 (which includes 3000 IOPS baseline) and save 50-80%.
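A minimal boto3 sketch that pulls those peaks for one instance over the last two weeks - the instance identifier is a placeholder:
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

def rds_peak(instance_id, metric, days=14):
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/RDS',
        MetricName=metric,
        Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': instance_id}],
        StartTime=datetime.utcnow() - timedelta(days=days),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=['Maximum'],
    )
    points = [p['Maximum'] for p in resp['Datapoints']]
    return max(points) if points else 0.0

instance = 'your-db-instance'  # placeholder
print('Peak CPU %:', rds_peak(instance, 'CPUUtilization'))
print('Peak read IOPS metric:', rds_peak(instance, 'ReadIOPS'))
print('Peak write IOPS metric:', rds_peak(instance, 'WriteIOPS'))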
Storage autoscaling is helpful but dangerous. RDS can automatically increase storage when you’re running low, which prevents outages but also means your costs can creep up without you noticing. Set a maximum storage threshold and alert when you hit 80% of it.
Backup retention is another area of waste. The default RDS backup retention is 7 days, but teams often extend it to 30 or 35 days “just in case.” Each snapshot costs money (roughly the same as the storage it backs up). If you’ve got a 500GB database with 35 days of backups, that’s potentially 17TB of snapshot storage.
Ask yourself: when was the last time you restored from a backup older than 7 days? Adjust retention to match reality.
Phase 2: Archiving and Read Replica Rationalization
Phase 2 for databases is about moving data around and questioning architectural assumptions.
Big tables with historical data are prime candidates for offloading. If you’ve got a transactions table with 5 years of data but only query the last 6 months regularly, move the old data to S3. For RDS PostgreSQL, use the aws_s3 extension; Aurora has native S3 export. Once archived, query it with Athena when needed - at a fraction of the cost of keeping it in RDS.
Read replicas are often overused. Teams create them “for scaling” but then barely use them. If you’ve got 3 read replicas sized identically to your primary, that’s 3x the cost for capacity you might not need.
Audit replica usage by checking connection counts and query volume. If a replica averages <5 active connections and low query volume, you probably don’t need it. Or you could downsize it - read replicas don’t have to be the same size as the primary.
Aurora Serverless (AWS) or Cloud SQL with auto-scaling (GCP) is worth considering for variable workloads. Aurora Serverless automatically scales database capacity up and down based on load, and can even pause during inactivity. For a staging database that’s only used during work hours, this can cut costs by 70-80%.
4. NoSQL (DynamoDB / Firestore, Bigtable)
NoSQL databases have a different cost model than relational databases - you’re paying for throughput (reads/writes per second) and storage separately, which means different optimization strategies.
Phase 1: Capacity Modes and Index Optimization
DynamoDB offers two capacity modes: On-Demand and Provisioned. On-Demand is simpler (you pay per request) but more expensive. Provisioned is cheaper if you can predict your traffic, but you risk throttling if you exceed capacity.
Rule of thumb: predictable, consistent traffic → Provisioned with Auto Scaling. Spiky or unpredictable → On-Demand.
How to decide:
# Get read/write request metrics for the last month
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ConsumedReadCapacityUnits \
--dimensions Name=TableName,Value=your-table \
--start-time 2024-09-01T00:00:00Z \
--end-time 2024-10-01T00:00:00Z \
--period 3600 \
--statistics Average,Maximum
If your max is <2x your average, your traffic is predictable. Use Provisioned. If your max is >5x your average, you’ve got spikes - use On-Demand or Provisioned with aggressive Auto Scaling.
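A small boto3 sketch that mirrors the CLI call above and computes the peak-to-average ratio for you - the table name is a placeholder:
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

resp = cloudwatch.get_metric_statistics(
    Namespace='AWS/DynamoDB',
    MetricName='ConsumedReadCapacityUnits',
    Dimensions=[{'Name': 'TableName', 'Value': 'your-table'}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(days=30),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=['Average', 'Maximum'],
)

points = resp['Datapoints']
if points:
    avg = sum(p['Average'] for p in points) / len(points)
    peak = max(p['Maximum'] for p in points)
    print(f"avg: {avg:.1f}, peak: {peak:.1f}, peak/avg ratio: {peak / avg:.1f}x")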
Global Secondary Indexes (GSIs) are a common source of waste. Each GSI consumes its own read/write capacity and storage. Teams create GSIs for “future query patterns we might need” and then never use them.
List all your GSIs and check their usage:
aws dynamodb describe-table --table-name your-table \
--query 'Table.GlobalSecondaryIndexes[].IndexName'
Then check CloudWatch for ConsumedReadCapacityUnits on each index. If an index has zero reads for 30 days, delete it.
Time to Live (TTL) is free automatic deletion. If you’re storing session data, temporary cache entries, or anything with a natural expiration, use TTL instead of manually deleting items:
resource "aws_dynamodb_table" "sessions" {
# ... other config
ttl {
attribute_name = "expire_at"
enabled = true
}
}
For Firestore (GCP), the main cost driver is document reads. Every query that scans multiple documents bills for each document read, even if you don’t return them. The optimization is to structure your data so queries are efficient:
- Use composite indexes to avoid full collection scans
- Denormalize data to reduce the number of reads per query
- Cache query results on the client side
Bigtable costs are more straightforward: you pay for nodes (compute) and storage. The optimization is similar to relational databases - rightsize your cluster based on actual CPU usage and storage needs.
Phase 2: TTLs and Access Pattern Reshaping
Phase 2 for NoSQL is about challenging your data model and access patterns.
If you’re storing data in DynamoDB that doesn’t need single-digit millisecond latency, you’re overpaying. DynamoDB has native export to S3 - use it to archive cold data:
aws dynamodb export-table-to-point-in-time \
--table-arn arn:aws:dynamodb:us-east-1:123456789012:table/YourTable \
--s3-bucket your-archive-bucket \
--export-format DYNAMODB_JSON
The math: DynamoDB charges $0.25/GB-month, S3 Standard-IA is $0.0125/GB-month (20x cheaper), and Glacier Deep Archive is $0.001/GB-month (250x cheaper). 4TB in DynamoDB costs $1,000/month; the same data in S3-IA costs $50, or $4 in Glacier. If you ever need to query the archived data, load it back or use Athena for one-off analysis.
Hot partition problems in DynamoDB occur when one partition key gets disproportionate traffic. This forces you to over-provision capacity for the whole table. The solution is to reshape your partition key:
Instead of user_id as partition key (which means one user’s spike affects provisioning), use user_id#date or add a suffix like user_id#random(1-10). This distributes load across partitions.
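A tiny sketch of the write-sharding idea - the shard count is arbitrary, and readers have to fan out across all shards and merge:
import random

SHARDS = 10  # number of suffixes to spread one hot partition key across

def write_key(user_id: str) -> str:
    # Writes scatter across user_id#1 .. user_id#10
    return f"{user_id}#{random.randint(1, SHARDS)}"

def read_keys(user_id: str) -> list:
    # Readers query every shard and merge the results
    return [f"{user_id}#{n}" for n in range(1, SHARDS + 1)]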
5. Block Storage & Snapshots (EBS / Persistent Disks)
Block storage is the zombie apocalypse of cloud costs. Volumes get created, attached, detached, and forgotten. Snapshots accumulate. Nobody cleans up.
Phase 1: Orphan Cleanup and Snapshot Management
Unattached volumes (status=available) cost money for nothing. Find them and clean them up as appropriate:
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].[VolumeId,Size,VolumeType,CreateTime]' \
--output table
Snapshot sprawl is worse. It’s easy to set up automated snapshots without retention policies, accumulating years of cruft. Use AWS Data Lifecycle Manager to enforce retention (e.g., 7 daily + 4 weekly, then delete).
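To see how bad the sprawl already is, a boto3 sketch that surfaces the oldest snapshots you own - the 90-day cutoff is arbitrary:
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

old_snapshots = []
for page in ec2.get_paginator('describe_snapshots').paginate(OwnerIds=['self']):
    for snap in page['Snapshots']:
        if snap['StartTime'] < cutoff:
            old_snapshots.append((snap['StartTime'].date(), snap['VolumeSize'], snap['SnapshotId']))

print(f"{len(old_snapshots)} snapshots older than 90 days")
for created, size_gb, snapshot_id in sorted(old_snapshots)[:20]:
    print(f"{snapshot_id}  {size_gb:>5} GB  created {created}")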
Volume type optimization is about matching performance to need. io2 volumes (provisioned IOPS SSD) cost 4-5x more than gp3 (general purpose). If you provisioned io2 for a database that actually doesn’t need high IOPS, you’re just wasting money.
Check IOPS usage:
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeReadOps \
--dimensions Name=VolumeId,Value=vol-xxxxx \
--start-time 2024-09-01T00:00:00Z \
--end-time 2024-10-01T00:00:00Z \
--period 3600 \
--statistics Average,Maximum
If your max IOPS is <3,000, switch to gp3 and configure exactly the IOPS you need.
Phase 2: Moving Cold Data to Object Storage
Phase 2 is about questioning why data lives on expensive block storage.
EBS volumes cost $0.08-0.125/GB-month. S3 Standard costs $0.023/GB-month. If you’ve got data on EBS that’s rarely accessed, you’re paying 4-5x more than necessary.
Common candidates for migration:
- Log files stored on disk instead of shipped to S3/CloudWatch
- Build artifacts kept locally instead of pushed to S3
- Database backups stored on attached volumes instead of S3
Script to identify large files that haven’t been accessed recently:
# On the instance, find files >1GB not accessed in 30 days
find /data -type f -size +1G -atime +30 -exec ls -lh {} \;
Consider moving them to S3 and updating your application to write to and read from S3 when needed.
6. Networking & Data Transfer
Data transfer is the silent killer. It’s not sexy, it’s hard to debug, and it can cost more than your compute if you’re not careful.
Phase 1: CDNs and Co-location
The first rule of data transfer: keep it local. Cross-region and internet egress is expensive:
- Same availability zone: free
- Same region, different AZ: $0.01/GB
- Cross-region: $0.02-0.09/GB depending on regions
- Internet egress: $0.05-0.09/GB
Find where your data transfer costs are coming from:
aws ce get-cost-and-usage \
--time-period Start=2024-09-01,End=2024-10-01 \
--granularity MONTHLY \
--metrics UnblendedCost \
--group-by Type=DIMENSION,Key=USAGE_TYPE \
--filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Compute Cloud - Data Transfer"]}}'
For internet egress, a CDN is your best lever. CloudFront or Cloud CDN caches content at edge locations, reducing origin fetches. Cache aggressively - long TTLs for static assets, invalidate only when necessary.
VPC peering and Private Link (AWS) or Private Service Connect (GCP) keep traffic off the public internet. If services in different VPCs are talking over the internet, you’re paying egress for no reason.
Phase 2: Architecture Patterns That Bleed Money
Some architectural decisions can turn into cost disasters:
Cross-region replication for everything. Yes, you want disaster recovery. No, you probably don’t need real-time replication of all of your databases to 5 regions. Consider replicating only critical data, and use asynchronous replication where possible.
Chatty microservices. If your microservices make 50 API calls to each other per user request, and they’re in different regions, you’re hemorrhaging data transfer costs. Consolidate, use caching, or batch requests.
Serving large files from EC2. If users are downloading multi-GB files directly from your EC2 instances, you’re paying egress. Put them in S3, front it with CloudFront, and let the CDN handle distribution.
7. Caching (ElastiCache, DAX / Memorystore)
Caching is a cost amplifier - done right, it reduces database load and saves money. Done wrong, it’s an expensive layer that doesn’t help.
Phase 1: Do You Even Need a Cache?
Before optimizing your cache, ask whether you should have one.
If your cache is approaching your database size, you’re essentially paying for two copies of your data - one slow and cheap (RDS), one fast and expensive (ElastiCache). At that point, consider whether a faster primary store makes more sense. DynamoDB or another NoSQL option might cost less than RDS + massive cache, while eliminating cache invalidation complexity entirely.
Another common antipattern: caching to mask slow queries. If the cache was added because “the database was too slow,” dig into why. Missing indexes, N+1 queries, full table scans, unnecessary joins - these are fixable.
If you do need a cache, do a hit frequency analysis. Access patterns usually follow the 80/20 rule: 20% of your keys are responsible for 80% of hits. You don’t need to cache everything - you need to cache the hot set. A smaller cache holding only frequently-accessed data will have a higher hit rate than a huge cache holding everything. Run a heatmap analysis on your key access patterns before sizing.
Phase 2: Sizing and Tuning
Once you’ve established the cache is necessary, right-size it. Check DatabaseMemoryUsagePercentage in CloudWatch - if it’s consistently below 60%, downsize. Most teams overprovision because “cache misses are bad,” but paying for 64GB when you’re using 20GB is just waste.
Eviction policy matters more than size. allkeys-lru (evict least recently used when full) is usually what you want. noeviction means your cache fills up with stale data and stops accepting new entries - you’re paying for a cache that can’t cache.
A low hit rate doesn’t automatically mean the cache is worthless - 50% hit rate on expensive queries might still pay off. But 30% hit rate on cheap operations means you’re paying more for the cache than you’re saving on the database.
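Measuring the hit rate is straightforward - Redis keeps the counters for you. A sketch using redis-py; the endpoint is a placeholder for your ElastiCache or Memorystore instance:
import redis

# Endpoint is a placeholder - point it at your cache cluster
r = redis.Redis(host='your-cache-endpoint', port=6379)

stats = r.info('stats')
hits, misses = stats['keyspace_hits'], stats['keyspace_misses']
total = hits + misses
if total:
    print(f"hit rate: {hits / total:.1%} ({hits:,} hits, {misses:,} misses)")
else:
    print("no cache traffic recorded yet")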
Common reasons for low hit rates:
- TTLs too short (data expires before reuse)
- Key space too large (caching everything instead of the hot set)
- Access patterns too random (no temporal locality)
Part 3: Governance
Cost optimization is a continuous practice, not a one-time project. Without governance, costs creep back up as teams ship features and accumulate data. A few mechanisms that work:
Per-team cost visibility. Generate weekly or monthly cost-change reports broken down by team. Publish them to Slack or a dashboard everyone can see. When teams see their own spend trending up, they self-correct. When costs are invisible or aggregated, nobody owns them.
Anomaly detection. Default budget alerts (“you’ve hit 80% of budget”) are noisy and get ignored. Alert on week-over-week changes instead: “Team X spent $2,000 more this week than last week.” Include context: “S3 costs increased 40% because bucket Y grew from 5TB to 12TB.” AWS Cost Anomaly Detection and GCP budget alerts support custom thresholds.
Policy as code. Enforce cost policies automatically. Cloud Custodian can tag-or-terminate untagged resources, delete orphaned volumes, or flag oversized instances:
policies:
- name: require-tags
resource: ec2
filters:
- 'tag:Environment': absent
actions:
- type: mark-for-op
op: terminate
days: 7
Guardrails in Terraform modules. Bake cost constraints into your shared infrastructure modules. If teams use a standardized EC2 module, add validation rules that restrict instance types:
variable "instance_type" {
type = string
validation {
condition = can(regex("^(t3|t4g|m6i)", var.instance_type))
error_message = "Only t3, t4g, or m6i instances allowed."
}
}
This prevents expensive instance sprawl at the source - no one can accidentally spin up an x2idn.24xlarge through your module.
Commitment reviews. Schedule quarterly reviews of your Savings Plans, Reserved Instances, or Committed Use Discounts. Architecture changes, and 3-year commitments made for workloads that no longer exist are expensive mistakes. Start with 1-year terms and partial coverage until your usage patterns stabilize.



