Kubernetes Cost Optimization: Right-Sizing with P95/P99 Metrics
How to cut cluster costs by 40–70% by replacing guessed resource configs with metric-driven right-sizing — without sacrificing reliability.
The average Kubernetes cluster runs at 10–20% utilization. The resources you provisioned but aren't using still appear on your cloud bill. For a medium-sized cluster that's often $100k+ per year in waste.
The fix isn't to buy less upfront and hope. It's to measure what your workloads actually consume, then set requests and limits that reflect reality — with just enough headroom to stay reliable. This is right-sizing, and it's the highest-ROI infrastructure optimization I've seen in practice.
Why Over-Provisioning Happens
The common anti-patterns:
- Setting generous requests "to be safe" without ever revisiting them
- Copying resource specs from a similar service without measuring
- Using the same config across dev, staging, and production
- Setting limits without understanding the actual consumption distribution
The result: pods are scheduled with large reservations they never use, cluster utilization stays low, you keep adding nodes, and costs compound.
Requests vs. Limits — What Each One Actually Does
Before touching any numbers, get this model clear:
Requests are what the scheduler uses to place the pod on a node: the requested amount is subtracted from the node's allocatable capacity whether the container uses it or not. Requests are also the baseline for HPA utilization targets and VPA recommendations.
Limits are the hard cap. For CPU: the container is throttled when it hits the limit. For memory: it's OOMKilled.
```yaml
resources:
  requests:
    cpu: "100m"      # scheduler uses this to pick a node
    memory: "128Mi"
  limits:
    cpu: "300m"      # throttled if exceeded
    memory: "256Mi"  # OOMKilled if exceeded
```

The QoS class your pod gets depends on how you set these:
| QoS Class | Requests | Limits | Eviction Priority |
|---|---|---|---|
| Guaranteed | Set | Equal to requests | Last evicted |
| Burstable | Set | Greater than requests | Middle |
| BestEffort | Not set | Not set | First evicted |
For production workloads: Guaranteed or Burstable. BestEffort for batch jobs where eviction is acceptable.
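As a minimal sketch of the Guaranteed case (container values are illustrative): setting requests and limits equal for every resource is what puts a container in the Guaranteed class.

```yaml
# Guaranteed QoS: requests and limits are set and equal for every resource
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"      # equal to the request
    memory: "512Mi"  # equal to the request
```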
The Right-Sizing Formula
Don't guess. Collect 7–14 days of real traffic data, then apply these rules:
| Resource | Request | Limit | Rationale |
|---|---|---|---|
| CPU | P50 + 20% | P95 + 30% | Requests cover median; limits handle spikes |
| Memory | P95 + 10% | P99 + 20% | Memory can't be throttled — OOM margin matters |
| Batch jobs | P99 + 10% | P99 + 20% | Predictable load, size conservatively |
| Databases | P95 + 50% | P99 + 100% | Critical — give extra headroom |
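As a worked example with hypothetical numbers: suppose a service measures CPU P50 = 80m and P95 = 250m, and memory P95 = 400Mi and P99 = 450Mi. Applying the table gives roughly:

```yaml
# Hypothetical service: CPU P50=80m, P95=250m; memory P95=400Mi, P99=450Mi
resources:
  requests:
    cpu: "100m"      # P50 + 20% = 96m, rounded up
    memory: "440Mi"  # P95 + 10%
  limits:
    cpu: "325m"      # P95 + 30%
    memory: "540Mi"  # P99 + 20%
```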
Getting the Metrics
Start with Prometheus. These queries give you the percentile distribution you need:
```promql
# CPU usage percentiles over a 7-day window (subquery: 5m rate samples)
quantile_over_time(0.50, rate(container_cpu_usage_seconds_total[5m])[7d:5m])
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total[5m])[7d:5m])
quantile_over_time(0.99, rate(container_cpu_usage_seconds_total[5m])[7d:5m])

# CPU throttling — if this is high, your limits are too low
rate(container_cpu_cfs_throttled_periods_total[5m]) /
  rate(container_cpu_cfs_periods_total[5m]) * 100

# Memory usage percentiles (working set is a gauge, no rate() needed)
quantile_over_time(0.95, container_memory_working_set_bytes[7d])
quantile_over_time(0.99, container_memory_working_set_bytes[7d])

# OOM events — if this is non-zero, your limits are too low
increase(container_oom_events_total[1h])
```

Two signals to watch for before you start cutting:
- High throttling rate (>10%): current CPU limits are already too low. Fix this before optimizing down.
- OOMKill events: current memory limits are too low. Same — fix before cutting.
Finding over-provisioned pods:
```promql
# Pods where CPU usage is < 20% of their limit
# (container_spec_cpu_quota / container_spec_cpu_period is the CPU limit in cores)
(
  rate(container_cpu_usage_seconds_total[5m]) /
  (container_spec_cpu_quota / container_spec_cpu_period)
) * 100 < 20

# Pods where memory usage is < 30% of their limit
(
  container_memory_working_set_bytes /
  container_spec_memory_limit_bytes
) * 100 < 30
```

Roll Out Gradually
Don't touch production first. Phase it:
Weeks 1–2: Collect baseline metrics. Install Prometheus + Grafana if you don't have them:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
kubectl top nodes && kubectl top pods --all-namespaces
```

Week 3: Non-critical workloads and dev environments. Dev is often the biggest quick win — 90% waste is common.
```yaml
# Dev: aggressive cuts are safe
resources:
  requests:
    cpu: "50m"
    memory: "64Mi"
  limits:
    cpu: "200m"
    memory: "128Mi"
```

Week 4: Staging, using P95/P99 values.
Weeks 5–6: Production, with a conservative 25–50% safety buffer added on top of the formula values.
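As an illustration only, taking the hypothetical formula values from the worked example earlier and adding a 25% production buffer would look roughly like this:

```yaml
# Production: formula values plus a 25% safety buffer (illustrative numbers)
resources:
  requests:
    cpu: "125m"      # 100m formula value x 1.25
    memory: "550Mi"  # 440Mi formula value x 1.25
  limits:
    cpu: "410m"      # 325m formula value x 1.25, rounded
    memory: "675Mi"  # 540Mi formula value x 1.25
```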
After each wave: check OOMKills, CPU throttling, and pod restart counts before proceeding.
```bash
# Post-change monitoring
kubectl get events --field-selector reason=OOMKilling --all-namespaces
kubectl get pods --all-namespaces -o custom-columns=\
NAMESPACE:.metadata.namespace,\
NAME:.metadata.name,\
RESTARTS:.status.containerStatuses[*].restartCount
```

Automate with VPA
Once you've done the manual right-sizing pass, use Vertical Pod Autoscaler to maintain it:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Start with Off to get recommendations without auto-applying
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 1
          memory: 1Gi
        controlledResources: ["cpu", "memory"]
```

Start with updateMode: "Off" — VPA will write its recommendations to the object status so you can review them before they're applied. Move to "Auto" once you trust the recommendations.
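A quick way to review those recommendations (a sketch, assuming the VPA CRDs from the autoscaler project are installed and the object above exists):

```bash
# Show the recommendation section the VPA recommender writes to the object's status
kubectl describe vpa my-app-vpa
kubectl get vpa my-app-vpa -o jsonpath='{.status.recommendation.containerRecommendations}'
```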
Pair VPA (vertical) with HPA (horizontal) for workloads with variable traffic. One caveat: don't let both controllers act on the same signal. If VPA runs in "Auto" mode on CPU and memory, drive the HPA from custom or external metrics instead; the resource-utilization HPA below is safe while VPA stays in recommendation mode:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown before scaling down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

Set Guardrails with Resource Quotas
Prevent new services from over-provisioning by default with namespace-level guardrails:
```yaml
# Cap total resource consumption per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
# Give containers sensible defaults when nothing is specified
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - default:
        cpu: "200m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container
```

The LimitRange is particularly important: without it, a container with no resource spec gets BestEffort QoS and becomes the first candidate for eviction under pressure.
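To see how much of your cluster is already running without any resource spec, one quick check (assuming kubectl access to the cluster) is to list pods by QoS class:

```bash
# List pods whose QoS class is BestEffort (no requests or limits set anywhere)
kubectl get pods --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass \
  | grep BestEffort
```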
Alerting on Waste and Risk
Add these Prometheus recording rules and alerts. The waste ratio alerts catch over-provisioning proactively; the OOMKill and throttling alerts catch when you've cut too far:
```yaml
groups:
  - name: cost-optimization
    rules:
      - record: kubernetes:pod_cpu_waste_ratio
        expr: |
          (
            (container_spec_cpu_quota / container_spec_cpu_period) -
            rate(container_cpu_usage_seconds_total[5m])
          ) / (container_spec_cpu_quota / container_spec_cpu_period)
      - record: kubernetes:pod_memory_waste_ratio
        expr: |
          (
            container_spec_memory_limit_bytes -
            container_memory_working_set_bytes
          ) / container_spec_memory_limit_bytes
  - name: cost-alerts
    rules:
      - alert: HighResourceWaste
        expr: kubernetes:pod_cpu_waste_ratio > 0.7
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} is wasting >70% of CPU allocation"
      - alert: FrequentOOMKills
        expr: increase(container_oom_events_total[1h]) > 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.pod }} has frequent OOM kills — increase memory limit"
      - alert: ExcessiveCPUThrottling
        expr: |
          (
            rate(container_cpu_cfs_throttled_periods_total[5m]) /
            rate(container_cpu_cfs_periods_total[5m])
          ) * 100 > 25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} throttled >25% — increase CPU limit"
```

Real-World Results
A case study from an e-commerce microservices platform:
| | Before | After |
|---|---|---|
| Cluster | 50 nodes (16 vCPU / 32 GB) | 30 nodes (8 vCPU / 16 GB) |
| Monthly cost | $12,000 | $4,800 |
| Avg utilization | 15% | 65% |
| Resource waste | 70% | 15% |
CPU changes per workload type:
| Workload | Before | After | Reduction |
|---|---|---|---|
| Web frontend | 1000m | 200m | 80% |
| API gateway | 2000m | 500m | 75% |
| Database | 4000m | 2000m | 50% |
| Background jobs | 500m | 100m | 80% |
Key learnings from that engagement:
- P95-based requests gave optimal scheduling without padding
- P99-based limits handled traffic spikes without throttling
- Memory right-sizing had higher ROI than CPU in this workload mix
- Dev environments had ~90% waste and were the fastest wins
The goal isn't to minimize resource allocations — it's to match allocations to actual consumption. Too low and you get OOMKills and throttling; too high and you pay for nothing. Measure first, then cut. P95/P99 metrics give you the data to do it without guessing.
Why This Matters Beyond One Company
Kubernetes cluster waste is a systemic problem across the US technology sector. The average cluster runs at 10–20% utilization — meaning organizations are paying for four to ten times the compute they actually use. At enterprise scale, this translates to millions of dollars per year in cloud spend with no corresponding business value. The US government's own cloud spending reports (OMB and GAO) have consistently identified resource overprovisioning as one of the top drivers of federal IT inefficiency; the same dynamic plays out across every industry running containerized workloads.
The right-sizing methodology documented here — P95/P99 metric collection, phased rollout from dev to production, VPA automation, and waste-ratio alerting — is not organization-specific. It is directly applicable to any Kubernetes workload, regardless of cloud provider, industry, or cluster size. The case study in this article achieved a 60% cost reduction ($7,200/month saved) on a mid-sized microservices platform; at the scale of US enterprise Kubernetes adoption, the aggregate recoverable waste from this class of optimization runs into the billions annually.
This is the kind of engineering work that compounds beyond the team that does it first: documented, reproducible, and immediately applicable by any platform or DevOps engineer reading this today.