
Rate Limiting

Stop attackers from running up your API bill and crashing your servers.

Tiny Summary

Rate limiting restricts how many requests a user/IP can make in a time window. It prevents abuse, protects infrastructure, and controls costs. Without it, one attacker (or buggy client) can take down your entire service.


The Problem

Without rate limiting:

A bot hits your API 10,000 times per second:

  • Your servers melt from load
  • Your database crashes
  • Legitimate users can't access the app
  • Your AWS bill explodes to $50,000
  • Your business is offline

With rate limiting:

Same bot hits your API:

  • First 100 requests each minute: ✅ Accepted
  • Everything beyond that: ❌ Rejected (429 Too Many Requests)
  • Your servers handle normal load
  • Legitimate users unaffected
  • AWS bill stays normal
  • Your business stays online

Real Attacks Prevented

Credential stuffing:

  • Attacker tries 1M username/password combos
  • Without rate limit: All attempts go through, accounts get compromised
  • With rate limit: 10 attempts per IP per minute, attack useless

API abuse:

  • Someone scrapes your entire product catalog
  • Without rate limit: They download everything, resell your data
  • With rate limit: Can only fetch 100 products/hour, scraping impractical

DDoS:

  • Botnet sends 100k requests per second
  • Without rate limit: Servers crash, site goes down
  • With rate limit: Excess requests rejected instantly, site stays up

Cost control:

  • Your mobile app has a bug, makes infinite API calls
  • Without rate limit: $10k AWS bill from one user's broken app
  • With rate limit: User hits limit, you get alerted, fix bug before damage

Rate Limiting Strategies

1. Fixed Window

Simple but has edge case issues.

Time windows: 10:00-10:01, 10:01-10:02, etc.
Limit: 100 requests per minute

10:00:00 - 10:00:59: 100 requests ✅
10:01:00 - 10:01:59: 100 requests ✅

Edge case problem:
10:00:30 - 10:00:59: 100 requests ✅
10:01:00 - 10:01:29: 100 requests ✅
= 200 requests in 60 seconds (double the limit!)
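The fixed-window counter above can be sketched in a few lines. This is a minimal in-memory version (names are illustrative); a production deployment would keep counts in shared storage such as Redis and expire old windows:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Fixed window: one counter per (key, window) pair, reset at each boundary."""

    def __init__(self, limit=100, window=60):
        self.limit = limit               # max requests per window
        self.window = window             # window length in seconds
        self.counts = defaultdict(int)   # (key, window index) -> request count

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False                 # over the limit for this window
        self.counts[bucket] += 1
        return True
```

The edge case shows up directly in this sketch: with a limit of 2, two requests just before a window boundary and two just after all pass, because each pair lands in a different window.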

2. Sliding Window

More accurate, prevents edge case.

Limit: 100 requests per 60-second window

Track timestamps of all requests.
At any moment, count requests in last 60 seconds.

10:00:30: Check requests since 09:59:30
10:01:00: Check requests since 10:00:00

No edge case — always enforces true rate.
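A sliding-window log can be sketched by keeping the request timestamps themselves (memory grows with the limit, which is why large deployments often approximate this with a weighted count over two fixed windows):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Sliding window log: count only requests in the last `window` seconds."""

    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.log = defaultdict(deque)    # key -> timestamps of accepted requests

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        q = self.log[key]
        while q and q[0] <= now - self.window:
            q.popleft()                  # drop requests that aged out of the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```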

3. Token Bucket

Most flexible, allows bursts.

Bucket holds 100 tokens, refills at 10 tokens/second.

Request comes in:
- Has token? → Accept, remove token
- No token? → Reject (429)

Allows bursts (use 50 tokens instantly if available)
Then throttles to refill rate (10/sec)
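A token-bucket sketch with the numbers from above (capacity 100, refill 10/sec). The `now` parameter exists only to make the example testable; a real limiter would just use the clock:

```python
import time

class TokenBucket:
    """Token bucket: capacity allows bursts, refill rate caps sustained throughput."""

    def __init__(self, capacity=100, refill_rate=10.0, now=None):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = float(capacity)    # start full: a burst is allowed immediately
        self.last = time.time() if now is None else now

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Refill for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```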

4. Leaky Bucket

Smooths out traffic, no bursts.

Requests go into queue (bucket).
Leak out at fixed rate (10/sec).

If bucket full → reject new requests

Guarantees smooth output rate.
Use when you need predictable load.
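The leaky bucket can be sketched as a meter: the level rises by one per request and drains at the fixed leak rate, and a full bucket rejects. (A queue-based variant would hold excess requests instead of rejecting them.)

```python
class LeakyBucket:
    """Leaky bucket (meter variant): drains at a fixed rate, rejects when full."""

    def __init__(self, capacity=100, leak_rate=10.0, now=0.0):
        self.capacity = capacity
        self.leak_rate = leak_rate    # requests drained per second
        self.level = 0.0
        self.last = now

    def allow(self, now):
        # Drain for the time elapsed since the previous request.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1.0 > self.capacity:
            return False              # bucket full: reject
        self.level += 1.0
        return True
```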

Implementation

Where to implement:

  • Application level (easiest to add, but limits are per-server unless counters live in shared storage such as Redis)
  • API Gateway (AWS API Gateway, Kong, Nginx)
  • CDN/Proxy (Cloudflare, Fastly — best for DDoS)

What to limit by:

  • IP address (prevent one attacker from spamming)
  • User ID (logged-in users, fair usage)
  • API key (third-party integrations)
  • Endpoint (expensive endpoints get lower limits)

Common limits:

  • Anonymous users: 10 requests/minute
  • Logged-in users: 100 requests/minute
  • Premium users: 1000 requests/minute
  • Admin endpoints: 5 requests/minute (more sensitive)
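Tier-based limits like these usually reduce to a lookup before the limiter runs. A minimal sketch (the tier names and the `premium` field are illustrative, not from any particular framework):

```python
# Per-minute limits mirroring the tiers above (illustrative values).
LIMITS = {"anonymous": 10, "logged_in": 100, "premium": 1000}

def limit_for(user):
    """Pick the per-minute request limit for a caller; `user` is None when anonymous."""
    if user is None:
        return LIMITS["anonymous"]
    return LIMITS["premium"] if user.get("premium") else LIMITS["logged_in"]
```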

Response to Rate Limit

Return 429 status:

HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1642345678

{
  "error": "Rate limit exceeded. Try again in 60 seconds."
}

Headers to include:

  • X-RateLimit-Limit: Total allowed requests
  • X-RateLimit-Remaining: Requests left in window
  • X-RateLimit-Reset: Timestamp when limit resets
  • Retry-After: Seconds until they can retry
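Assembling that response is mostly bookkeeping. A framework-agnostic sketch (the function name is illustrative; real frameworks wrap this in their own response objects):

```python
import json
import time

def rate_limit_response(limit, reset_ts, now=None):
    """Build a 429 response as (status, headers, body) with the headers listed above."""
    now = time.time() if now is None else now
    retry_after = max(0, int(reset_ts - now))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_ts)),
    }
    body = json.dumps(
        {"error": f"Rate limit exceeded. Try again in {retry_after} seconds."}
    )
    return 429, headers, body
```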

Layered Rate Limiting

Different limits for different severity.

Layer 1: Global (per IP)

  • 1000 requests/minute across all endpoints
  • Catches broad attacks

Layer 2: Per Endpoint

  • Login: 5 attempts/minute (prevent brute force)
  • Search: 100 requests/minute
  • Create order: 10 requests/minute
  • Profile view: 200 requests/minute

Layer 3: Per User

  • Free tier: 100 API calls/day
  • Pro tier: 10,000 API calls/day
  • Enterprise: Unlimited
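The three layers compose naturally: run the broadest check first and stop at the first rejection. A sketch using a minimal fixed-window counter for each layer (all names and limits here are illustrative):

```python
from collections import defaultdict

class WindowLimiter:
    """Minimal fixed-window counter used by each layer."""

    def __init__(self, limit, window=60):
        self.limit, self.window = limit, window
        self.counts = defaultdict(int)

    def allow(self, key, now):
        bucket = (key, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True

def check_layers(ip, user, endpoint, now, global_l, endpoint_l, user_l):
    """A request must pass the global, per-endpoint, and per-user layers, in order."""
    if not global_l.allow(ip, now):
        return False, "global"
    if not endpoint_l.allow((ip, endpoint), now):
        return False, "endpoint"
    if not user_l.allow(user, now):
        return False, "user"
    return True, None
```

Returning the name of the rejecting layer makes it easy to log which layer fired, which feeds directly into the monitoring described below.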

Real-World Example

Stripe API:

Rate limits:
- Default: 100 requests/second
- Bursts: Up to 1000 requests in short window
- Per endpoint: Some endpoints have lower limits
- Response headers show remaining quota

Twitter API:

Rate limits:
- 15 requests per 15-minute window (user timeline)
- 900 requests per 15-minute window (search)
- Different limits for different tiers (free vs. paid)

Monitoring and Alerting

Track these metrics:

  • Requests hitting rate limit (high % = too restrictive or under attack)
  • Top IPs by request volume (identify attackers or bugs)
  • Rejected requests per endpoint (tune limits)

Alert when:

  • More than ~10% of requests hit the rate limit (limits too strict, or you're under attack)
  • A single IP hits the limit repeatedly (likely an attacker, block at the firewall)
  • Legitimate users complain (limits too strict, raise them)

Common Mistakes

No rate limiting at all

"We're too small to be attacked" → Wrong. Bots scan all IPs; you'll be hit eventually.

Limits too strict

Legitimate users hit limits → bad experience → churn

Limits too loose

Attackers still get through → defeats the purpose

Not differentiating users

Treat premium customers same as anonymous users → premium customers annoyed

Not counting failed requests

If failed logins (bad password) don't count against the limit, an attacker gets unlimited password attempts. Always count failed logins toward the limit: that is exactly what stops brute force.


Key Insights

Rate limiting is mandatory for production apps. Without it, you're one attack away from a $50k AWS bill or total downtime.

Start conservative, monitor, adjust. Better to have limits too strict and raise them than too loose and get exploited.

Layer your limits: global (prevent DDoS), per-endpoint (protect sensitive operations), per-user (fair usage).
