
Circuit Breaker Pattern

Stop calling failing services before they take down your entire system.

Tiny Summary

Circuit Breaker automatically stops calling a failing service to prevent cascading failures. Like a house circuit breaker, it "trips" when detecting failures, gives the system time to recover, then carefully tests if it's safe to resume.


The Problem

Your app calls a payment processor. The processor goes down. What happens next?

Without Circuit Breaker:

  1. Request to payment processor times out (30 seconds)
  2. Your server waits, blocking a thread
  3. More requests come in, all wait
  4. All threads blocked, your server can't handle ANY requests
  5. Your entire app is down because ONE dependency failed

This is a cascading failure. One service kills everything.
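
To make that concrete, here is a rough sketch of the "no breaker" scenario, assuming an axios call to a hypothetical payment endpoint with a 30-second timeout:

// Sketch only: a handler with no breaker. When the processor is down,
// every request hangs here for the full 30 seconds, holding a connection
// (or a thread, in thread-per-request servers) the whole time.
const axios = require('axios');

async function chargeCustomer(payment) {
  const response = await axios.post('https://payments.example.com/charge', payment, {
    timeout: 30000  // only gives up after 30 seconds
  });
  return response.data;
}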


How Circuit Breakers Work

Three states: Closed (working), Open (failing), Half-Open (testing recovery).

Normal Operation (Closed)
  → Failures exceed threshold
    → Open (reject requests immediately)
      → Wait timeout period
        → Half-Open (try 1 request)
          → Success? → Closed
          → Failure? → Open

Closed (Normal):

  • Requests flow through normally
  • Track recent failures (did 50% of the last 100 requests fail? Trip!)

Open (Failing):

  • Immediately reject requests without calling service
  • Fail fast (return an error in milliseconds, not after a 30-second timeout)
  • Give failing service time to recover

Half-Open (Testing):

  • After timeout (e.g., 60 seconds), try ONE request
  • Success → back to Closed, resume normal traffic
  • Failure → back to Open, wait longer
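
Putting the three states together, here is a minimal sketch of the state machine in plain JavaScript. The class and option names are illustrative, not from any library:

// Illustrative circuit breaker state machine (not production-ready:
// a real implementation would limit Half-Open to a single probe request
// and require a minimum request volume before tripping).
class SimpleCircuitBreaker {
  constructor(action, { failureThreshold = 0.5, windowSize = 100, resetTimeoutMs = 60000 } = {}) {
    this.action = action;                 // the async function being protected
    this.failureThreshold = failureThreshold;
    this.windowSize = windowSize;
    this.resetTimeoutMs = resetTimeoutMs;
    this.results = [];                    // rolling window of success/failure outcomes
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  async call(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: failing fast');  // reject without calling the service
      }
      this.state = 'HALF_OPEN';                         // timeout elapsed: allow a test request
    }
    try {
      const result = await this.action(...args);
      if (this.state === 'HALF_OPEN') {
        this.state = 'CLOSED';                          // test request succeeded: resume traffic
        this.results = [];                              // start the failure window fresh
      }
      this.record(true);
      return result;
    } catch (err) {
      this.record(false);
      if (this.state === 'HALF_OPEN' || this.failureRate() >= this.failureThreshold) {
        this.state = 'OPEN';                            // trip: stop calling the service
        this.openedAt = Date.now();
      }
      throw err;
    }
  }

  record(success) {
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
  }

  failureRate() {
    if (this.results.length === 0) return 0;
    return this.results.filter(ok => !ok).length / this.results.length;
  }
}

A call site wraps the protected function once (new SimpleCircuitBreaker(callStripeAPI)) and then uses breaker.call(payment) everywhere it would have called the service directly.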

Real Example

Stripe payment processing:

10:00 AM - Circuit Closed
Requests: ✅✅✅✅✅ (all succeed)

10:15 AM - Stripe starts failing
Requests: ✅✅❌❌❌❌❌ (failure rate crosses 50%)
Circuit trips → OPEN

10:15-10:16 AM - Circuit Open
All payment requests instantly rejected with friendly error:
"Payment processing temporarily unavailable, try again in a moment"
(Your app stays up, users can browse catalog)

10:16 AM - Circuit tries Half-Open
Test request: ✅ Success!
Circuit → CLOSED, resume normal operation

The difference:

  • Without breaker: 30-second timeouts, all threads blocked, site down
  • With breaker: instant failures, users see a friendly error, site stays up

Configuration

Failure threshold: 50% of the last 100 requests failing → trip

Timeout period: 60 seconds before trying Half-Open

Success threshold: 1 successful request in Half-Open → close circuit

What counts as failure:

  • Timeouts (request took over 5 seconds)
  • 500 errors from service
  • Network errors
  • NOT 4xx errors (those are client errors, not service failures)
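
One way to encode that distinction is a small predicate the breaker consults before counting an error. This sketch assumes an axios-style error object that carries a network error code and, for HTTP errors, a response status:

// Illustrative: decide whether an error should count toward tripping the circuit.
function countsAsFailure(err) {
  if (err.code === 'ETIMEDOUT' || err.code === 'ECONNREFUSED') return true; // timeout / network error
  const status = err.response && err.response.status;  // assumes an axios-style error shape
  if (status >= 500) return true;                       // the service itself is broken
  if (status >= 400 && status < 500) return false;      // client error: the service is fine
  return true;                                          // unknown errors: assume the worst
}

(opossum, used in the Implementation section below, supports the same idea via an errorFilter option that marks errors which should not count as failures.)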

When to Use

Essential for:

  • External API calls (payment processors, email services, SMS)
  • Microservice communication (Service A calls Service B)
  • Database queries (if database struggles, fail fast)

Not needed for:

  • Reading from cache (already fast)
  • Local computations (no network involved)
  • Critical operations that MUST be attempted (use retries instead)

Graceful Degradation

Circuit breaker enables fallback behavior.

Payment processing down:

  • Don't: Block checkout completely
  • Do: Queue payment for later, email invoice, let user continue

Recommendation engine down:

  • Don't: Show error on homepage
  • Do: Show popular items instead of personalized recommendations

Email service down:

  • Don't: Fail user signup
  • Do: Create account, queue welcome email for later

Users don't notice the degradation — they get a working app with slightly reduced features.
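
As a sketch of the recommendation-engine case, using the SimpleCircuitBreaker sketch from earlier and hypothetical fetchPersonalizedRecs / getPopularItems helpers:

// Hypothetical fallback: personalized recommendations when the service is
// healthy, popular items when the circuit is open or the call fails.
const recBreaker = new SimpleCircuitBreaker(fetchPersonalizedRecs);

async function getHomepageItems(userId) {
  try {
    return await recBreaker.call(userId);   // protected call to the recommendation service
  } catch (err) {
    return getPopularItems();               // degraded, but the homepage still renders
  }
}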


Implementation

Popular libraries:

  • Node.js: opossum
  • Python: pybreaker
  • Java: Resilience4j, Hystrix
  • Go: gobreaker

Basic setup (Node.js):

const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(callStripeAPI, {
  timeout: 5000,                  // consider a call failed after 5 seconds
  errorThresholdPercentage: 50,   // trip at 50% errors
  resetTimeout: 60000             // try a half-open test after 60 seconds
});

breaker.on('open', () => {
  console.log('Circuit opened! Stripe is down.');
  alertTeam('Payment processing degraded');
});

breaker.on('close', () => {
  console.log('Circuit closed! Stripe recovered.');
});
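
And a sketch of actually calling it: fire() routes the call through the breaker, and a registered fallback runs when the call fails or the circuit is open. This assumes callStripeAPI takes a payment object and returns a promise:

// Fallback runs when the circuit is open or callStripeAPI rejects.
breaker.fallback(() => ({
  status: 'queued',
  message: 'Payment processing temporarily unavailable, try again in a moment'
}));

async function charge(payment) {
  // When the circuit is open this resolves immediately with the fallback
  // result instead of waiting on a dead payment processor.
  return breaker.fire(payment);
}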

Common Mistakes

Opening circuit too aggressively

Don't trip on 1 error. Allow some failures (networks are unreliable). Trip at sustained failure rates (50% over 100 requests).

Not monitoring circuit state

If your circuit is open, you need to know. Alert on state changes. Log to metrics.

No fallback behavior

Circuit opens, requests fail... then what? Always have a fallback: cached data, degraded features, queue for later.

Using retries AND circuit breaker without coordination

Retries fight circuit breakers. If the circuit is open, don't retry — accept the failure and use your fallback.
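
One sketch of coordinating them: retry transient failures with backoff, but skip retries entirely while the circuit is open. This assumes opossum's opened property, a breaker with no fallback registered (so failed calls reject), and a hypothetical queuePaymentForLater helper:

// Retries that respect the breaker instead of fighting it.
async function chargeWithRetry(payment, attempts = 3) {
  for (let attempt = 0; attempt < attempts; attempt++) {
    if (breaker.opened) break;              // circuit open: stop retrying, fall back immediately
    try {
      return await breaker.fire(payment);
    } catch (err) {
      if (attempt === attempts - 1) break;  // out of retries
      await new Promise(resolve => setTimeout(resolve, 200 * 2 ** attempt)); // exponential backoff
    }
  }
  return queuePaymentForLater(payment);     // hypothetical: accept the failure gracefully
}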


Key Insights

Circuit breakers prevent your dependencies from taking down your app. One failing service shouldn't kill everything.

Fail fast is better than slow timeouts. Users prefer a quick error over a 30-second loading spinner.

Monitor circuit state — if circuits are frequently open, your dependencies are unreliable. Fix or find alternatives.
