
Error Recovery & Graceful Degradation

The difference between 'AI-powered' and 'enterprise-ready' isn't accuracy—it's what happens when the AI fails. Graceful degradation means the system keeps working, just slightly less smart.

  • Graceful degradation = when Plan A fails, use Plan B (or C, or D) instead of crashing
  • Retry → alternate source → cache → simplified logic → safe default: a chain of fallbacks
  • 85% recovery rate means 85% of failures get handled without human intervention
  • The goal: reduce "it broke, call IT" to "it worked, but maybe slower"
  • Critical for AI agents, APIs, and any system where external services can fail

Real-world example

An AI CRM agent qualifies a lead, but the LLM API is down

The agent needs to extract intent from a conversation. Normally it calls GPT-4. Tonight the API returns a 503.

  • Plan A: Call GPT-4. Fails → 503.
  • Plan B: Retry with exponential backoff. Still 503.
  • Plan C: Fall back to cached response if we've seen similar input. Miss.
  • Plan D: Use a simpler rule-based classifier. Works. Lead gets qualified, just with less nuance.
  • Plan E: Mark for human review. Last resort, but the workflow doesn't block.
  • The user never sees an error. The lead still moves forward. The system degrades gracefully.
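The Plan A–E chain above can be sketched in a few dozen lines. This is a minimal illustration, not a production client: `call_gpt4` is stubbed to always fail with a 503, and the keyword rules in `rule_based_classifier` are invented for the example.

```python
import time

class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 from the LLM API."""

def call_gpt4(text):
    # Plan A: primary model call. Stubbed to simulate tonight's outage.
    raise ServiceUnavailable("503 Service Unavailable")

def retry_with_backoff(fn, text, attempts=3, base_delay=0.01):
    # Plan B: retry the primary call with exponential backoff.
    for attempt in range(attempts):
        try:
            return fn(text)
        except ServiceUnavailable:
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return None

CACHE = {}  # Plan C: previously seen inputs -> extracted intents

def rule_based_classifier(text):
    # Plan D: crude keyword rules. Less nuance than the LLM, no external calls.
    lowered = text.lower()
    if "pricing" in lowered or "quote" in lowered:
        return "purchase_intent"
    if "cancel" in lowered:
        return "churn_risk"
    return None

def extract_intent(text):
    result = retry_with_backoff(call_gpt4, text)  # Plans A + B
    if result is not None:
        return result
    if text in CACHE:                             # Plan C: cache hit?
        return CACHE[text]
    result = rule_based_classifier(text)          # Plan D
    if result is not None:
        return result
    return "needs_human_review"                   # Plan E: workflow doesn't block

print(extract_intent("Can you send pricing for 50 seats?"))
```

Note that every plan either returns a usable answer or hands off to the next one; no path raises an error to the caller.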

Graceful degradation means "do something useful" instead of "show an error page."

When a component fails (API, model, database), the system doesn't stop—it uses a fallback that's less ideal but still functional. Users get a result. The workflow continues. You fix the primary path later.

  1. Retry: Maybe it was a transient failure. Try again with backoff.
  2. Alternate source: Use a different API, model, or data source.
  3. Cache: Do we have a recent result we can reuse?
  4. Simplified logic: Strip features. Rule-based instead of ML. Slower but reliable.
  5. Safe default: Return something neutral (e.g. "needs review") instead of crashing.
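The chain generalizes: each step is just a callable, tried in order, with failures logged rather than swallowed. Here's one way to sketch that as a reusable runner, with backoff and jitter on the retry wrapper. The step names and the `flaky_primary`/`safe_default` functions are illustrative assumptions, not a real API.

```python
import logging
import random
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("fallbacks")

def with_retries(fn, attempts=3, base_delay=0.01):
    """Wrap fn so it retries with exponential backoff + jitter before giving up."""
    def wrapped(*args):
        for attempt in range(attempts):
            try:
                return fn(*args)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: let the chain move to the next step
                time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    return wrapped

def run_chain(steps, *args):
    """Try each (name, fn) step in order; log failures, return the first success."""
    for name, fn in steps:
        try:
            return name, fn(*args)
        except Exception as exc:
            log.warning("step %r failed: %s", name, exc)  # log, don't swallow
    raise RuntimeError("all fallbacks exhausted")

# Illustrative steps: a primary that is down tonight, and a safe default.
def flaky_primary(lead):
    raise TimeoutError("upstream down")

def safe_default(lead):
    return "needs review"

step, result = run_chain(
    [("primary", with_retries(flaky_primary)), ("default", safe_default)],
    "some lead",
)
```

Because each step logs its failure before the chain moves on, you keep the alerting signal ("primary failed N times tonight") even when users never see an error.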

Each step trades capability for reliability. The goal is to stay as high on the chain as possible.

LLMs are non-deterministic and APIs have rate limits and outages. An agent that assumes "the API always works" will fail in production. One that has fallbacks can achieve 85%+ recovery without human intervention.

Common mistakes

  • Assuming external services never fail—they do, and you can't control them
  • Silently swallowing errors instead of logging and alerting
  • Fallbacks that are worse than failing (e.g. returning wrong data)
  • Not testing failure paths—if you've never simulated a 503, you don't know if fallbacks work
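The last point is the cheapest to fix: inject the failure in a test and assert the fallback fires. A minimal sketch, assuming a hypothetical `qualify_lead` function and a test double that fails the way a real outage would (neither is a real library API):

```python
class FakeLLMClient:
    """Test double that always fails like an outage: raises on every call."""
    def complete(self, prompt):
        raise ConnectionError("503 Service Unavailable")

def qualify_lead(client, conversation):
    # Simplified two-step chain: primary call, then safe default.
    try:
        return client.complete(conversation)
    except ConnectionError:
        return "needs_human_review"  # safe default instead of crashing

def test_falls_back_on_503():
    # If this assertion fails, the fallback path is broken, and you'd
    # rather learn that here than during a real outage.
    assert qualify_lead(FakeLLMClient(), "hi, pricing?") == "needs_human_review"

test_falls_back_on_503()
print("fallback test passed")
```

The same pattern scales up: one fake per external dependency, one test per rung of the fallback chain.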
