LLM Fallback Is Infrastructure Now, Not a Nice-to-Have

The Pattern the Source Got Right

Sajal's piece walks through a production incident where an LLM provider started timing out and took down an entire AI feature. The first instinct was to check pods, deployments, business logic. Everything looked fine. The failure was upstream, at the model endpoint. This is the new normal: your AI feature dies not because your code broke, but because someone else's API is slow.

The architecture he landed on is worth stealing: a Kubernetes CronJob probes Bedrock every 60 seconds (one minute is the finest interval a CronJob schedule can express), writes the result to a ConfigMap, and the application pods read that ConfigMap to decide which provider to route to. If Bedrock's Claude Sonnet endpoint is degraded, traffic fails over to Anthropic's direct API. The ConfigMap becomes the source of truth for routing decisions, updated out-of-band by the health checker.

This is correct. The decision about which provider is healthy does not belong in your request path. It belongs in the orchestration layer, where you can change providers without redeploying services. If you are putting try/catch blocks around every LLM call and failing over inline, you have already lost. Every request has to wait out the primary's timeout before it tries the backup, so for as long as the outage lasts, every single call pays that penalty.

Why EKS Makes This Work

The source runs this stack on EKS, and that choice matters. Kubernetes gives you a place to run the health check loop that is not inside your application and not a separate service you have to manage. A CronJob is cattle, not a pet. If a run fails, Kubernetes retries it up to the Job's backoff limit, and the next scheduled run starts regardless. If you need to change the health check logic, you redeploy the job without touching the app.

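The ConfigMap pattern is the key primitive. Your application pods mount the ConfigMap as a volume or read it via the API. When the health checker updates the ConfigMap, pods that mount it see the new routing decision after the kubelet's next sync plus its cache propagation delay. With default settings that is on the order of a minute, not seconds: the kubelet sync frequency defaults to 60 seconds and can be lowered, for example to 10. Pods that read or watch the ConfigMap through the API see the change sooner. ConfigMaps consumed as environment variables or through subPath mounts do not update without a pod restart, so avoid both here. You are not passing state through a database, a cache, or a message queue. You are using the control plane Kubernetes already has.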
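A minimal sketch of the reading side, assuming the ConfigMap is mounted as a JSON file. The key name `active_provider`, the default provider, and the 5-second cache TTL are illustrative assumptions, not details from the source.

```python
import json
import time
from pathlib import Path

DEFAULT_PROVIDER = "bedrock"  # assumption: used until routing state is readable


class RoutingConfig:
    """Caches the active provider read from a mounted ConfigMap file."""

    def __init__(self, path, ttl_seconds=5.0, clock=time.monotonic):
        self._path = Path(path)
        self._ttl = ttl_seconds
        self._clock = clock
        self._loaded_at = None
        self._provider = DEFAULT_PROVIDER

    def active_provider(self):
        """Return the cached provider, re-reading the file once the TTL expires."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._provider = self._read()
            self._loaded_at = now
        return self._provider

    def _read(self):
        try:
            data = json.loads(self._path.read_text())
            return data["active_provider"]
        except (OSError, ValueError, KeyError, TypeError):
            # Missing or malformed file: keep the last known good value
            # rather than failing the request.
            return self._provider
```

Keeping the last known value on a bad read is a deliberate choice in this sketch: a half-written or missing file should not take the feature down.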

One gap in the source: he does not say whether the pods pick up the ConfigMap through a volume mount, a watch, or polling. If every pod polls the API server every 10 seconds, you are fine at 50 pods but wasteful at 500, where that is 50 requests per second for a value that rarely changes. The correct move at that scale is a single-replica Deployment that watches the ConfigMap and writes routing decisions to a shared in-memory store such as Redis or Valkey. The application pods read from that store. At low pod counts, a volume mount or a per-pod watch with a local cache is simpler and needs no extra component. The relay does bring back a cache you have to run, but it keeps the Kubernetes API server out of your hot path.

Where to Put the Health Check Logic

Sajal's health checker calls Bedrock with a test prompt and measures latency. If latency crosses a threshold or the call fails, the checker flips the ConfigMap to route traffic to Anthropic's API. This works, but it is a binary switch. You are either on Bedrock or you are on Anthropic. If both providers are degraded, you are out of options.
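The binary switch can be sketched as follows. The `invoke` callable stands in for whatever sends the test prompt, and the 5-second threshold and provider names are assumptions for illustration, not values from the source.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float


def probe(invoke: Callable[[], object], clock=time.monotonic) -> ProbeResult:
    """Run one test call and record whether it succeeded and how long it took."""
    start = clock()
    try:
        invoke()  # hypothetical: sends the test prompt to the primary provider
        ok = True
    except Exception:
        ok = False
    return ProbeResult(ok=ok, latency_s=clock() - start)


def choose_provider(
    result: ProbeResult,
    max_latency_s: float = 5.0,
    primary: str = "bedrock",
    fallback: str = "anthropic",
) -> str:
    """Binary switch: stay on the primary only if the probe was fast and clean."""
    healthy = result.ok and result.latency_s <= max_latency_s
    return primary if healthy else fallback
```

A single slow probe flips the switch in this version. In practice you would probably require a few consecutive bad probes before failing over, to avoid flapping.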

The next step is a weighted routing table. The health checker writes a JSON blob to the ConfigMap: {"bedrock_claude_sonnet": 0.7, "anthropic_claude_sonnet": 0.3}. Your application reads that blob and routes 70% of traffic to Bedrock, 30% to Anthropic. If Bedrock degrades further, the weights shift to 50/50, then 30/70, then 0/100. You are shifting load gradually instead of failing over all at once.
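A weighted pick over that blob might look like this. It is a sketch that uses the key names from the JSON example above; the function name and error handling are assumptions.

```python
import random


def pick_provider(weights, rng=random):
    """Pick one provider name at random, in proportion to its weight."""
    # Providers weighted to zero are fully drained and never selected.
    candidates = {name: w for name, w in weights.items() if w > 0}
    if not candidates:
        raise ValueError("no provider has a positive weight")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


routing_table = {"bedrock_claude_sonnet": 0.7, "anthropic_claude_sonnet": 0.3}
provider = pick_provider(routing_table)
```

`random.choices` normalizes the weights itself, so they do not have to sum to 1. The explicit error for an all-zero table forces the caller to decide what happens when every provider is drained.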

This requires two things: a health score function (latency percentiles, error rate, maybe cost per token if you are tracking that) and a router in your application that can interpret weights. The router is 50 lines of code. The health score function is harder because you need to decide what "degraded" means. Is it p99 latency over 5 seconds? Is it error rate over 2%? Is it three consecutive failures? You are encoding an SLA into code. Write it down, test it in staging, and be ready to tune it after the first production incident.
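One way to write the three candidate definitions down is as a single function over a window of recent probe samples. The thresholds mirror the questions above (p99 over 5 seconds, error rate over 2%, three consecutive failures). The `Sample` type, the nearest-rank percentile, and treating an empty window as degraded are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ok: bool
    latency_s: float


def p99(latencies):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(latencies)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def degraded(samples, max_p99_s=5.0, max_error_rate=0.02, max_consecutive_failures=3):
    """Return True if any one of the three SLA conditions is violated."""
    if not samples:
        return True  # assumption: no signal is treated as degraded

    error_rate = sum(not s.ok for s in samples) / len(samples)

    # Latency is judged on successful calls only; failures count as errors.
    ok_latencies = [s.latency_s for s in samples if s.ok]
    slow = bool(ok_latencies) and p99(ok_latencies) > max_p99_s

    # Count failures at the tail of the window (most recent samples last).
    trailing_failures = 0
    for s in reversed(samples):
        if s.ok:
            break
        trailing_failures += 1

    return (
        slow
        or error_rate > max_error_rate
        or trailing_failures >= max_consecutive_failures
    )
```

Whether the conditions should be combined with OR, as here, or folded into a continuous score that drives the routing weights is exactly the kind of decision to settle in staging.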

The Cost Trade You Are Making

Running two LLM providers in production means managing two sets of rate limits and two sets of credentials: IAM roles for Bedrock, API keys for Anthropic. Do not assume the per-token price is where the difference lies. Bedrock's on-demand prices for Claude have generally tracked Anthropic's list prices, so compare the current price sheets for your model and region before budgeting. What does change when you fail over to Anthropic's direct API is that you lose Bedrock's built-in logging, guardrails, and PrivateLink integration.

The source does not say whether he keeps both providers warm or leaves the fallback cold. With a hosted API there is no model for you to load, so "cold" here means your own client path. The first requests after failover run through connections that are not established, credentials nobody has exercised recently, and rate limits that may be sized for an idle account rather than production load. If you keep both warm, you are sending a small amount of traffic to the backup provider even when the primary is healthy, which costs money but gives you real signal that the backup works. I would run 5% of production traffic through the fallback path at all times. That is your fire drill. If the fallback breaks, you know before you need it.
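A sketch of the 5% fire drill, assuming each request carries a stable ID. Hashing the ID keeps the assignment deterministic, so retries of the same request stay on the same path. The function name and the bucket resolution are assumptions.

```python
import hashlib


def use_fallback_canary(request_id: str, canary_percent: float = 5.0) -> bool:
    """Deterministically send roughly canary_percent of request IDs to the fallback."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the ID to one of 10,000 buckets, i.e. 0.01% resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100
```

Compared with a random draw per request, the hash makes canary behavior reproducible: given a request ID from the logs, you can tell which path it took.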

What This Means for Agentforce Stacks

If you are deploying Agentforce with Data Cloud and you want Claude as your model backend, you have two paths. Path one: Agentforce calls Bedrock, Bedrock calls Claude. Path two: Agentforce calls Claude directly through Anthropic's API. (MCP, the Model Context Protocol, connects models to tools and data. It is not an inference path.) Path one gives you Bedrock's observability and guardrails. Path two gives you faster failover and no AWS middleman. You cannot have both unless you build the routing layer yourself, which is what the source describes.

Maple's multi-cloud orchestrator gives you this routing layer without writing Kubernetes CronJobs. You define the health check logic in a YAML file, and the orchestrator runs it. You define the routing weights, and the orchestrator applies them. If you are building this stack in-house, you are building Maple. If you are buying Maple, you are getting the routing layer and the health checks and the observability in one place, and you are not maintaining CronJobs.

Fallback is not a feature. It is the foundation.