We got paged at 2 AM on a Saturday because a payment processing service was silently eating errors. A third-party API had started returning 429 (rate limit) responses, but our code caught the exception, logged a generic “payment failed” message, and returned null to the caller. The caller treated null as “no payment found” and continued processing the order as if the customer had a free account. By the time we noticed, several hundred orders had been processed with zero charges. The post-mortem took longer than the fix.
The root cause wasn’t the rate limit. It was that our error handling made the failure invisible. The function signature said it returned Payment | null, but null meant three completely different things depending on where the error came from: no payment existed, the API was unreachable, or the request was malformed. Callers had no way to tell which. And because the error was caught and swallowed deep in the stack, nothing in our monitoring flagged it.
We’ve spent years refining how we handle errors after incidents like this. What follows is everything we’ve learned, not just the patterns but the reasoning behind them and the places where each one breaks down.
The core problem: invisible failures
TypeScript’s type system doesn’t track exceptions. A function signature can lie to you:
function parseConfig(raw: string): Config {
return JSON.parse(raw); // Can throw SyntaxError!
}
const config = parseConfig(input);
// TypeScript says config is Config, but we might never get here
The type says “this returns Config.” It doesn’t mention that it might explode and unwind your entire call stack. Callers have no indication they need to handle failure.
This isn’t a TypeScript bug. It’s a design choice inherited from JavaScript. Languages like Rust and Go went a different direction: Rust’s Result<T, E> type makes errors part of the return type, and Go’s convention of returning (value, error) tuples forces callers to acknowledge failures at every call site. Both have tradeoffs, but they share an important property: failures are visible in the function signature.
The most dangerous errors aren’t the ones that crash your program. They’re the ones that let it keep running with corrupted state.
TypeScript gives us enough type system power to approximate these patterns. We just have to build the infrastructure ourselves.
Pattern 1: Result types for expected failures
When failure is a normal part of the operation, not a bug but a valid outcome, we make it explicit in the return type:
type Result<T, E = Error> =
| { ok: true; value: T }
| { ok: false; error: E };
This is the simplest possible version, and for many codebases it’s all you need. The function signature now communicates that failure is possible, and TypeScript’s narrowing ensures callers handle both cases:
function parseConfig(raw: string): Result<Config, SyntaxError> {
try {
return { ok: true, value: JSON.parse(raw) };
} catch (e) {
return { ok: false, error: e as SyntaxError };
}
}
const result = parseConfig(input);
if (!result.ok) {
console.error('Failed to parse:', result.error.message);
return;
}
// TypeScript knows result.value is Config here
The failure mode is visible in the type signature. Code reviewers can see it. New engineers can see it. The compiler enforces it.
Making Results composable
The basic Result type works well for single operations, but real code chains things together. Parsing a config, then validating it, then using it to initialize a service. Each step can fail, and we don’t want nested if (!result.ok) checks at every level.
We add a small set of helpers that let us chain Results:
function mapResult<T, U, E>(
result: Result<T, E>,
fn: (value: T) => U
): Result<U, E> {
if (!result.ok) return result;
return { ok: true, value: fn(result.value) };
}
function flatMapResult<T, U, E>(
result: Result<T, E>,
fn: (value: T) => Result<U, E>
): Result<U, E> {
if (!result.ok) return result;
return fn(result.value);
}
function mapError<T, E, F>(
result: Result<T, E>,
fn: (error: E) => F
): Result<T, F> {
if (result.ok) return result;
return { ok: false, error: fn(result.error) };
}
mapResult transforms the success value without touching errors. Think “if this worked, do this next thing with the value.” flatMapResult chains operations that themselves return Results, so we don’t end up with Result<Result<T, E>, E>. mapError transforms error types when crossing module boundaries.
Here’s what a real pipeline looks like:
function loadAppConfig(path: string): Result<AppConfig, ConfigError> {
// readConfigFile wraps readFileSync in a try/catch and returns a
// Result<string, ConfigError>; calling readFileSync directly here
// would throw right past the pipeline.
const raw = readConfigFile(path);
// parseConfig's SyntaxError is lifted into ConfigError with mapError
// at the boundary; that translation is omitted here for brevity.
const parsed = flatMapResult(raw, parseConfig);
const validated = flatMapResult(parsed, validateConfig);
return validated;
}
Each step either succeeds and feeds into the next, or fails and short-circuits the rest. No nested if statements. No try-catch blocks. The error propagates cleanly through the chain.
For teams that want a more complete implementation with method chaining, the neverthrow library provides a battle-tested Result type with .map(), .andThen(), and other combinators. We’ve used it on larger projects where the ergonomics matter more, but the plain-function approach above works fine for most codebases and has zero dependencies.
The value of Result types isn’t just error handling. It’s that they make the “what can go wrong” question answerable by reading a type signature instead of reading the implementation.
Pattern 2: Discriminated union errors
A Result<T, Error> tells callers that something can fail, but not how it can fail. The “how” matters a lot. A network timeout needs a retry. A validation error needs user feedback. An authorization failure needs a redirect to login. Callers need enough information to make the right decision.
We define error types as discriminated unions, where each variant carries the specific data needed for that failure mode:
type PaymentError =
| { kind: 'validation'; field: string; message: string }
| { kind: 'gateway_timeout'; retryAfterMs: number }
| { kind: 'insufficient_funds'; available: number; required: number }
| { kind: 'card_declined'; declineCode: string }
| { kind: 'not_found'; paymentId: string };
The kind field is the discriminant. TypeScript narrows the type when we switch on it, so each branch has access to only the relevant data:
function handlePaymentResult(result: Result<Payment, PaymentError>): void {
if (result.ok) {
showConfirmation(result.value);
return;
}
switch (result.error.kind) {
case 'validation':
highlightField(result.error.field, result.error.message);
break;
case 'gateway_timeout':
scheduleRetry(result.error.retryAfterMs);
break;
case 'insufficient_funds':
showFundingPrompt(result.error.required - result.error.available);
break;
case 'card_declined':
showDeclineMessage(result.error.declineCode);
break;
case 'not_found':
redirectTo404(result.error.paymentId);
break;
}
}
This is way better than catching a generic Error and parsing its message string. The data is typed. The compiler checks exhaustiveness. Each error variant carries exactly the context its handler needs.
Exhaustiveness checking
We add a small helper to make sure we handle every variant:
function assertNever(value: never): never {
throw new Error(`Unhandled variant: ${JSON.stringify(value)}`);
}
// In the switch:
default:
assertNever(result.error);
If we add a new variant to PaymentError later, every switch statement that doesn’t handle it becomes a compile error. In large codebases where error types evolve over time, the compiler finds every callsite that needs updating. That’s worth a lot.
Crossing module boundaries
When errors cross module boundaries, we map them to the vocabulary of the calling layer. A database module shouldn’t leak PgDatabaseError details to an HTTP handler. The service layer translates instead:
function mapDbErrorToServiceError(err: DbError): OrderError {
switch (err.kind) {
case 'unique_violation':
return { kind: 'duplicate_order', orderId: err.constraintDetail };
case 'connection_failed':
return { kind: 'service_unavailable', retryAfterMs: 5000 };
case 'query_timeout':
return { kind: 'service_unavailable', retryAfterMs: 10000 };
}
}
Each layer has its own error vocabulary, and the mapping between them is explicit and testable.
Pattern 3: Choosing the right tool
We don’t use Result types everywhere. That would be tedious and obscure simple code. Here’s our decision framework:
| Failure type | Pattern | Example | Why |
|---|---|---|---|
| Programmer error (bug) | Throw exception | Array index out of bounds, invalid state | Should never happen in correct code. Crashing is safest. |
| Expected domain failure | Result with typed errors | Payment declined, validation failure | Callers need to decide based on what went wrong. |
| Expected infrastructure failure | Result with retry context | Network timeout, rate limit | Callers need to know if and when to retry. |
| Simple absence | T \| null | Cache miss, user lookup by ID | Not finding something is normal and needs no error detail. |
| Background/fire-and-forget | Log and continue | Analytics event failed, non-critical cache write | Failure doesn’t affect the user-facing operation. |
The question isn’t “should we handle errors?” It’s “how much information does the caller need to handle this failure correctly?” That determines which pattern fits.
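For the first row of the table, a small assertion helper makes the "crash on bugs" policy explicit at the places where we rely on an invariant. A minimal sketch; the invariant name and message format are our convention, not a standard API:

```typescript
// Throws for conditions that should be impossible in correct code.
// TypeScript's `asserts` return type narrows the checked value for callers.
function invariant(condition: unknown, message: string): asserts condition {
  if (!condition) {
    throw new Error(`Invariant violation: ${message}`);
  }
}

function applyDiscount(price: number, percent: number): number {
  // A percent outside [0, 100] is a programmer error, not a domain
  // failure: no caller decision can fix it, so crashing loudly beats
  // silently producing a corrupted price.
  invariant(percent >= 0 && percent <= 100, `percent out of range: ${percent}`);
  return price * (1 - percent / 100);
}
```

Note the contrast with Result: there is deliberately no way for the caller to "handle" an invariant violation, because the correct response is to fix the bug.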
Pattern 4: Error handling in async pipelines
Most real applications are async, and error handling in async code has its own pitfalls. The most common mistake we see is mixing try/catch with .catch() inconsistently, or worse, forgetting to await a promise and losing the error entirely.
Async Result functions
We extend the Result pattern to async operations:
type AsyncResult<T, E = Error> = Promise<Result<T, E>>;
async function fetchUser(id: string): AsyncResult<User, ApiError> {
try {
const response = await fetch(`/api/users/${id}`);
if (!response.ok) {
return {
ok: false,
error: { kind: 'http_error', status: response.status }
};
}
const data = await response.json();
return { ok: true, value: data as User };
} catch (e) {
return {
ok: false,
error: { kind: 'network_error', message: (e as Error).message }
};
}
}
The important thing: try/catch lives at the boundary where we call the underlying API. Once we’ve wrapped it in a Result, everything downstream is pure data flow. No exceptions to worry about.
Chaining async Results
Chaining async Results needs an async version of flatMapResult:
async function flatMapAsync<T, U, E>(
result: AsyncResult<T, E>,
fn: (value: T) => AsyncResult<U, E>
): AsyncResult<U, E> {
const resolved = await result;
if (!resolved.ok) return resolved;
return fn(resolved.value);
}
Now we can build async pipelines that short-circuit on the first error:
async function createOrder(input: OrderInput): AsyncResult<Order, OrderError> {
// flatMapAsync accepts the promise directly, so we chain without
// awaiting each intermediate step. fetchUser's ApiError is assumed to
// have been mapped into OrderError at the boundary with mapError.
const user = fetchUser(input.userId);
const inventory = flatMapAsync(user, (u) => checkInventory(u, input.items));
const payment = flatMapAsync(inventory, (inv) =>
processPayment(inv, input.paymentMethod)
);
return flatMapAsync(payment, (pmt) => saveOrder(input, pmt));
}
Each step only runs if the previous one succeeded. If checkInventory fails, we never attempt payment. The error propagates up with its full type information intact.
The unhandled promise trap
One pattern we enforce through linting: never call an async function without awaiting or explicitly handling the returned promise. The @typescript-eslint/no-floating-promises rule catches this, and we consider it non-negotiable:
// DANGEROUS: if sendNotification rejects, the error disappears
sendNotification(user, order);
// SAFE: explicitly detached with error handling
sendNotification(user, order).catch((err) => {
logger.warn('Notification failed', { err, userId: user.id });
});
// ALSO SAFE: awaited in the normal flow
await sendNotification(user, order);
Pattern 5: Error boundaries with resilience strategies
Simple error boundaries (catch and return a fallback) work for basic cases. But production systems need more: retries, circuit breakers, and graceful degradation.
Retry with exponential backoff
Transient failures like network blips or brief service restarts often succeed on a second attempt. We use a retry wrapper with exponential backoff and jitter:
interface RetryOptions {
maxAttempts: number;
baseDelayMs: number;
maxDelayMs: number;
shouldRetry: (error: unknown, attempt: number) => boolean;
}
async function withRetry<T>(
fn: () => Promise<T>,
options: RetryOptions
): Promise<T> {
let lastError: unknown;
for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error;
if (attempt === options.maxAttempts || !options.shouldRetry(error, attempt)) {
throw error;
}
const baseDelay = Math.min(
options.baseDelayMs * Math.pow(2, attempt - 1),
options.maxDelayMs
);
const jitter = Math.random() * baseDelay;
await new Promise((resolve) => setTimeout(resolve, baseDelay + jitter));
}
}
throw lastError;
}
The shouldRetry callback matters. Not all errors are retryable. A 400 Bad Request will fail every time, but a 503 Service Unavailable might work in a few seconds:
const result = await withRetry(
() => callPaymentGateway(request),
{
maxAttempts: 3,
baseDelayMs: 200,
maxDelayMs: 5000,
shouldRetry: (error) => {
if (error instanceof HttpError) {
return error.status >= 500 || error.status === 429;
}
return error instanceof TypeError && error.message.includes('fetch');
}
}
);
Circuit breaker
Retries help with transient failures, but if a downstream service is genuinely down, retrying every request just piles on load. A circuit breaker tracks failure rates and stops making calls when a threshold is hit:
type CircuitState = 'closed' | 'open' | 'half-open';
class CircuitBreaker {
private state: CircuitState = 'closed';
private failureCount = 0;
private lastFailureTime = 0;
constructor(
private readonly threshold: number,
private readonly resetTimeoutMs: number
) {}
async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
if (this.state === 'open') {
if (Date.now() - this.lastFailureTime > this.resetTimeoutMs) {
this.state = 'half-open';
} else {
return fallback();
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
return fallback();
}
}
private onSuccess(): void {
this.failureCount = 0;
this.state = 'closed';
}
private onFailure(): void {
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.failureCount >= this.threshold) {
this.state = 'open';
}
}
}
Usage:
const recommendationBreaker = new CircuitBreaker(5, 30_000);
async function getProductPage(productId: string): Promise<ProductPage> {
const product = await fetchProduct(productId); // Critical: let this throw
const recommendations = await recommendationBreaker.call(
() => fetchRecommendations(productId),
() => [] // Show no recommendations if the service is struggling
);
return { product, recommendations };
}
The breaker opens after 5 consecutive failures and stays open for 30 seconds. During that window, we skip the call entirely and use the fallback. After the timeout, a single request goes through (half-open state) to test if the service has recovered.
Graceful degradation
For features that matter, we define a degradation chain, a sequence of increasingly cheaper fallbacks:
async function getUserFeed(userId: string): Promise<FeedItem[]> {
// Try the personalized recommendation service. The fallback is a plain
// synchronous () => null so it matches the breaker's fallback: () => T
// signature; the explicit type argument widens T to allow null.
const personalized = await recommendationBreaker.call<FeedItem[] | null>(
() => fetchPersonalizedFeed(userId),
() => null
);
if (personalized !== null) return personalized;
// Fall back to cached popular content
const cached = await cacheBreaker.call<FeedItem[] | null>(
() => fetchCachedPopularFeed(),
() => null
);
if (cached !== null) return cached;
// Last resort: static curated content bundled with the app
return getStaticFallbackFeed();
}
The user always gets something. Quality degrades as services fail, but the application never shows a blank page. And because each level is explicit, we can monitor which one is active and alert when we’re running degraded.
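The chain above generalizes into a small helper that walks an ordered list of providers and returns the first non-null answer, with an infallible last resort. A sketch; the name firstAvailable is ours:

```typescript
// Tries each provider in order; a provider signals "degrade further" by
// returning null or by throwing. The lastResort function must be
// infallible (e.g. static content bundled with the app).
async function firstAvailable<T>(
  providers: Array<() => Promise<T | null>>,
  lastResort: () => T
): Promise<T> {
  for (const provider of providers) {
    try {
      const value = await provider();
      if (value !== null) return value;
    } catch {
      // A throwing provider is treated the same as a null result:
      // fall through to the next, cheaper level.
    }
  }
  return lastResort();
}
```

A degradation chain expressed this way is also easy to instrument: wrap each provider to record which level actually served the request, and alert when traffic shifts down the list.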
Pattern 6: Observability
Handling errors correctly is half the problem. The other half is making them debuggable when things go wrong at 2 AM. The difference between a 15-minute fix and a 4-hour investigation is almost always the quality of error context captured at the time of failure.
Structured error logging
Unstructured log lines like "Error: something went wrong" are useless at scale. We log errors as structured data with consistent fields:
interface ErrorLogEntry {
level: 'warn' | 'error' | 'fatal';
message: string;
errorKind: string;
correlationId: string;
service: string;
operation: string;
durationMs: number;
context: Record<string, unknown>;
stack?: string;
}
function logError(entry: ErrorLogEntry): void {
console.error(JSON.stringify(entry));
}
Every error we log includes:
- correlationId: a unique ID that follows the request across services, so one ID traces the entire lifecycle.
- operation: what we were trying to do, like “fetchUser” or “processPayment”.
- durationMs: a timeout after 30 seconds looks very different from a rejection after 2 milliseconds.
- context: relevant inputs and state, with sensitive data stripped.
Correlation IDs through the stack
We generate a correlation ID at the entry point (API request, queue message, cron job) and thread it through every function call:
import { randomUUID } from 'crypto';
interface RequestContext {
correlationId: string;
userId?: string;
traceStart: number;
}
function createContext(userId?: string): RequestContext {
return {
correlationId: randomUUID(),
userId,
traceStart: Date.now(),
};
}
async function handleCreateOrder(
input: OrderInput,
ctx: RequestContext
): AsyncResult<Order, OrderError> {
const user = await fetchUser(input.userId, ctx);
if (!user.ok) {
logError({
level: 'warn',
message: 'User lookup failed during order creation',
errorKind: user.error.kind,
correlationId: ctx.correlationId,
service: 'order-service',
operation: 'fetchUser',
durationMs: Date.now() - ctx.traceStart,
context: { userId: input.userId },
});
return user;
}
// ... continue pipeline
}
When we see an error in our logs, we search by correlationId and get every log entry from that request, across services, across retries, across fallbacks. This turns a mystery into a timeline.
What to include (and exclude) in error reports
We follow a simple rule for error context:
Include: operation name, entity IDs, timing, error codes, retry attempt number, which fallback was used, request path, queue message ID.
Exclude: passwords, tokens, full request bodies (which might contain PII), raw database queries (which might contain user data), internal IP addresses.
A useful gut check: “If this log entry appeared on a public dashboard, would it expose anything sensitive?” If yes, strip it.
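We encode that gut check as an allowlist-based scrubber applied to context objects before logging. A sketch; the field list is illustrative, and an allowlist (rather than a denylist) is deliberate, because new sensitive fields appear faster than anyone updates a blocklist:

```typescript
// Allowlist approach: only explicitly known-safe keys survive into logs.
// Unknown keys are replaced with a marker so their presence is still
// visible without leaking their values.
const SAFE_KEYS = new Set(['userId', 'orderId', 'operation', 'attempt', 'status']);

function scrubContext(
  context: Record<string, unknown>
): Record<string, unknown> {
  const scrubbed: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(context)) {
    scrubbed[key] = SAFE_KEYS.has(key) ? value : '[REDACTED]';
  }
  return scrubbed;
}
```

Calling scrubContext inside logError (rather than at each call site) means a forgotten call can't leak anything.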
What we learned
After applying these patterns across several production systems over a few years, some lessons stand out:
- Make failures visible in types. If callers need to handle an error, the type signature should show it. The payment incident we opened with happened because null was overloaded to mean three different things. Typed errors would have made the failure mode obvious.
- Errors are data, not interruptions. When we started treating errors as values flowing through pipelines rather than exceptions interrupting control flow, our code got much easier to reason about.
- Distinguish between “can’t reach the service” and “the service said no.” These need completely different responses. A timeout needs a retry. A validation rejection needs user feedback. A 403 needs an auth check. Collapsing them into a generic Error throws away the information callers need.
- Resilience is layered. Retry handles transient blips. Circuit breakers handle sustained outages. Fallbacks handle total dependency failure. Each layer addresses a different failure duration, and production systems need all three.
- Invest in observability before you need it. Structured logging, correlation IDs, and error context are boring to set up but they save you during incidents. Every hour we spent on logging infrastructure has come back tenfold.
- Be consistent within a codebase. Mixed approaches (some functions throw, some return Results, some return null, with no clear convention) are the worst possible outcome. Pick your patterns and document the decision. The specific patterns matter less than consistency.
The goal isn’t to eliminate errors. That’s impossible in distributed systems. The goal is to make failure modes explicit, give callers enough information to respond correctly, and make sure that when things go wrong, we can figure out why in minutes instead of hours.