engineering

API Mocking and Recording: Patterns for Testing Against Third-Party APIs

Tests that hit real third-party APIs are flaky and slow. Tests that mock them by hand drift from reality and miss real bugs. The patterns that work in between are fixture-based recording, contract testing against the vendor's spec, and a small set of disciplines that keep the mocks honest.

Anethoth

08 May 2026 — 4 min read

Every backend system that integrates with third-party APIs eventually has to answer the same question: how do you test the integration? Hitting the real API in every test run is slow, flaky, often expensive, and creates real side effects in real systems. Mocking the API by hand is fast but the mocks drift from reality and pass tests that fail in production. Neither extreme is the right answer, and the patterns that work in between are well-understood but rarely articulated as a coherent strategy.

This post covers the three patterns that account for almost all production use — fixture-based recording, contract testing against the vendor's spec, and selective end-to-end runs — and the disciplines that keep the test layer honest. The patterns apply across our four products: DocuMint integrates with Stripe for billing, CronPing integrates with webhook destinations for alerts, FlagBit integrates with Stripe for billing, and WebhookVault forwards captured webhooks to customer-specified URLs.

Pattern 1: Fixture-based recording

The fixture-recording pattern records real responses from a third-party API once, saves them to disk, and replays them in subsequent test runs. Tools like VCR (Ruby), VCR.py (Python), Polly.js (JavaScript), and WireMock (Java) implement this at the HTTP layer: the first time a test runs, the tool intercepts the outgoing request, makes the real call, records the response, and saves it to a YAML or JSON file in the repo. Subsequent runs match outgoing requests against the saved fixtures and replay the recorded responses without touching the network.
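The record-then-replay mechanism can be sketched in a few lines. This is a simplified stand-in rather than how any of the tools above is implemented: the `Cassette` class, the `transport` callable, and its `(method, url, body)` signature are all hypothetical, and real tools intercept at the HTTP layer instead of wrapping a function.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional


class Cassette:
    """Records responses on first use and replays them on later runs."""

    def __init__(self, path: Path, transport: Callable[[str, str, dict], dict]):
        self.path = path
        self.transport = transport  # hypothetical: performs the real call
        # Load previously recorded entries if the fixture file exists.
        self.entries = json.loads(path.read_text()) if path.exists() else {}

    @staticmethod
    def _key(method: str, url: str, body: dict) -> str:
        # Match requests on method, URL, and a canonical form of the body.
        raw = json.dumps([method.upper(), url, body], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def request(self, method: str, url: str, body: Optional[dict] = None) -> dict:
        key = self._key(method, url, body or {})
        if key not in self.entries:
            # No recording yet: make the real call and persist the response.
            self.entries[key] = self.transport(method, url, body or {})
            self.path.write_text(json.dumps(self.entries, indent=2, sort_keys=True))
        return self.entries[key]
```

Matching on method, URL, and body is one possible choice. Matching too strictly makes fixtures miss on irrelevant differences, and matching too loosely replays the wrong response, so the match rule usually needs tuning per vendor.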

The pattern is cheap to set up, typically two or three lines of test setup, and the recorded fixtures are real, so the test exercises real code paths against real response shapes. The first benefit is speed: replaying a fixture takes microseconds to a few milliseconds, versus the hundreds of milliseconds of a real API call. The second is determinism: tests no longer depend on the third-party API being up. The third is debuggability: when a test fails, the fixture file shows exactly what the test thought the API would return.

The discipline that keeps fixture-recording honest is a periodic refresh policy. Fixtures recorded once and never re-recorded drift from reality as the third-party API evolves. The pattern that works is to have a separate CI job — daily or weekly — that re-records all fixtures against the real API, runs the test suite against the new fixtures, and fails if the new fixtures break the tests. This catches API changes early without paying the per-test-run cost of hitting the real API.
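A refresh job is easier to act on if it also reports how the response shape changed, not just that a test broke. The sketch below is one way to do that, assuming fixtures are plain JSON values. The `shape_drift` helper is hypothetical and compares structure only (keys and types), ignoring values that legitimately change between recordings such as IDs and timestamps.

```python
from typing import Any, List


def shape_drift(old: Any, new: Any, path: str = "$") -> List[str]:
    """List the paths where two JSON values differ in keys or types."""
    if isinstance(old, dict) and isinstance(new, dict):
        drift: List[str] = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}"
            if key not in new:
                drift.append(f"{child}: removed")
            elif key not in old:
                drift.append(f"{child}: added")
            else:
                drift.extend(shape_drift(old[key], new[key], child))
        return drift
    if isinstance(old, list) and isinstance(new, list):
        # Compare only the first elements; assumes homogeneous lists.
        if old and new:
            return shape_drift(old[0], new[0], f"{path}[0]")
        return []
    if type(old) is not type(new):
        return [f"{path}: {type(old).__name__} -> {type(new).__name__}"]
    return []
```

An empty result means the old and new recordings have the same structure. A non-empty result gives a reviewer a short list of paths to check against the vendor's changelog.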

Pattern 2: Contract testing against the vendor's spec

The contract-testing pattern uses the vendor's published OpenAPI spec (or equivalent) as the source of truth for what the API returns. Tools like Schemathesis and the broader OpenAPI test ecosystem generate test cases from the spec and validate that responses conform to the spec's schemas. Pact (originally Ruby, now polyglot) takes a related consumer-driven approach, in which the contract is generated from the consumer's own tests rather than from a vendor spec. In both cases the goal is to catch a hand-written mock that returns a shape the real API would not return.

The pattern is structurally stronger than fixture-recording for the cases the spec covers, because the test asserts against the contract rather than against a snapshot of one moment in the contract's history. If Stripe documents that a Charge object always has a status field with one of five enum values, a contract test catches a mock that returns a sixth value. A fixture-recording test only catches the case if the original recording happened to exercise the relevant code path.
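The enum example can be made concrete with a small check. This is a minimal sketch, not a JSON Schema validator: `CHARGE_SCHEMA` is a hypothetical fragment whose field names and enum values are placeholders rather than anything copied from a vendor's published spec, and `contract_violations` only checks required fields, types, and enums.

```python
from typing import Any, Dict, List

# Hypothetical schema fragment. The enum values are placeholders and are
# not taken from any vendor's actual spec.
CHARGE_SCHEMA: Dict[str, Any] = {
    "required": ["id", "status"],
    "properties": {
        "id": {"type": str},
        "status": {"type": str, "enum": ["pending", "succeeded", "failed"]},
    },
}


def contract_violations(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return the ways a mock payload departs from the schema."""
    problems: List[str] = []
    for name in schema["required"]:
        if name not in payload:
            problems.append(f"missing required field: {name}")
    for name, rule in schema["properties"].items():
        if name not in payload:
            continue
        value = payload[name]
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
        elif "enum" in rule and value not in rule["enum"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
    return problems
```

Running a check like this over every hand-written mock in the test suite flags the mock that invents a status value the spec does not allow. In practice a real schema validator driven by the vendor's OpenAPI document would replace the hand-rolled rules.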

The cost is the spec-quality dependency. Contract testing is only as good as the spec; for vendors whose spec is incomplete, out of date, or non-existent, the pattern provides little value beyond what hand-written mocks would. For vendors whose spec is high-quality (Stripe, GitHub, Twilio, AWS in places), contract testing is the strongest pattern available. For vendors whose spec is missing or weak, fixture-recording is the more reliable approach.

Pattern 3: Selective end-to-end runs

The end-to-end pattern runs a small number of tests against the real API on a different cadence from the main test suite. The main suite uses fixtures or contract tests for speed. A separate suite, running nightly, on every merge to main, or before every release, exercises a small set of critical paths against the real API in a sandbox or test mode.

The pattern catches the cases the other patterns miss: real API rate limits, real authentication failures, real network conditions, and sequences of API calls that depend on state created by earlier calls. For Stripe integrations, this looks like a nightly job that runs the full happy path of customer creation, subscription creation, payment-method attachment, and invoice finalization against Stripe's test mode. For webhook-receiving integrations, this looks like a periodic job that triggers a real event and waits for the webhook to arrive.
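The "trigger an event and wait for the webhook" step needs a bounded wait, or a missing delivery hangs the job. A minimal sketch follows, assuming the test can poll some store for the received event. The `wait_for` helper and the `check` callable are hypothetical names, and the default timeout and interval are arbitrary.

```python
import time
from typing import Callable, Optional


def wait_for(check: Callable[[], Optional[dict]],
             timeout_s: float = 30.0,
             interval_s: float = 1.0) -> dict:
    """Poll `check` until it returns a value or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            # Fail loudly so a missing delivery shows up as a test failure.
            raise TimeoutError(f"no webhook received within {timeout_s}s")
        time.sleep(interval_s)
```

Here `check` would look up the expected event in whatever store the webhook receiver writes to and return `None` until it appears. Using a monotonic clock keeps the deadline correct even if the system clock is adjusted mid-run.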

The discipline that keeps end-to-end tests useful is keeping the suite small and high-signal. A handful of tests that exercise real critical paths is more valuable than a hundred tests that exhaustively cover edge cases against the real API. The edge cases belong in fixture-recorded tests; the end-to-end suite is for catching the cases where the real API behaves differently from the fixtures.

The disciplines that hold it together

The first discipline is a strict separation between tests that hit the real API and tests that do not. The default test command runs only mocked tests; running real-API tests requires an explicit flag or a different command. This prevents the real-API tests from creeping into the main suite where they cause flakiness and cost.
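One way to enforce the explicit opt-in is an environment variable checked by a skip decorator. The sketch below uses the standard library's `unittest`; the variable name `RUN_REAL_API_TESTS` and the test class are hypothetical, and the test body is a placeholder for a real sandbox call.

```python
import os
import unittest

# Hypothetical variable name; any explicit opt-in switch would work.
RUN_REAL_API = os.environ.get("RUN_REAL_API_TESTS") == "1"


@unittest.skipUnless(RUN_REAL_API, "set RUN_REAL_API_TESTS=1 to run real-API tests")
class RealApiTests(unittest.TestCase):
    def test_sandbox_round_trip(self):
        # Placeholder body; a real test would call the vendor's sandbox here.
        self.assertTrue(RUN_REAL_API)
```

With this in place the default test command reports the class as skipped, and the nightly job sets the variable to run it. The skip message doubles as documentation for anyone wondering why the tests did not run.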

The second discipline is choosing the test strategy per vendor rather than globally. Different vendors have different spec quality, different rate limits, different test modes, and different evolution rates. A team integrating with both Stripe and a Twilio clone that publishes no spec should not use the same test strategy for both.

The third discipline is keeping the fixtures small. A fixture file with a multi-megabyte response body is a sign that the test is exercising more of the API than it needs. Trim fixtures to the fields the test actually uses; this makes them faster to load, easier to diff in code review, and more obvious when they need to be updated.
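Trimming can be done mechanically once the test declares which fields it reads. This is a sketch under that assumption: the `trim` helper and its dotted-path convention are hypothetical, and it handles nested objects only, not paths into lists.

```python
from typing import Any, Dict, Iterable


def trim(payload: Dict[str, Any], keep: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `payload` containing only the dotted paths in `keep`."""
    trimmed: Dict[str, Any] = {}
    for dotted in keep:
        source: Any = payload
        target = trimmed
        *parents, leaf = dotted.split(".")
        try:
            # Walk down the nested objects, mirroring them in the output.
            for part in parents:
                source = source[part]
                target = target.setdefault(part, {})
            target[leaf] = source[leaf]
        except (KeyError, TypeError):
            # A missing path usually means the API or the test has changed.
            raise KeyError(f"path not found in fixture: {dotted}") from None
    return trimmed
```

Raising on a missing path is deliberate: if a field the test depends on disappears from a fresh recording, the trim step fails instead of silently producing a fixture without it.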

The fourth discipline is making fixture refresh easy. The friction of re-recording fixtures determines how often it happens, which determines how stale the fixtures get. A single command that re-records everything against the real API, with credentials managed through the team's secret store, is the difference between weekly refreshes and never-refreshes.

The deeper observation

The right test strategy for third-party APIs is layered rather than a single technique. The layers are fixture-recorded tests for fast feedback on most cases, contract tests where the vendor's spec is high-quality, and selective end-to-end tests for the critical paths. Each layer catches a class of bug the others miss, and the operational cost of running all three is much lower than the cost of either hitting the real API in every test or hand-writing mocks that drift from reality. The teams that get this right have thought about it as a system; the teams that get it wrong have usually picked one extreme and lived with the costs.

