Distributed Tracing for Small Teams: The Minimum Viable Setup

Distributed tracing is one of the three pillars of observability, alongside metrics and logs. The textbook treatment makes it sound like a project: instrument every service with OpenTelemetry, run a tracing backend, learn a query language, train the team. For a small team running three or four services, this is overkill that will keep tracing on the someday list forever.

The minimum viable distributed tracing setup is much smaller than the textbook suggests, and the returns on effort diminish sharply. The first hour of effort gets you most of the benefit. The rest of the effort is fit and finish for cases you may never encounter at small scale.

What you actually need

Distributed tracing answers one question: "what happened during this request, across all the services it touched, in chronological order with timing?" To answer that question, you need three primitives.

First, a unique trace ID generated at the edge of your system (typically by the load balancer or the first service to receive the request) and propagated through every subsequent service. Second, a span ID for each operation within the trace, with a parent span ID linking it to its caller. Third, a way to record the spans somewhere you can query them.
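As a rough sketch of the first two primitives, a span can be modeled as a small record. The class name Span, its fields, and the helper functions are our own illustrative choices, not any library's API; the ID lengths are chosen to match the W3C Trace Context sizes (32 and 16 hex characters).

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


def new_trace_id() -> str:
    # 16 random bytes -> 32 lowercase hex characters
    return secrets.token_hex(16)


def new_span_id() -> str:
    # 8 random bytes -> 16 lowercase hex characters
    return secrets.token_hex(8)


@dataclass
class Span:
    """One operation within a trace (hypothetical minimal shape)."""
    trace_id: str                         # shared by every span in the request
    name: str
    parent_span_id: Optional[str] = None  # None marks the root span
    span_id: str = field(default_factory=new_span_id)
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None

    def child(self, name: str) -> "Span":
        # A child keeps the trace ID and points back at its caller.
        return Span(trace_id=self.trace_id, name=name, parent_span_id=self.span_id)


root = Span(trace_id=new_trace_id(), name="GET /invoices")
db = root.child("SELECT invoices")
```

The third primitive, recording, can be as simple as writing each finished span as one structured log line.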

That is the entire conceptual surface. Everything else (automatic instrumentation, sampling, baggage propagation, exemplar linking) is optimization for scale that small teams do not need.

The minimum implementation

For a Python or Node service, the minimum viable trace ID propagation is a piece of middleware that does three things: read the trace ID from the incoming request header (or generate one if not present), put it in a thread-local or async-context variable so the rest of the request can access it, and add it to the headers of every outgoing HTTP call.
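A minimal sketch of those three steps for a WSGI application might look like the following. It assumes a plain single-value header (X-Request-ID here) rather than a structured one, and the names TraceIdMiddleware, current_trace_id, and outgoing_headers are hypothetical, chosen for illustration.

```python
import contextvars
import uuid

# Holds the trace ID for the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

TRACE_HEADER = "X-Request-ID"  # assumption: a simple single-value header


class TraceIdMiddleware:
    """WSGI middleware: read or create a trace ID and expose it via a ContextVar."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # WSGI exposes the X-Request-ID header as HTTP_X_REQUEST_ID.
        trace_id = environ.get("HTTP_X_REQUEST_ID") or uuid.uuid4().hex
        token = current_trace_id.set(trace_id)
        try:
            # Caveat: if the app streams its body lazily, the ID is reset
            # before the body is iterated. Fine for a sketch, not for streaming.
            return self.app(environ, start_response)
        finally:
            current_trace_id.reset(token)


def outgoing_headers(extra=None):
    """Headers to attach to every outgoing HTTP call made during this request."""
    headers = dict(extra or {})
    trace_id = current_trace_id.get()
    if trace_id:
        headers[TRACE_HEADER] = trace_id
    return headers
```

You would wrap the application once (app = TraceIdMiddleware(app)) and pass outgoing_headers() to whatever HTTP client the service uses. A context variable rather than a plain thread-local keeps the sketch usable under asyncio as well as threads.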

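The middleware is roughly fifteen lines of code per service. The shared header is conventionally X-Request-ID or traceparent (W3C Trace Context); we recommend traceparent because it is a public standard that most current tracing tools and OpenTelemetry SDKs understand. The format is traceparent: 00-<32-char trace id>-<16-char span id>-01, where 00 is the version, both IDs are lowercase hexadecimal, the span ID identifies the calling operation, and the trailing 01 flag marks the trace as sampled.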
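Reading and writing that header takes only a few lines. The sketch below handles version 00 only and rejects anything else, which is stricter than the W3C specification asks of a full implementation; the function names are our own.

```python
import re
import secrets

# version "00", 32 hex trace ID, 16 hex parent (span) ID, 2 hex flags
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return (trace_id, parent_span_id, flags), or None if the header is unusable."""
    if not value:
        return None
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # The specification treats all-zero IDs as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags


def format_traceparent(trace_id, span_id, sampled=True):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def next_traceparent(incoming):
    """Continue the incoming trace with a fresh span ID, or start a new trace."""
    parsed = parse_traceparent(incoming)
    if parsed:
        trace_id = parsed[0]
        sampled = bool(int(parsed[2], 16) & 1)  # lowest flag bit means "sampled"
    else:
        trace_id = secrets.token_hex(16)
        sampled = True  # assumption: new traces start out sampled
    return format_traceparent(trace_id, secrets.token_hex(8), sampled)
```

The value returned by next_traceparent is what the service would send on its own outgoing calls: same trace ID, new span ID.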

Every log line your service emits should include the trace ID. Every error report should include the trace ID. Every outgoing HTTP request should propagate the trace ID. With nothing else in place, you can already grep your logs across services for a specific trace ID and reconstruct what happened during a single request.
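With Python's standard logging module, one way to get the trace ID onto every line is a filter that copies it from a context variable onto each record. This is a sketch: the logger name "app", the log format, and the variable current_trace_id are assumptions for illustration.

```python
import contextvars
import logging

# Set by the request middleware; "-" marks log lines emitted outside a request.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


def configure_logging(stream=None):
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(name)s %(message)s"
    ))
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Because the field is always written as trace_id=<id>, reconstructing a request is a single grep for that string across the log files of every service.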

When to add a tracing backend

Logs with trace IDs work for chronological reconstruction but they are flat: you can see each event but not the structure of the request. To see "service A called service B which called service C in parallel with D," you need a tracing backend that understands span hierarchies.
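What the backend adds is, at its core, a tree built from parent span IDs. The sketch below shows the idea on flat span records; the dictionary keys and the sample data are made up for illustration and do not follow any backend's storage format.

```python
from collections import defaultdict


def render_trace(spans):
    """Turn flat span records into indented lines, children under their parent."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Siblings are listed in start order so the output reads chronologically.
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            lines.append(f"{'  ' * depth}{span['name']} ({span['duration_ms']} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


spans = [
    {"span_id": "a1", "parent_span_id": None, "name": "service-a", "start_ms": 0, "duration_ms": 120},
    {"span_id": "b1", "parent_span_id": "a1", "name": "service-b", "start_ms": 5, "duration_ms": 100},
    {"span_id": "d1", "parent_span_id": "b1", "name": "service-d", "start_ms": 12, "duration_ms": 80},
    {"span_id": "c1", "parent_span_id": "b1", "name": "service-c", "start_ms": 10, "duration_ms": 40},
]
print("\n".join(render_trace(spans)))
```

A real backend layers storage, search, and a timeline view on top of this, which is the part you do not want to build yourself.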

The lightweight options are Jaeger (single binary, Apache 2.0, runs on Docker with in-memory, Badger, or Cassandra storage), Zipkin (older, also single-binary), and Tempo (from Grafana, designed to scale). For a small team, Jaeger is the easy choice. The all-in-one image runs in roughly 200 megabytes of RAM and stores traces in memory, so they are lost on restart; for any serious use you swap to Cassandra or Elasticsearch storage, but the in-memory version is enough to get started and to evaluate whether tracing is worth the effort.

Hosted options exist for teams that don't want to run another service: Honeycomb, Lightstep (now part of ServiceNow), Datadog, New Relic. Their pricing units differ (events, spans, gigabytes ingested, or hosts), but the bill broadly tracks how much trace data you send. For a team doing a million requests per day across four services, the bill is typically in the low hundreds of dollars per month, which is a reasonable price for not running another piece of infrastructure.

The instrumentation question

Once you have a backend, you need to actually emit spans. The two paths are manual instrumentation (you write code that creates spans and records timing) and automatic instrumentation via OpenTelemetry's auto-instrumentation libraries (you install a library and it monkey-patches your HTTP client, database driver, and web framework to emit spans automatically).

Automatic instrumentation sounds appealing but has a real downside: it generates an enormous number of spans, including for operations you do not care about. A typical Django application with auto-instrumentation enabled can produce on the order of a hundred spans per request, most of which are template renders and ORM queries that add noise without insight. The signal gets buried in the volume.

The right starting point for a small team is manual instrumentation of the operations you care about: request entry, database queries (one span per logical query, not per ORM operation), outgoing HTTP calls, and any in-process operation that takes more than ten milliseconds. That is typically twenty to fifty spans per request, each one of which represents a meaningful unit of work. You can read a trace and understand what happened.

Sampling

The textbook tells you to sample at one percent at scale. The textbook is right about scale and wrong about the minimum case. At one million requests per day, a one percent sample is ten thousand traces, which is more than you will ever look at. At ten thousand requests per day (which is roughly what a small SaaS does), a one percent sample is a hundred traces, which means most user-reported problems are not in your sample.

The right small-team sampling rule is: sample 100% of errors, sample 100% of slow requests (above a threshold like p95 latency), and sample some small percentage of normal requests. The result is a trace store that contains every interesting trace and a representative sample of normal traffic. The storage cost is negligible at small-team request volumes.
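As a sketch, the rule is a three-line decision made once the request has finished, since the status and latency are not known before then. The 500 ms threshold, the 5% baseline rate, and the choice to treat only 5xx responses as errors are illustrative assumptions; substitute your own p95 and error definition.

```python
import random


def should_keep_trace(status_code, duration_ms,
                      slow_threshold_ms=500.0, baseline_rate=0.05,
                      rng=random.random):
    """Decide, after the request completes, whether to store its trace."""
    if status_code >= 500:                # assumption: errors means 5xx responses
        return True
    if duration_ms >= slow_threshold_ms:  # slow requests are always kept
        return True
    return rng() < baseline_rate          # small random sample of normal traffic
```

Deciding at the end means each service has to hold its spans in memory until the request completes. In this sketch every service decides independently, so a trace that is slow in one service and fast in another may be stored only in part; whether that matters depends on how you plan to read the traces.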

What you get

The payoff for the minimum viable setup is the ability to answer questions you previously could not answer. "Why was that request slow?" becomes a trace lookup that shows the latency breakdown across services. "What did this request actually do?" becomes a span list. "Where is the bug?" becomes a clickable link from the error report to the trace.

The payoff is not "we never debug production issues again." Tracing does not replace logs or metrics; it complements them. Metrics tell you something is wrong. Tracing tells you what happened during a specific bad request. Logs tell you the details. The three pillars work together, and the smallest team that has all three is well-equipped.

The four APIs we run (DocuMint, CronPing, FlagBit, and WebhookVault) propagate W3C traceparent headers on every call between services and emit structured logs with trace IDs. We have not yet adopted a tracing backend, because at our current scale grep on logs is sufficient. The day we adopt one, the data will already be there waiting. That is the right order to do it in: propagate first, ingest later.