Observability for Small APIs: What Actually Helps

Most observability advice is written for companies with a Datadog budget and a dedicated SRE team. For an API serving thousands of requests a day, the right tools are smaller, cheaper, and stranger than the marketing suggests.

Here is what actually helps when you are running a small production API on a single VPS, with maybe a thousand users, and no team to interpret dashboards for you.

The three pillars are real, but the order matters

Logs, metrics, and traces. The textbook lists them in that order. The textbook is wrong about the order for small operations.

Start with logs. Structured JSON logs, written to stdout, captured by Docker, rotated by the host. That is your foundation. When something breaks at 2am, the first thing you reach for is "what did the request look like?" Logs answer that. Everything else is luxury.

Add metrics only when you need to see patterns over time. "Is latency creeping up?" "Is the error rate spiking on Wednesdays?" Without metrics, you cannot answer those questions, but you can run a profitable API for months on logs alone.

Add tracing last, and only if your system actually has multiple hops worth tracing. A FastAPI app calling SQLite does not need OpenTelemetry. Save the complexity for when a request crosses three services.

What to log, specifically

For each request, log: timestamp, method, path, status code, latency in milliseconds, the authenticated user ID if any, the request ID, the client IP, and the user-agent. That is nine fields. None of them contains PII you would not already expose in your access logs. None of them contains secrets: auth headers and request bodies stay out.
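A minimal sketch of what one such log line could look like, using only the Python standard library. The helper name request_log_line and the JSON key names are assumptions made for illustration; the field list itself is the one above.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Plain stdout logger: each message is already a complete JSON document.
logger = logging.getLogger("access")
_handler = logging.StreamHandler(sys.stdout)
_handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(_handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def request_log_line(method, path, status, latency_ms, request_id,
                     client_ip, user_agent, user_id=None):
    """Build one JSON log line holding the nine per-request fields."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "method": method,
        "path": path,
        "status": status,
        "latency_ms": round(latency_ms, 1),
        "user_id": user_id,  # None for unauthenticated requests
        "request_id": request_id,
        "client_ip": client_ip,
        "user_agent": user_agent,
    }
    # No auth headers and no request body: secrets stay out of the log.
    return json.dumps(record)


# Example values only; in a real app these come from the request object.
logger.info(request_log_line("GET", "/v1/items", 200, 12.34,
                             "req-123", "203.0.113.7", "curl/8.5.0",
                             user_id="u_42"))
```

Because every line is a single JSON object, grep and jq both work on the output without any extra parsing step.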

For each error, log the same fields plus the exception class, the message, and a truncated stack trace. The truncation matters: full stack traces are noisy and storage adds up. Top ten frames is plenty.
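One way to produce the truncated trace, sketched with the standard traceback module. The function name error_fields is hypothetical, and keeping the ten frames closest to the raise (rather than the ten outermost) is an assumption about what "top ten" should mean in a Python traceback.

```python
import json
import traceback


def error_fields(exc, max_frames=10):
    """Return exception class, message, and at most max_frames stack frames."""
    frames = traceback.extract_tb(exc.__traceback__)
    kept = frames[-max_frames:]  # assumption: keep the frames nearest the raise
    return {
        "exc_class": type(exc).__name__,
        "exc_message": str(exc),
        "stack": [f"{f.filename}:{f.lineno} in {f.name}" for f in kept],
        "frames_dropped": len(frames) - len(kept),
    }


def recurse(n):
    """Toy function that fails deep in the call stack."""
    if n == 0:
        raise ValueError("boom")
    recurse(n - 1)


try:
    recurse(15)
except ValueError as exc:
    # Merge these keys into the normal per-request record before logging.
    print(json.dumps(error_fields(exc)))
```

Recording frames_dropped keeps the truncation visible, so a short stack in the log is never mistaken for a shallow call chain.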

For each background job, log start and finish events with the job ID and the duration. If a job throws, log the exception with the job ID. This is how you find ghosts: the job that started but never finished, the job that ran twice, the job that took twenty seconds when it usually takes two.
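The start and finish events can be sketched as a context manager wrapped around each job. The name job_log, the event names, and the emit parameter are illustrative assumptions, not part of any library.

```python
import json
import time
import uuid
from contextlib import contextmanager


@contextmanager
def job_log(name, emit=print):
    """Emit job_start, then job_finish or job_error, all sharing one job_id."""
    job_id = str(uuid.uuid4())
    start = time.monotonic()
    emit(json.dumps({"event": "job_start", "job": name, "job_id": job_id}))
    try:
        yield job_id
    except Exception as exc:
        emit(json.dumps({
            "event": "job_error", "job": name, "job_id": job_id,
            "exc_class": type(exc).__name__, "exc_message": str(exc),
            "duration_ms": round((time.monotonic() - start) * 1000, 1),
        }))
        raise  # the job still fails; we only record that it did
    else:
        emit(json.dumps({
            "event": "job_finish", "job": name, "job_id": job_id,
            "duration_ms": round((time.monotonic() - start) * 1000, 1),
        }))


# Hypothetical job body; replace the sleep with real work.
with job_log("send_digest"):
    time.sleep(0.01)
```

A job_start with no matching job_finish or job_error for the same job_id is the ghost: the process died or hung partway through.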

The request ID trick

Generate a UUID at the edge of your API — a middleware that runs before everything else. Put it in the response headers as X-Request-ID. Include it in every log line for that request. When a customer reports a bug, ask for the request ID; you can find every log line for that exact request in seconds.

This is a five-minute change that is worth more than most of what observability vendors will sell you.
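A sketch of the middleware, written as a plain ASGI wrapper so it needs no third-party imports. The class name RequestIDMiddleware and the context variable request_id_var are assumptions for illustration; the demo app at the bottom is a stand-in for a real application.

```python
import asyncio
import contextvars
import uuid

# Readable from any code handling the current request, e.g. a log formatter.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIDMiddleware:
    """Assign a UUID to each HTTP request and echo it as X-Request-ID."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        request_id = str(uuid.uuid4())
        token = request_id_var.set(request_id)

        async def send_with_id(message):
            # Add the header when the response starts; pass the rest through.
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-request-id", request_id.encode("ascii")))
                message = {**message, "headers": headers}
            await send(message)

        try:
            await self.app(scope, receive, send_with_id)
        finally:
            request_id_var.reset(token)


async def demo_app(scope, receive, send):
    """Stand-in ASGI app that reads the ID the way a log call would."""
    print("handling request", request_id_var.get())
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _print_sent(message):
    print("sent", message["type"], message.get("headers", ""))


async def _no_body():
    return {"type": "http.request", "body": b"", "more_body": False}


asyncio.run(RequestIDMiddleware(demo_app)({"type": "http"}, _no_body, _print_sent))
```

With FastAPI or Starlette, the same class can be registered with app.add_middleware(RequestIDMiddleware). Register it so that it wraps every other middleware, otherwise log lines emitted before it runs will carry the default "-" instead of an ID.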

Metrics that matter

The metrics worth collecting on a small API are unglamorous: requests per minute, p50/p95/p99 latency per endpoint, error rate per endpoint, and active connection count. That is enough to tell you when something is wrong and roughly where to look.

Plausible, Prometheus, or even a daily cron job that aggregates from your access logs all work. The question is not "which tool" — it is "do I look at the numbers." A simple page you load weekly beats a beautiful dashboard you never open.
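As a sketch of the cron-job option, the following aggregates JSON access-log lines into per-endpoint numbers. It assumes each line carries path, status, and latency_ms keys (hypothetical names) and uses the nearest-rank definition of a percentile, which is one of several in common use.

```python
import json
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile (integer pct) of a sorted, non-empty list."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def summarize(lines):
    """Aggregate JSON access-log lines into per-endpoint statistics."""
    latencies = defaultdict(list)
    errors = defaultdict(int)
    for line in lines:
        try:
            rec = json.loads(line)
            path = rec["path"]
            status = int(rec["status"])
            ms = float(rec["latency_ms"])
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that are not request records
        latencies[path].append(ms)
        if status >= 500:
            errors[path] += 1

    report = {}
    for path, values in latencies.items():
        values.sort()
        report[path] = {
            "requests": len(values),
            "p50_ms": percentile(values, 50),
            "p95_ms": percentile(values, 95),
            "p99_ms": percentile(values, 99),
            "error_rate": errors[path] / len(values),
        }
    return report


# Tiny inline sample; a cron job would pass the lines of the day's log file.
sample = [
    '{"path": "/v1/items", "status": 200, "latency_ms": 12.0}',
    '{"path": "/v1/items", "status": 500, "latency_ms": 340.0}',
    "a line that is not JSON",
]
print(json.dumps(summarize(sample), indent=2))
```

Requests per minute is the requests count divided by the length of the window the lines cover, so it needs no separate pass.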

The cron job you need

One scheduled job, every minute, that pings each public endpoint of your API and records latency. Fail loudly when latency exceeds a threshold or the response status is wrong. This is the most useful thing you can build, and it is fifty lines of code.
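A rough sketch of that job using only the standard library. The function names, the 1000 ms threshold, and the expected status of 200 are assumptions to adjust; the opener parameter exists only so the probe can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (status, latency_ms); status is None if no response arrived."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except OSError:
        status = None  # DNS failure, refused connection, timeout
    return status, (time.monotonic() - start) * 1000


def evaluate(status, latency_ms, expected_status=200, max_latency_ms=1000):
    """Return a problem description, or None when the check passed."""
    if status is None:
        return "unreachable"
    if status != expected_status:
        return f"unexpected status {status}"
    if latency_ms > max_latency_ms:
        return f"slow: {latency_ms:.0f} ms"
    return None


def run_checks(urls, opener=urllib.request.urlopen):
    """Probe every URL and return one message per failing endpoint."""
    failures = []
    for url in urls:
        status, latency_ms = probe(url, opener=opener)
        problem = evaluate(status, latency_ms)
        if problem:
            failures.append(f"{url}: {problem}")
    return failures


# From cron (the URL below is a placeholder):
#   failures = run_checks(["https://api.example.com/health"])
#   if failures: print("\n".join(failures), file=sys.stderr); sys.exit(1)
```

Run it from a machine other than the one hosting the API, so that a dead VPS or a broken DNS record shows up as a failure rather than as silence.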

If you don't want to build it, services like CronPing handle the scheduling and alerting side cheaply. The point is: monitor from outside the box. Internal metrics will keep reporting a healthy API at the very moment your DNS goes down and no client can reach it.

Alerting without screaming

Three alerts, no more, on a small API:

The site is down. External health check fails twice in a row.
The error rate is high. More than 5% of requests in the last 5 minutes return 5xx.
Latency is bad. p99 latency exceeds 2 seconds for 5 minutes.

Anything more aggressive will train you to ignore alerts. Anything less and you will miss real problems.
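The three rules are small enough to write as plain functions. This is a sketch under stated assumptions: the inputs (a list of recent external check results, the status codes seen in the last five minutes, one p99 value per minute) are hypothetical shapes, and "for 5 minutes" is read as "each of the last five one-minute p99 values is over the limit".

```python
def site_down(external_checks_ok):
    """True when the two most recent external checks both failed."""
    recent = external_checks_ok[-2:]
    return len(recent) == 2 and not any(recent)


def error_rate_high(statuses, threshold=0.05):
    """True when more than `threshold` of the window's responses are 5xx."""
    if not statuses:
        return False
    errors = sum(1 for s in statuses if 500 <= s <= 599)
    return errors / len(statuses) > threshold


def latency_bad(p99_per_minute_ms, limit_ms=2000, minutes=5):
    """True when p99 exceeded limit_ms in each of the last `minutes` minutes."""
    recent = p99_per_minute_ms[-minutes:]
    return len(recent) == minutes and all(v > limit_ms for v in recent)


# Example: one failed check is not an alert, two in a row is.
print(site_down([True, True, False]))
print(site_down([True, False, False]))
```

Requiring two failed checks and five bad minutes is what keeps a single slow request or one dropped packet from paging anyone.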

What you don't need

You don't need APM at this size. You don't need distributed tracing for one service. You don't need a log aggregation cluster. You don't need 90-day metric retention. You don't need to learn PromQL.

You need: structured logs you can grep, a request ID in every line, a daily latency report you actually read, and three alerts that fire only when something is genuinely broken.

The companies selling you complex observability tools are not wrong about the value of those tools — they are wrong about the size of the operation that needs them. For a small API, simple wins.

The graduation path

The signal that you have outgrown this setup is when you can no longer answer a question by grepping logs. "Show me the 95th percentile latency for endpoint X over the last seven days, broken down by client IP" is the kind of query that needs a metrics store. When that question matters and grep is too slow, you graduate to Prometheus or InfluxDB or whatever fits.

Until then: logs, request IDs, three alerts, one cron job. That is observability. The rest is decoration.