Service Discovery Without a Service Mesh: What Small Teams Should Actually Do

The service discovery conversation tends to start at the wrong end. The blog posts and conference talks describe Istio, Linkerd, Consul Connect, and the kind of architecture where a hundred microservices need to find each other dynamically across cloud regions. Then a small team with four services reads the same material and concludes that they need the same machinery. They install a service mesh, configure mTLS between every pair of services, and discover six months later that they have spent more time on the mesh than on the product.

The right starting question is not "how do services find each other" but "how often does the answer change, and what happens when it is wrong." For a small studio with services on a single host or a small cluster, the answer is "rarely" and "you notice immediately." That changes everything.

The four levels of discovery, ordered by complexity

The simplest level is hardcoded URLs. Service A talks to Service B at http://service-b:8080. The hostname resolves to a stable address through whatever DNS layer exists. If you are running Docker Compose on a single host, the compose network does this for free. If you are running on Kubernetes, the cluster DNS does it for free. If you are running on bare metal with a hosts file, you are maintaining the mapping by hand. In all cases, the discovery problem is solved by a config value plus a name resolver.
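To make "a config value plus a name resolver" concrete, here is a minimal Python sketch. The environment variable name SERVICE_B_URL and the helper functions are assumptions for illustration, not something the setup described here prescribes; the default is the example URL from the paragraph above.

```python
import os
from urllib.parse import urlsplit

# Default taken from the example above; the hostname is resolved by
# whatever DNS layer the platform provides (compose network, cluster DNS).
DEFAULT_SERVICE_B_URL = "http://service-b:8080"


def service_b_url(env=None):
    """Return the base URL for Service B, preferring an override from env.

    SERVICE_B_URL is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    url = env.get("SERVICE_B_URL", DEFAULT_SERVICE_B_URL)
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"SERVICE_B_URL is not a usable URL: {url!r}")
    return url.rstrip("/")


def endpoint(path, env=None):
    """Join a request path onto the configured base URL."""
    return f"{service_b_url(env)}/{path.lstrip('/')}"
```

The point of the sketch is that "discovery" at this level is one lookup in configuration. Moving Service B means changing one value, and nothing in the calling code needs to know how the name resolves.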

The second level is a reverse proxy with hostname routing. Caddy or Nginx sits in front of all services, and clients hit the proxy with a hostname. The proxy knows where each service lives and forwards the request. This is what we run: every service is reachable as service.anethoth.com from outside, and as http://service:port from inside the Docker network. The proxy is the canonical place that knows what is where, and updating it is a matter of changing one file.
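Conceptually, hostname routing is a lookup table keyed on the request's Host header. The Python sketch below illustrates that idea only; it is not how Caddy or Nginx are implemented, and the hostnames, service names, and ports are hypothetical placeholders.

```python
# Hypothetical routing table: public hostname -> internal upstream URL.
# In a real deployment this mapping lives in the proxy's config file.
ROUTES = {
    "service-a.example.com": "http://service-a:8080",
    "service-b.example.com": "http://service-b:8080",
}


def upstream_for(host_header, routes=ROUTES):
    """Map a Host header to an upstream URL, or None if no route matches.

    Simplification: strips an optional ":port" suffix and lowercases the
    name; IPv6 literals and wildcard hosts are out of scope for this sketch.
    """
    host = host_header.strip().lower().partition(":")[0]
    return routes.get(host)
```

Because the table is the only place that records where each service lives, reading it answers the question of what is deployed and where.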

The third level is a key-value store that services query for endpoints, such as Consul, etcd, or ZooKeeper. Services register themselves on startup, and clients look up the current endpoints in the store before connecting. This buys you dynamic registration: spin up a new instance and it appears in the store within seconds. The cost is operational: now you have a coordination service that needs to be highly available, and a registration discipline that every service must follow.
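The registration discipline is easier to see in miniature. The following is a toy in-memory registry in Python, assuming a heartbeat-with-TTL scheme; it is not the API of Consul, etcd, or ZooKeeper, and the class and method names are invented for this sketch.

```python
import time


class Registry:
    """Toy service registry: entries expire unless refreshed by a heartbeat."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # (service, address) -> last-seen time

    def register(self, service, address):
        """Record (or refresh) an instance of a service."""
        self._entries[(service, address)] = self._clock()

    # A heartbeat is just a re-registration in this sketch.
    heartbeat = register

    def lookup(self, service):
        """Return the addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            addr
            for (name, addr), seen in self._entries.items()
            if name == service and now - seen <= self._ttl
        )
```

Even this toy shows where the cost comes from: every service must keep sending heartbeats, every client must handle an empty lookup, and the registry itself has to stay up for any of it to work.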

The fourth level is a service mesh: a sidecar process per container that handles discovery, load balancing, mTLS, traffic policy, and observability. Istio and Linkerd are the canonical examples. They buy you everything: dynamic discovery, encrypted transport between services, traffic shaping, retry policies, circuit breakers, and detailed metrics. The cost is operational complexity that scales with the number of services. For a small team, the mesh is often more code than the application.

Where the cost actually lives

The marketed cost of a service mesh is the operator install: a Helm chart, some YAML, a quick configuration. The actual cost is everything that comes after.

You have to keep the mesh upgraded as Kubernetes versions roll forward, and the upgrade paths are not always smooth. You have to debug a fundamentally new layer of failure: when a request mysteriously fails, is it the application, the sidecar, the mesh control plane, or something else? You have to run the control plane in a way that does not become a single point of failure for every service. You have to staff someone who understands the mesh well enough to be on call for it.

For a team of three, this is a part-time job that competes with shipping product. The mesh community will tell you that this gets easier with experience, and they are right, but the floor is high. There is a minimum competence required to run a mesh in production, and that floor does not get lower.

What we actually do

Across DocuMint, CronPing, FlagBit, and WebhookVault, the discovery story is "Caddy in front, Docker network behind, and stable hostnames everywhere." Every service has a name in the compose file. Every service is reachable from inside the network at http://name:port. Every service is reachable from outside at name.anethoth.com, with Caddy handling TLS termination and routing.

When we add a new service, the steps are: add it to the compose file, add a Caddy site block for it, and ping the URL to confirm. The service is discoverable from that moment. There is no registration, no health-check polling, no policy distribution. There is also no automatic load balancing across instances, because there are no instances; each service runs as a single container.
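The confirmation step can be scripted. This is a minimal Python sketch using only the standard library; the function names, the port, and the example.com domain are placeholders, and the internal URL will only resolve when the script runs inside the Docker network.

```python
import urllib.request


def urls_to_check(name, port, domain):
    """Build the internal and external URLs for a newly added service."""
    return [f"http://{name}:{port}/", f"https://{name}.{domain}/"]


def is_up(url, timeout=3.0):
    """Return True if the URL answers with a success status.

    urlopen raises HTTPError for 4xx/5xx responses and URLError for
    connection or DNS failures; both are subclasses of OSError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical service name and port, for illustration only.
    for url in urls_to_check("newservice", 8080, "example.com"):
        print(url, "up" if is_up(url) else "DOWN")
```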

If we ever need to run multiple instances of a service, the next step is Caddy with multiple upstream addresses or a small load balancer in front. The step after that, if traffic ever justifies it, is a real cluster with actual orchestration. The step we have not needed and probably will not need is a service mesh.
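What "multiple upstream addresses" amounts to, at its simplest, is rotating through a list. The Python sketch below shows plain round-robin selection as an illustration of the idea; it is not Caddy's implementation, and the upstream addresses are hypothetical.

```python
import itertools


class RoundRobin:
    """Hand out upstream addresses in a fixed rotation."""

    def __init__(self, upstreams):
        upstreams = list(upstreams)
        if not upstreams:
            raise ValueError("at least one upstream is required")
        self._cycle = itertools.cycle(upstreams)

    def next_upstream(self):
        """Return the next upstream in the rotation."""
        return next(self._cycle)


# Hypothetical: two instances of the same service on the Docker network.
balancer = RoundRobin(["http://service-b-1:8080", "http://service-b-2:8080"])
```

A real proxy layers health checks and retries on top of this, which is a good reason to let the proxy do it rather than writing it into each client.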

What about mTLS

The argument for service mesh that has the most weight at small scale is mTLS between services. The reasoning is that internal traffic should be encrypted, that you should not trust the network even inside your own cluster, and that mesh sidecars make this automatic.

The counter-argument, if you are on a single host or a tightly controlled network, is that the threat model is different. The internal Docker network is not addressable from outside unless you publish a container's port on the host. Traffic between containers does not leave the host. The attacker scenario that mTLS protects against (network-level eavesdropping or man-in-the-middle attacks on internal traffic) requires the attacker to already be on the host, at which point they have other paths to compromise that mTLS does not close.

For larger clusters with traffic crossing nodes, the calculation is different and mTLS earns its keep. For a single-host deployment, it is paying for protection against a threat that is not the live one.

The trap of premature mesh

The trap is the same one that catches teams in many other architectural debates: copying patterns from organizations whose problems they do not have. A mesh is the right answer for an organization with hundreds of services across multiple clusters, deployed by multiple teams with varying skill levels, with strong compliance requirements around internal traffic. For that organization, the alternative is worse: ad hoc discovery and security mechanisms that fail in surprising ways.

For a small team with a small number of services on a small footprint, the right answer is to use the simplest mechanism that works and to graduate to more complex mechanisms only when forced to. Hardcoded names plus a reverse proxy is sufficient until it is not, and the moment you find out it is not, you will know exactly what is missing and pick the right next layer.

Signals to watch for

The signals that suggest you have outgrown the simple pattern are these: services start moving between hosts often enough that hostnames cannot be stable; the number of services exceeds what one Caddyfile can clearly manage; multiple teams need to deploy independently and start tripping over each other's configuration; compliance demands explicit per-service authentication that the network layer cannot provide.

The signal that you have not outgrown it: you can reason about which service talks to which by reading one config file. If that is still true, you do not need a mesh. You need to keep shipping the product.

The deeper lesson, repeated across every infrastructure decision, is that complexity is not free and the bill arrives in installments. Pay it when you must. Skip it when you can.