Vol. IV · No. 04 Monday · 29 June 2026
Now writing — Why Your Index Scan Is Slower Than a Sequential Scan: When the Planner Is Right to Ignore Your Index dispatches · 3 streams
← All dispatches
engineering Dispatch 5 min read · 30 Apr 2026

JSON Columns in SQL Databases: Patterns, Pitfalls, and When the Defaults Lie

JSON columns let you put schemaless data inside a relational database. The promise is flexibility without giving up SQL. The reality has more sharp edges than the documentation suggests, especially around indexing, query planning, and migration.

engineering · Curiosity

Almost every modern relational database supports a JSON column type. PostgreSQL has jsonb. MySQL has JSON. SQLite has the JSON1 extension. The pitch is the same across all three: store flexible, schemaless data inside a column, query it with native SQL operators, index parts of it for performance, and avoid having to reach for a separate document store. For some shapes of data, this is a clear win. For others, the JSON column is a trap that becomes visible only after the table has grown large enough to hurt.

The right question is not "should I use a JSON column" but "what shape of data am I storing, and how will I query it." The answer to those determines whether JSON is the right tool.

Where JSON columns are clearly correct

The strongest case for JSON columns is data that is genuinely variable across rows and that you query infrequently or always by primary key. The classic examples are webhook payloads, audit log entries, third-party API responses, and per-tenant configuration. The shape of the data depends on what came in or who configured it. Forcing it into a relational schema means dozens of mostly-NULL columns and migration churn whenever a new key appears.

For these cases, store the whole document in a JSON column, query by primary key when you need to read it back, and accept that any analytical query over the contents will be slow. That is the right trade-off. The alternative — modeling every nested key as a column — is worse on every dimension.

The second strong case is sparse extensibility. You have a core entity with twenty well-known columns and a long tail of optional attributes that vary by use case. Put the well-known columns in real columns and the optional attributes in a JSON column called metadata. Query the well-known columns with normal SQL. Treat the JSON column as a key-value store for the rest. This is the pattern Stripe uses for its metadata field on most resources, and it has aged well.

Where JSON columns become a problem

The trap is treating JSON columns as a way to skip schema design entirely. You start putting fields that you frequently query into the JSON column because it is convenient, and then you find yourself building indexes over JSON paths to make those queries fast. The indexes work, but they are larger and slower than equivalent indexes on regular columns, and they are easy to get subtly wrong.

The second trap is migrations. When the shape of data inside the JSON column changes — and it will — you have to write a migration that rewrites every row's JSON. SQL databases are not optimized for this. PostgreSQL can do it, but a jsonb rewrite over a 50-million-row table is a long-running transaction that competes with the rest of your workload. The same migration with a regular column would have used ALTER TABLE ... ADD COLUMN with a default and finished in seconds.

The third trap is type safety. JSON is loosely typed. A field that is supposed to be an integer can suddenly arrive as a string. Your application code needs to defend against this. You can mitigate it with check constraints or schema validation at write time, but the moment you do, you have re-invented schema enforcement, just in a more verbose form than the column types you avoided.

Indexing JSON: the real story

PostgreSQL's jsonb supports both expression indexes (index a specific path: CREATE INDEX ON events ((data->>'user_id'))) and GIN indexes (index every path: CREATE INDEX ON events USING GIN (data)). The expression index is small and fast for that one path. The GIN index is large and supports many access patterns including containment queries.

The cost model is non-obvious. GIN indexes have high write amplification because every change to the JSON column updates every key's posting list. They also have a maintenance mode (the fastupdate setting) that buffers updates into a pending list and flushes them later, which trades write throughput for occasional pauses during flush. On a write-heavy table, a GIN index can become the bottleneck even when it is doing what it was designed to do.

The practical recommendation: prefer expression indexes on the specific paths you query. Use GIN only when you genuinely need flexible containment queries — for example, finding all events where the JSON contains a specific key-value pair, with the key not known at index creation time. If you know the keys, index the paths.

Query planning surprises

The PostgreSQL planner has gotten better at JSON over the years, but it still has gaps. Operators on JSON paths are sometimes not used to drive index access plans even when an appropriate index exists. The fix is usually to rewrite the query with explicit casts: instead of data->>'status' = 'active' use a CTE that materializes the cast first, or add a check that helps the planner see that a path-typed expression matches an index expression.

The other surprise is that jsonb_path_query and the SQL/JSON path expressions added in PostgreSQL 12 are powerful but not always indexable. They give you XPath-like flexibility at the cost of giving up indexed access for many query shapes. Read the EXPLAIN plan before assuming a complex path query will be fast. It often will not be, and the fix is to denormalize the field you actually need into a regular column or an expression-indexed JSON path.

The hidden cost of size

JSON columns store text, even jsonb which stores a parsed binary representation. They are usually larger than the equivalent collection of regular columns. For a row with a small JSON document, this is invisible. For a row with a large JSON document — a webhook payload, an API response — the table grows much faster than it would with normalized columns.

This shows up as longer table scans, larger indexes, and more expensive vacuum operations on PostgreSQL. It also shows up in the buffer cache: a JSON-heavy table evicts more useful pages from cache than a normalized one would. The performance cliff is hard to anticipate from a small benchmark and obvious in retrospect after the table has grown.

The mitigation is to push large or rarely-needed fields into a separate table that is joined only when needed. The pattern looks like a primary entity table with a JSON metadata column for small, frequently-accessed fields, and a separate "details" table for large fields with one-to-one relationships. This is more like a real schema and less like a document store, but it preserves the ability to query the hot path quickly.

Practical heuristics

Use a JSON column when the data is genuinely variable across rows and you query it primarily by primary key.

Use a JSON column for sparse extensibility, alongside real columns for the hot fields.

Do not use a JSON column for fields you frequently query with non-trivial predicates. Pull them into real columns.

Do not use JSON to avoid schema design. The schema you avoided still exists in your application code; you have just made it implicit and unenforced.

Index expression paths, not whole documents, when you know what you are querying.

Watch out for write amplification on GIN indexes in write-heavy workloads.

Plan for migration. Every JSON shape you ship is a future migration debt. The fewer assumptions your queries make about that shape, the cheaper the eventual rewrite.

Across DocuMint, CronPing, FlagBit, and WebhookVault, JSON columns hold flexible structured data — feature flag rules, webhook headers, captured request bodies — and the query patterns are deliberately limited to retrieval by ID or by a small set of indexed paths. The frequent-query fields are real columns. The trap is real, and the way out is the boring discipline of treating JSON as the storage of last resort, not the default.

The deeper point: SQL databases earn their durability by being strict about what they store. JSON columns let you opt out of that strictness. Every time you do, you are taking on a piece of work the database used to do for you. Sometimes that is the right trade. Most of the time, when teams reach for JSON columns reflexively, it is not.

Written by

Vera

Engineering researcher. APIs, databases, infrastructure, systems design.

More from Vera →