Zipf's Law and the Strange Mathematics of Word Frequencies
The most common word in any language is roughly twice as frequent as the second most common, three times as frequent as the third, and so on. The pattern holds across languages and centuries. Why it does is one of the deepest unsolved questions in linguistics.
Pick any reasonably long text in any language. Count how often each word appears. Sort the words by frequency, with the most common at rank 1, the second most common at rank 2, and so on. The frequency of the word at rank N will be approximately proportional to 1/N. The most common word will be roughly twice as frequent as the second most common, three times as frequent as the third, ten times as frequent as the tenth.
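The counting procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rank_frequency` and the crude tokenizer (lowercase runs of letters and apostrophes) are assumptions made here, and real studies use far more careful tokenization.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


if __name__ == "__main__":
    sample = "the cat and the dog and the bird saw the cat"
    for rank, word, count in rank_frequency(sample):
        # Under an exact 1/rank law, count * rank would be roughly constant.
        print(rank, word, count, "count * rank =", count * rank)
```

On a toy sentence the product of count and rank is only loosely constant. The pattern becomes visible on texts of tens of thousands of words or more.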
This is Zipf's law, named after the Harvard linguist George Kingsley Zipf, who described it in detail in his 1949 book Human Behavior and the Principle of Least Effort. The law holds across English, Mandarin, Russian, Hungarian, Greek, and every other natural language anyone has bothered to test. It holds in modern texts and in ancient ones. It holds in conversational speech, in legal documents, in poetry, in computer-generated text from large language models. The pattern is robust enough that deviations from it have been proposed as signals of machine translation, automated text generation, and forgery.
What is strange is that nobody knows for sure why the law holds. Decades of attempts to explain it have produced multiple proposed mechanisms, none of them quite satisfactory, and the question is still actively studied.
What the law actually says
Zipf's law is a power law, specifically a power law with exponent very close to 1. In modern statistical terms: the probability of observing a word at rank N is approximately proportional to N⁻¹. On a log-log plot of frequency versus rank, the data forms a straight line with slope -1.
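The "straight line with slope -1" claim can be checked numerically by fitting a line to log(count) against log(rank). The sketch below uses plain least squares and synthetic counts that follow 1/rank exactly; the function name `loglog_slope` is an assumption made here. Least squares on a log-log plot is a rough diagnostic, not a rigorous way to fit or test a power law.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    `counts` is a list of positive word counts; it is sorted here so
    that rank 1 is the largest count.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # Idealized counts exactly proportional to 1/rank (synthetic, not real data).
    ideal = [1000.0 / rank for rank in range(1, 501)]
    print(round(loglog_slope(ideal), 3))  # slope of the idealized curve
```

For the idealized counts the fitted slope is -1 up to floating-point error. On real word counts the estimate depends on which ranks are included in the fit.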
The constant of proportionality depends on the text's vocabulary size and length. For a typical English text, the most common word ("the") accounts for about 7% of all word tokens. The top 10 words account for about 25%. The top 100 words account for about 50%. After that, the tail is so long that even an enormous text contains many words that appear only once. These are called "hapax legomena," from the Greek for "things said once."
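To see where such percentages come from, note that under an exact 1/rank law the normalizing constant is a harmonic number: the share of rank r is (1/r) divided by the sum of 1/k over the whole vocabulary. The sketch below computes the share covered by the top ranks. The vocabulary size of 50,000 and the function names are assumptions chosen for illustration, not figures from a measured corpus.

```python
def harmonic(n):
    """Sum of 1/r for r = 1..n, the normalizer of an exact 1/rank law."""
    return sum(1.0 / r for r in range(1, n + 1))


def zipf_share(top_k, vocab_size):
    """Fraction of all tokens covered by the top_k ranks under exact 1/rank."""
    return harmonic(top_k) / harmonic(vocab_size)


if __name__ == "__main__":
    vocab_size = 50_000  # assumed vocabulary size, chosen for illustration
    for top_k in (1, 10, 100):
        print(top_k, round(zipf_share(top_k, vocab_size), 3))
```

With that assumed vocabulary, the idealized law gives roughly 9% for the top word, 26% for the top 10, and 46% for the top 100, which is in the same range as the empirical figures quoted above.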
The law is a tendency rather than an exact rule. Real texts deviate from a perfect 1/N relationship at both ends — extremely common words ("the," "of," "and") are slightly less frequent than the law predicts, and the long tail has more rare words than the law predicts. The deviations are small enough that the law is a useful first approximation, large enough that they have generated a small literature of their own.
The first proposed mechanism: least effort
Zipf's own proposed explanation was the principle of least effort. Speakers want to use as few words as possible to convey their meaning. Listeners want to interpret as few distinct words as possible. The compromise between these two pressures, Zipf argued, would naturally produce a frequency distribution where a small number of words handle most of the communicative load while a large vocabulary remains available for precision.
The argument was suggestive but not rigorous. Zipf could not derive the specific 1/N shape from his principle, and his book was more a survey of where the law showed up — in city sizes, in income distributions, in the frequencies of musical notes — than a derivation of why.
The second proposed mechanism: random typing
The most uncomfortable proposed explanation, from the linguist's perspective, came from Benoit Mandelbrot in the 1950s and was later sharpened by the psychologist George Miller in 1957. They showed that if a monkey types randomly, hitting any of K letters or a space with equal probability, the resulting "words" (separated by spaces) follow an approximately Zipfian rank-frequency distribution.
This is a formal result, not a metaphor. Random typing produces a rank-frequency curve that, viewed on a log-log plot, looks much like the curve for words in real human languages. The implication, on its face, is that Zipf's law tells us nothing about language: it is just what happens when you have a lot of distinct items being generated by some weakly correlated process.
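The random-typing setup is easy to simulate. The following is a minimal sketch under stated assumptions: a five-letter alphabet, a fixed random seed, and the function name `monkey_words` are all choices made here for illustration.

```python
import random
from collections import Counter


def monkey_words(n_chars, alphabet="abcde", seed=0):
    """Type n_chars random keys (letters or space) and return the words."""
    rng = random.Random(seed)
    keys = alphabet + " "  # every letter and the space are equally likely
    typed = "".join(rng.choice(keys) for _ in range(n_chars))
    return typed.split()  # split() drops the empty words between repeated spaces


if __name__ == "__main__":
    counts = Counter(monkey_words(200_000))
    ranked = counts.most_common()
    # Print a few ranks spread across the curve to see the heavy tail.
    for rank in (1, 5, 6, 30, 31, 155):
        if rank <= len(ranked):
            word, count = ranked[rank - 1]
            print(rank, repr(word), count)
```

Because all words of the same length are equally likely, the simulated curve falls in steps (five one-letter words, then twenty-five two-letter words, and so on) rather than as a smooth line. Its overall trend on a log-log plot is nonetheless a power law.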
Linguists found this unsatisfying for the obvious reason: real languages are not random typing. The words have meanings, the meanings constrain co-occurrence, the syntax limits which sequences are grammatical. A theory that predicts the same frequency distribution from a process that has none of these features cannot be capturing what is special about language. The Mandelbrot-Miller result remains a serious challenge to any deeper explanation: any proposed mechanism has to explain why language produces a Zipf distribution that random typing also produces, without dismissing the role of meaning entirely.
The third mechanism: communicative optimization
The most developed modern explanation comes from information theory, specifically from work by Steven Piantadosi, Harry Tily, and Edward Gibson at MIT in the 2010s, building on earlier work by Manin and others. The argument is that Zipf's law emerges naturally if a language is optimized for communication subject to two constraints: words need to be short enough to produce, and the system needs to convey information at a roughly constant rate.
The mathematical setup is that for each word, there is a "cost" (related to its length or production effort) and an "information value" (the negative log of its probability in context). Optimizing communication means matching costs to information values: common situations should use cheap words, rare situations can afford expensive words. The math works out, under reasonable assumptions, to produce a Zipfian frequency distribution.
This explanation has the advantage of being rigorous and the disadvantage of being hard to test against the random-typing alternative. Both predict Zipf's law; distinguishing them requires looking at finer-grained features of word distributions, which is where active research lives now.
Heaps' law and vocabulary growth
Closely related to Zipf's law is Heaps' law, which describes how vocabulary size grows with text size. If you have a corpus of N words total (here N counts word tokens, not a rank), the number of distinct words V you have observed grows like V ≈ K·N^β, where K is a corpus-dependent constant and β is between 0.4 and 0.6 for most natural languages.
This is a power law of a different kind. It says that doubling the size of a text does not double the vocabulary; it multiplies it by 2^β, which adds roughly 30% to 50% more distinct words (about 41% when β = 0.5). The reason is that most of the words in any text are common ones that you already encountered in the first few thousand words, and the new words you find as the text grows are increasingly rare.
Zipf's law and Heaps' law are mathematically related — given Zipf's distribution, you can derive Heaps' law as a consequence. But they have independent practical interest. Heaps' law is what tells you, for instance, that a 1-million-word training corpus for a machine translation system probably contains about 50,000 distinct words, while a 100-million-word corpus contains about 500,000. The growth is sublinear, and predictably so.
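The corpus figures above correspond to one particular choice of parameters. As a small sketch, the values K = 50 and β = 0.5 below are back-solved here from those two example figures; they are assumptions for illustration, not constants measured on a real corpus, and the function name `heaps_vocabulary` is likewise hypothetical.

```python
def heaps_vocabulary(n_tokens, k=50.0, beta=0.5):
    """Estimated number of distinct words in a corpus of n_tokens (V = K * N**beta)."""
    return k * n_tokens ** beta


if __name__ == "__main__":
    for n_tokens in (1_000_000, 100_000_000):
        print(n_tokens, round(heaps_vocabulary(n_tokens)))
    # Growth factor when the corpus doubles: 2**beta, independent of K.
    print(round(heaps_vocabulary(2_000_000) / heaps_vocabulary(1_000_000), 3))
```

With these parameters a 100-fold increase in corpus size yields only a 10-fold increase in vocabulary, and doubling the corpus multiplies the vocabulary by about 1.41.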
Where else the law shows up
Zipf's law generalizes far beyond words. Similar rank-frequency patterns, at least in approximate form, have been reported in:
- City populations (the largest city in a country is about twice the size of the second largest)
- Wealth distributions (in approximate form; the modern reality is more skewed)
- Website traffic (the most-visited site of any niche has roughly twice the traffic of the second-most-visited)
- Citation counts in academic papers
- The frequencies of words in code (for programming languages, not just natural languages)
- The sizes of files on a typical hard drive
- Earthquake magnitudes (the Gutenberg-Richter law is a power law in released energy, a close relative of Zipf's law rather than an exact 1/N pattern)
The general phenomenon is sometimes called a "power law distribution" or, when the exponent is close to 1, a "Zipfian distribution." There is a small academic industry around proposing mechanisms for power laws (preferential attachment, self-organized criticality, log-normal mixtures), and the empirical literature is full of cases where one mechanism fits one domain better than another. There is no single explanation that covers all the cases.
What the law might mean
The honest scientific answer, after seventy-five years of study, is that we do not know definitively why Zipf's law holds for language. We know it does. We know that random typing produces it, that information-theoretic optimization predicts it, that the law generalizes far beyond words, and that no proposed mechanism is fully satisfying.
The unsatisfying nature of the answer is itself interesting. Zipf's law is a regularity that is almost as universal as physical laws like gravity or thermodynamics, yet it lacks a comparably rigorous foundation. It might be that no single mechanism is responsible — that multiple distinct processes (random sampling, optimization, dictionary construction, communication) all happen to produce similar distributions, and the apparent universality is a kind of mathematical coincidence.
Or it might be that there is a deeper principle we have not yet identified, something like an entropy maximization condition that all these systems approximately satisfy. Researchers like Cosma Shalizi have argued for something like this view, treating Zipf's law as the most parsimonious distribution consistent with certain constraints on the underlying generating process.
What the law lets us do
Whatever its cause, Zipf's law is a useful tool. Search engines use it to weight terms (rare words are more informative than common ones, which is the basis of TF-IDF). Compression algorithms exploit it (Huffman coding gives short codes to common words). Language models implicitly learn it during training. Linguists use deviations from it to compare texts: a text that fits Zipf's law badly is unusual in some way that might be worth investigating.
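The compression point can be made concrete with a small Huffman coder that assigns code lengths to whole words. This is a sketch for illustration only: the function name `huffman_code_lengths` and the toy word counts are assumptions made here, and practical compressors usually operate on bytes or other symbols rather than on words.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Map each symbol to its Huffman code length in bits."""
    if len(freqs) == 1:
        return {symbol: 1 for symbol in freqs}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(weight, next(tiebreak), {symbol: 0})
            for symbol, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside moves one level deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


if __name__ == "__main__":
    # Toy counts that halve at each step (hypothetical, not corpus data).
    toy_counts = {"the": 8, "of": 4, "and": 2, "to": 1, "in": 1}
    lengths = huffman_code_lengths(toy_counts)
    for word in toy_counts:
        print(word, lengths[word], "bits")
```

In this toy case the most frequent word gets a 1-bit code and the two rarest words get 4-bit codes, which is the sense in which a skewed frequency distribution makes text compressible.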
Most strikingly, large language model outputs follow Zipf's law nearly as well as human texts do, which is one of several quiet pieces of evidence that the models have learned something deep about the statistical structure of language even if we are not sure what to call it.
The deeper point is that Zipf's law is one of the cases where a mathematical regularity precedes an explanation. We discovered the pattern, found it everywhere, and have not yet found the satisfying account of why. That is more common in science than the standard narrative suggests, and it is one of the more honest places where mathematics, linguistics, and information theory all admit they are still working it out.