
Highly used words stay closer to other highly used words, defining semantically popular regions. Left: intuitive cartoon. Right: scatter plots calculated using Word2vec and wordfreq datasets. Credit: Guo et al. (Proceedings B, 2026).

Human languages are known to have grown and changed considerably over the course of history, often reflecting technological, cultural, and societal shifts. Studying the evolution of languages can thus offer valuable insight into how human societies and cultures have transformed over time.

Researchers at Fudan, Harvard, and Stony Brook University recently explored the evolution of 22 languages using a combination of artificial intelligence (AI) tools, statistical methods, and a massive cache of real linguistic data. Their paper, published in Proceedings of the Royal Society B: Biological Sciences, identifies a common statistical structure across all the languages they examined and the patterns underpinning their evolution.


"New words, concepts, and ideas are generated all the time, but do hidden patterns exist that govern which concepts are likely to emerge? Are there simple mathematical models that emulate this process?" Steven Skiena, senior author of the paper, told Phys.org. "We were inspired by the idea that machine learning technologies for representing language semantics (word embeddings) give us a rigorous way to reason about the complex material provided by human language."

Studying language evolution with old and new methods

To study the evolution of human languages and cultures, Skiena and his colleagues used natural language processing (NLP) methods, algorithms designed to analyze texts or speech. These models represent words using so-called word embeddings.

Word embeddings are numerical representations of words that link every vocabulary word with a specific point in a high-dimensional semantic space. In this space, words with similar meanings are represented as nearby points.
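The idea can be illustrated with a toy example. The vectors below are invented for illustration only (real models such as Word2vec learn embeddings with roughly 300 dimensions from large text corpora); the point is simply that "nearby" is measured geometrically, typically by cosine similarity.

```python
import numpy as np

# Toy 4-dimensional "embeddings" -- invented values, not learned from text.
embeddings = {
    "cat":   np.array([0.9, 0.1, 0.0, 0.2]),
    "dog":   np.array([0.8, 0.2, 0.1, 0.3]),
    "piano": np.array([0.1, 0.9, 0.8, 0.0]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1.0 mean 'nearby'."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words sit closer together in the space.
print(cosine_similarity(embeddings["cat"], embeddings["dog"]))    # high
print(cosine_similarity(embeddings["cat"], embeddings["piano"]))  # low
```

In a trained model, the same comparison applied to the full vocabulary is what lets researchers ask how words are distributed across the semantic space.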

"In essence, our paper asks how the vocabulary of different languages are distributed in this feature space, and what kind of mathematical process would create a similar distribution," explained Skiena. "Our paper had an amazingly long gestation: we have been working on this together for more than seven years at this point, and it is great to see where we have finally gotten to."

The researchers leveraged large datasets containing words in English and in 21 other languages, then represented these words as word embeddings. This allowed them to mathematically map their meaning and look for patterns in how they relate to each other.

"We combined linguistic data going back all the way to the Middle Ages and fairly established tools, such as methods from spatial statistics popular in quantitative geography and environmental sciences, with the very modern ML and NLP techniques," explained Sergiy Verstyuk, co-first author of the paper. "This allowed us to uncover some facts about culture that have held true for many different human languages today and throughout our history."

Interestingly, Skiena, Verstyuk, and their colleagues found that the 22 languages they analyzed shared several universal statistical patterns. First, they found that popular words were consistently clustered with other popular words, producing "popular" regions of high-frequency words.

The researchers also uncovered common profiles for the speed at which words cluster. In other words, vocabulary words were organized in a hierarchical pattern, and the structure of this hierarchy was largely the same across all the languages analyzed.

"We also observed interesting temporal dynamics, showing that new words are generally created in bursts together with other recent words around them," said Skiena. This is somewhat reminiscent of how biological evolution occurs in rapid periods of significant genetic or morphological change.

Moreover, they found that Taylor's law, a power-law relationship between mean and variance originally discovered in ecological communities and later identified in other biological samples, physical data, and mathematical objects, also holds for vocabulary words. Here it connects the mean and variance of word counts grouped by semantic meaning and historical appearance, offering a single lens on both the semantics and the evolution of language.
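Taylor's law says that when observations are grouped (here, word counts grouped by meaning and period), the variance of each group scales as a power of its mean: variance ≈ a · mean^b. The sketch below is not the paper's analysis; it uses synthetic negative-binomial counts, whose variance grows faster than the mean, to show how the exponent b is recovered from a log-log fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate count samples for groups with very different mean frequencies.
# Negative binomial with fixed dispersion r has variance = mu + mu**2 / r,
# so variance grows faster than the mean -- a Taylor-law-like pattern.
means, variances = [], []
for mu in [5, 20, 80, 320, 1280]:
    r = 10.0
    p = r / (r + mu)                     # numpy parameterization: mean = r*(1-p)/p = mu
    counts = rng.negative_binomial(r, p, size=5000)
    means.append(counts.mean())
    variances.append(counts.var())

# The slope of log(variance) versus log(mean) estimates the Taylor exponent b.
b, log_a = np.polyfit(np.log(means), np.log(variances), 1)
print(f"estimated Taylor exponent b = {b:.2f}")  # between 1 (Poisson-like) and 2
```

An exponent above 1 signals clustering or aggregation in the underlying process, which is why the law's appearance in vocabulary data is informative.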


Next steps for understanding language evolution

This study offers some interesting new insights into how different languages evolved over the past centuries, and into multiple similarities between them. Collectively, the statistical patterns the researchers uncovered could support a more rigorous understanding of human languages. Even more importantly, there is some evidence that other domains of human culture exhibit similar patterns.

The team's analyses allowed them to identify a stochastic mathematical process that generates sets of words with similar properties. This process could partly explain the mechanics via which human languages were created and how they developed over time.

"We constructed a surprisingly simple model that not only replicates the earlier results on the power-law distribution of word frequencies (i.e., manifesting themselves in a single dimension), but that also accounts for new empirical findings across many additional dimensions (specifically, in the 300-dimensional semantic space and in historical time)," said Verstyuk. "This was achieved by marrying a well-known cumulative‑advantage process with a rarely used von Mises–Fisher probability distribution."

In the future, this work could inspire further linguistic and anthropological studies that leverage NLP methods and other artificial intelligence (AI) tools, as well as formal mathematical modeling. "We remain excited about the possibilities of using AI-generated embeddings as a tool for fundamental research in understanding historical processes in cultural evolution—not just for building technological tools," added Skiena.

Written by Ingrid Fadelli, edited by Gaby Clark, and fact-checked and reviewed by Robert Egan.

More information: Xingzhi Guo et al, Statistical structure and the evolution of languages, Proceedings of the Royal Society B: Biological Sciences (2026). DOI: 10.1098/rspb.2025.2374

© 2026 Science X Network