When and where was proto-Indo-European?

A new study by Remco Bouckaert and colleagues attempts to place the origin of Indo-European languages by using an epidemiological population model, essentially plotting the "spread" of languages from a common source (Bouckaert et al. 2012).

To test these two hypotheses, we adapted and extended a Bayesian phylogeographic inference framework developed to investigate the origin of virus outbreaks from molecular sequence data (13, 14). We used this approach to analyze a data set of basic vocabulary terms and geographic range assignments for 103 ancient and contemporary Indo-European languages (15-17). Following previous work that applied Bayesian phylogenetic methods to linguistic data (13), we modeled language evolution as the gain and loss of cognates (homologous words) through time (18-20). We combined phylogenetic inference with a relaxed random walk (RRW) (14) model of continuous spatial diffusion along the branches of an unknown, yet estimable, phylogeny to jointly infer the Indo-European language phylogeny and the most probable geographic ranges at the root and internal nodes. This phylogeographic approach treats language location as a continuous vector (longitude and latitude) that evolves through time along the branches of a tree and seeks to infer ancestral locations at internal nodes on the tree while simultaneously accounting for uncertainty in the tree.

Diffusion models applied to spatial data tend to place the origin at the center of the present geographic distribution. That's just the simplest way to explain any geographic distribution under the diffusion model, which assumes that people act like random particles.
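To see why the center is the default answer, here is a minimal Python sketch. It assumes the simplest possible case: every language wandered independently from one origin for the same amount of time, and longitude and latitude can be treated as flat coordinates. Under those assumptions the best single guess for the origin is just the average of the present locations. The function name and the coordinates are invented for illustration; none of this comes from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as planar coordinates


def diffusion_origin_estimate(locations: List[Point]) -> Point:
    """Best-guess origin under independent, equal-duration random walks.

    Assumption: each location is the endpoint of its own random walk from a
    shared origin, all walks lasting equally long. The maximum-likelihood
    origin is then the arithmetic mean of the endpoints.
    """
    if not locations:
        raise ValueError("need at least one location")
    n = len(locations)
    lon = sum(p[0] for p in locations) / n
    lat = sum(p[1] for p in locations) / n
    return (lon, lat)


# Invented coordinates, for illustration only (not real language ranges).
tips = [(-5.0, 50.0), (12.0, 42.0), (78.0, 25.0), (35.0, 39.0)]
origin = diffusion_origin_estimate(tips)
```

No tree is involved here at all, which is exactly the limitation: every present-day location counts equally, no matter how closely related the languages are.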

By contrast, phylogeographic models tend to place the origin near the point with maximal clade distance. One ancient Anatolian language, Hittite, is attested in written records and, according to the phylogenetic analysis, is an outgroup to other, more recent Indo-European languages. Armenian, Greek, and Albanian also belong to relatively deep clades, and they geographically flank Anatolia in different directions.
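A toy calculation shows how much a deep-branching lineage can matter. The sketch below is not the paper's method: it swaps the Bayesian relaxed random walk for plain constant-rate Brownian motion on a fixed tree and returns only a point estimate for the root, using the standard inverse-branch-length weighting from Felsenstein's pruning approach. The `Node` class, the branch lengths, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as planar coordinates


@dataclass
class Node:
    branch_length: float               # time separating this node from its parent
    location: Optional[Point] = None   # known for tips only
    children: List["Node"] = field(default_factory=list)


def prune(node: Node) -> Tuple[Point, float]:
    """Return (location estimate, effective branch length) for a node.

    Assumes constant-rate Brownian motion and strictly positive branch
    lengths on every tip. Each child is weighted by the inverse of its
    effective branch length; the uncertainty of the resulting estimate is
    then added to the node's own branch.
    """
    if not node.children:
        if node.location is None:
            raise ValueError("tip without a location")
        return node.location, node.branch_length
    results = [prune(child) for child in node.children]
    weights = [1.0 / length for _, length in results]
    total = sum(weights)
    lon = sum(w * loc[0] for w, (loc, _) in zip(weights, results)) / total
    lat = sum(w * loc[1] for w, (loc, _) in zip(weights, results)) / total
    return (lon, lat), node.branch_length + 1.0 / total


def root_location(root: Node) -> Point:
    """Point estimate of the root location under the assumptions above."""
    return prune(root)[0]


# Hypothetical tree: one outgroup tip, plus a clade of two closely related tips.
outgroup = Node(branch_length=4.0, location=(0.0, 0.0))
ingroup = Node(branch_length=2.0, children=[
    Node(branch_length=2.0, location=(10.0, 0.0)),
    Node(branch_length=2.0, location=(12.0, 0.0)),
])
tree = Node(branch_length=0.0, children=[outgroup, ingroup])
estimate = root_location(tree)
```

In this made-up example the plain average of the three tip longitudes is about 7.3, but the tree-aware estimate lands near 6.3, closer to the outgroup. The two ingroup tips share most of their history, so together they count for little more than one observation. That is the sense in which a single deep lineage like Hittite can pull the inferred origin toward itself.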

So in this case, both diffusion and phylogenetic approaches point toward Anatolia as the most parsimonious origin.

Additionally, when the centers of diversification of the major Indo-European families are considered (e.g., Celtic, Romance, and Indo-Aryan), the geographic center of their distribution is Anatolia. Figure 2 of the paper illustrates the geographic ranges estimated as origins for the different clades within Indo-European:

Figure 2 from Bouckaert et al. 2012

Looking at the picture, Anatolia looks like ground zero for the viral spread of Indo-European languages.

OK, so the logic of the model pretty much inevitably leads to the conclusion. Anatolia is at the geographic center of the early Indo-European families, and is geographically central to the earliest branches of the language tree. But should we believe it? Languages, after all, don't spread exactly like viruses. And viruses don't spread by diffusion much of the time -- if they did, the movie Contagion would have had a much more boring plot.

I have no strong reason to be skeptical of the main conclusion, that the first Indo-European language may have originated in Anatolia. But I do note that it's strongly influenced by the evidence we happen to have about ancient languages. If we had a stronger record of the ancient languages of Central Asia, who knows what we might find? Tocharian, in the Tarim Basin of western China, was also a relatively deep clade in the Indo-European phylogeny, spoken within the last 2000 years. Could there have been others?

Also, Razib Khan points out some issues with the dates that the model attributes to branch points in the tree: "There are more things in prehistory than are dreamt of in our urheimat".

Bouckaert and colleagues set up an opposition between two hypotheses for the origin of Indo-European. The first derives the family from Anatolia more than 8000 years ago, possibly shortly after the origin of agriculture in the Fertile Crescent. This is more or less the Colin Renfrew model of Indo-European, which posits that the language family was able to spread due to the population expansion of agriculturalists. In this model, the first Neolithic peoples of Europe should have been Indo-European speakers.

The alternate hypothesis is that Indo-European originated on the steppes of Central Asia and Eastern Europe. This is more or less the Marija Birute Gimbutas model, where early steppe peoples spread westward carrying Indo-European with them. Some linguists and archaeologists have strongly favored this model because of the words reconstructed as part of the proto-Indo-European language, which include many technological and ecological elements that would have been familiar to steppe pastoralists of 4500-6000 years ago.

This seems like a clear dichotomy -- either Indo-European was early and spread with agriculture, or it was later and spread into regions already agricultural. In the first case, the language spread was mostly caused by demographic growth; in the second, other mechanisms such as elite dominance and conquest may have played more important roles. So it is interesting that this paper, after concluding that an early Anatolian origin was supported by the data, actually argues for a much softer, intermediate position:

Despite support for an Anatolian Indo-European origin, we think it unlikely that agriculture serves as the sole driver of language expansion on the continent. The five major Indo-European subfamilies -- Celtic, Germanic, Italic, Balto-Slavic, and Indo-Iranian -- all emerged as distinct lineages between 4000 and 6000 years ago (Fig. 2 and fig. S1), contemporaneous with a number of later cultural expansions evident in the archaeological record, including the Kurgan expansion (5-7). Our inferred tree also shows that within each subfamily, the languages we sampled began to diversify between 2000 and 4500 years ago, well after the agricultural expansion had run its course.

I think this is the most important passage of the paper. Reading between the lines, it says that the origination point for Indo-European languages simply may not address the archaeological record. What if Indo-European got its start in Anatolia 10,000 years ago, but many of the modern branches of Indo-European within Europe -- Celtic, Italic, Germanic -- all moved into Europe in several separate waves, starting less than 6000 years ago from the Pontic Steppe? We have pretty good genetic evidence now that the first farmers in Europe were not very much like recent Europeans. We need later migrations into Europe from elsewhere to explain the genetic record, and the archaeology (and later, history) provides plenty of reasons to think that later migrations were important.

So, there we are. Even though the present study supports an early, Anatolian origin for Indo-European, other evidence rejects the simple Colin Renfrew model. The present Indo-European families did not reach their present geographic distributions with the first agriculturalists. That means we need to look at more complex intermediate steps to explain how current and historic Indo-European languages got to their attested locations. The steppic model might well explain the spread of languages between 6000 and 4000 years ago, even if they shared earlier ancestors that fit the Anatolian model.