The Language Tree

This post was chosen as an Editor's Selection for ResearchBlogging.org.

In the 17th century, the Japanese shoguns decided that the only Westerners allowed to trade with the Japanese empire would be the Dutch. By doing so they not only opened up their country to sugar, cotton and silk, they also unintentionally exposed the Japanese language to Dutch words and terminology. Many Dutch naval terms and words related to trading were introduced to the Japanese language this way. Peaceful contact and trade are just one way in which languages are exposed to new words. Strife, conflict and conquest are the other side of the coin. The Normans and Vikings who conquered the British Isles, for example, also brought new words to the English language. Words, for one, don’t care about language barriers.

Something similar happens in nature, with the only difference that organisms trade genes instead of words. Bacteria are especially competent gene sharers. They have multiple ways to trade, steal, and absorb foreign genes. Many of them carry special machinery that helps them bring foreign genes into their cells. Some bacteria build bridges to other bacteria, over which genes are shuttled. Other genes get shared with the help of viruses. When a bacterial virus hops from one bacterium to another, it might accidentally take a gene from its former host along for the ride. In the past, thousands of genes have made the jump from one bacterial species to the next.

For over 200 years, the Japanese empire's only contact with the Western world was with the Dutch, the only Westerners they allowed to trade with Japan. The Dutch were not allowed to enter the mainland, and were confined to the artificial island of Dejima, in Nagasaki bay.

All this borrowing of genes has given biologists headaches. Ever since Charles Darwin drew his famous sketch of the tree of life, trees have been a staple image in evolutionary biology. It’s easy to see their appeal. Ancestral species give rise to their descendants, just as branches grow from the trunk of a tree. Neighbouring branches usually give you some clues about the characteristics of a certain creature. Once you know that humans belong to the branch of great apes, you could assume that humans have big, forward-facing eyes like their ape cousins. But when genes can be shared between branches as easily as files on a peer-to-peer network, such interpretations become harder to make. Two bacteria could be close cousins, while also acquiring vastly different sets of genes from more distant branches. One could have taken up genes that have transformed it into a deadly killer, while the other has become a harmless laboratory pet.

Linguists face similar problems if they use a tree to represent language evolution. English is often placed in the Germanic branch of the language tree, but this placement hides that only 26% of English vocabulary has a Germanic origin. In classical evolutionary trees, species and languages are isolated and separated from their cousins on the other branches. In reality they are fluid and dynamic systems, from which genes and words are free to leave and enter.

Tal Dagan and her colleagues realized that networks represent that reality better than trees. In the networks that they build, a link is placed between every two species that exchanged genes at some point in time. The skeleton of this network still consists of the familiar evolutionary tree. Together, the network and tree display the vertical and horizontal flows of genes during evolution. While Dagan and colleagues initially made these networks to map gene transfer in bacteria, they have recently applied the same method to see how the borrowing of words affects the evolution of language. This resulted in the wonderful image you see below, which shows the evolution of Indo-European languages. All the branches of the modern language families are placed at the sides. If you trace back all the outer branches to the centre, you travel back in time towards the common ancestor of all these languages: the Proto-Indo-European language. The languages are not only connected via the main branches projecting from the centre, but also by numerous blue and green lines. Every one of these coloured lines means that words have been shared between the languages that they connect.

The evolution of Indo-European languages. Every language has its own position on the Indo-European family tree, but words flow between different languages and language families.
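
One way to picture the data structure behind such a network is a tree plus a separate set of borrowing links. The sketch below is only an illustration: the tiny family tree is heavily simplified, the two borrowing links are placeholders rather than results from the paper, and the function names are made up for this example.

```python
# Vertical links: each language or family points to its parent in the tree.
# A heavily simplified toy tree, not the one used in the paper.
PARENT = {
    "Germanic": "Proto-Indo-European",
    "Romance": "Proto-Indo-European",
    "English": "Germanic",
    "Dutch": "Germanic",
    "French": "Romance",
    "Italian": "Romance",
}

# Horizontal links: unordered pairs of languages that exchanged words.
# Placeholder pairs for illustration only.
BORROWING = {
    frozenset({"English", "French"}),
    frozenset({"Dutch", "French"}),
}

def lineage(language):
    """Follow the vertical links from a language back to the root."""
    path = [language]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def family(language):
    """The branch directly below the root that a language sits on."""
    path = lineage(language)
    return path[-2] if len(path) > 1 else path[0]

def partners(language):
    """Languages joined to `language` by a horizontal link, in sorted order."""
    return sorted(other for pair in BORROWING if language in pair
                  for other in pair - {language})

def cross_family_links():
    """Horizontal links that connect two different branches of the tree."""
    return sorted(tuple(sorted(pair)) for pair in BORROWING
                  if len({family(lang) for lang in pair}) > 1)
```

In this toy network, `lineage("English")` walks the tree from English through Germanic to Proto-Indo-European, while `partners("French")` returns `["Dutch", "English"]`, the horizontal connections that a plain tree cannot show.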

This network only highlights the borrowing of words that belong to a very basic vocabulary. It includes words for body parts, pronouns and common verbs. You would expect that these words are fairly stable and rarely borrowed from other languages. Yet Dagan and her colleagues show that per language, 8% of these basic words have been borrowed from other languages. If even the core of our languages reveals the traces of lexical borrowing, no modern language can claim to be free from borrowed words.

But how could Dagan and her colleagues estimate the frequency of lexical borrowing throughout history? Sadly, the ancestral languages that hold the key to that question have been lost to time. Still, it is possible to make some claims about these languages. For instance, it is reasonable to assume that the number of different words in ancestral languages was the same as in current languages. Now, if every single word in every current language were the direct descendant of a corresponding word in the ancestral language, the ancestral vocabulary must have been huge and redundant. This is obviously unrealistic. By allowing some borrowing events between languages, the ancestral vocabulary no longer has to contain every single word in history, and can shrink a little. But with too much borrowing, the ancestral vocabulary becomes unrealistically small. The number of borrowing events that brings the ancestral vocabulary size in line with that of current languages should be the best guess. From this approximation Dagan and colleagues could estimate the minimum frequency of ‘horizontal word transfer’.
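
The reasoning in the previous paragraph can be turned into a small toy calculation. This is a rough sketch under stated assumptions, not the published analysis: the five languages A to E, the tree, and the word families are all invented, the rule that an extra origin is only placed where a word family is patchy is a simplification chosen for this example, and the comparison uses plain averages.

```python
# Toy reference tree: internal node -> its two children.
# The languages A-E and the word families below are invented for illustration.
CHILDREN = {"root": ("AB", "CDE"), "AB": ("A", "B"),
            "CDE": ("C", "DE"), "DE": ("D", "E")}
LEAVES = ["A", "B", "C", "D", "E"]

# Each word family ("cognate set") is recorded as the set of languages that have it.
COGNATES = [
    set("ABCDE"), set("ABCDE"),   # found everywhere
    set("AB"), set("DE"),         # confined to one branch
    set("ACD"), set("BCE"),       # patchy across branches
    set("AE"), set("BD"),         # patchy across branches
]

def leaves_under(node):
    """All present-day languages descending from `node` (a leaf returns itself)."""
    if node not in CHILDREN:
        return {node}
    left, right = CHILDREN[node]
    return leaves_under(left) | leaves_under(right)

def mrca(node, bearers):
    """Deepest node at or below `node` whose subtree contains every bearer."""
    for child in CHILDREN.get(node, ()):
        if bearers <= leaves_under(child):
            return mrca(child, bearers)
    return node

def origins(bearers, max_borrowings):
    """Start from a single origin, then split patchy origins, one borrowing each."""
    found = [mrca("root", bearers)]
    while len(found) - 1 < max_borrowings:
        patchy = [n for n in found
                  if n in CHILDREN and not leaves_under(n) <= bearers]
        if not patchy:
            break
        node = patchy[0]
        found.remove(node)
        for child in CHILDREN[node]:
            sub = bearers & leaves_under(child)
            if sub:
                found.append(mrca(child, sub))
    return found

def vocabulary_size(node, max_borrowings):
    """Count the word families present at `node` under a borrowing allowance."""
    here = leaves_under(node)
    return sum(
        any(here <= leaves_under(o) for o in origins(bearers, max_borrowings))
        and bool(bearers & here)
        for bearers in COGNATES
    )

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def best_allowance(candidates=(0, 1, 2)):
    """Pick the allowance whose mean ancestral vocabulary is closest to today's."""
    today = mean(vocabulary_size(leaf, 0) for leaf in LEAVES)
    def gap(allowance):
        return abs(mean(vocabulary_size(n, allowance) for n in CHILDREN) - today)
    return min(candidates, key=gap)
```

With this invented data, the present-day languages hold 4.8 word families on average. Allowing no borrowing inflates the average ancestral vocabulary to 6.5, one borrowing per word family brings it to 3.5, and two shrink it to 2.5, so `best_allowance()` returns 1 as the closest match.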

The abundant borrowing of words really is an inspiring testament to the flexibility and adaptability of human languages. While we may come from different places and speak in different tongues, we have never become truly separated from each other.

Do you want to read more on the evolution of language? Check out this post by Ed Yong on the evolution of the English language in a corpus of books digitized by Google.

Source for figure 1. Figure 2 from reference.

Nelson-Sathi S, List JM, Geisler H, Fangerau H, Gray RD, Martin W, & Dagan T (2010). Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings. Biological sciences / The Royal Society PMID: 21106583

3 comments to The Language Tree

  • lylebot

    Doesn’t this focus on vocabulary result in a model of language as little more than a collection of words? What about syntax and grammar? Changes in pronunciation when borrowing? Word frequency (there may be fewer Germanic-origin words in English, but those words are used a lot more frequently)? I can’t accept that vocabulary is the most important feature of a language.

    • Thanks for your comment lylebot! I absolutely agree with you that a language is more than a collection of words. That would be a gross oversimplification, just like saying organisms are collections of genes. Still, words and genes are the atomic units of information in languages and species, respectively. Many computational approaches have been developed for comparing them. We can still glean a lot of information and meaningful interpretations from these approaches (as this research shows).
      Of course any linguist would like to take syntax and grammar into account in such comparisons, just like an evolutionary biologist would want to compare developmental and biochemical pathways, gene expression and so on when he is comparing species. But I don’t think we’re technically there yet. Defining orthology for species and languages is already difficult enough as it is.
      The suggestion that you make for studying the differential use of loanwords sounds really interesting and plausible (maybe some research has already been done on this?). Such things become easier to study with the large-scale investigations that become possible with the release of large corpora, such as described here (which are still analyses on ‘just’ a word-to-word basis, without incorporating grammar or syntax!).

  • Actually, my understanding is that linguists often do use grammatical changes to resolve deeper branches in such phylogenies. It seems harder to build comprehensive statistical models of the way grammars change, though. With vocabulary, you can develop general models that are a pretty good fit over the whole tree, for the way that phonemes split/merge/mutate etc.

    And yes, lylebot, the “vocabulary” in this sense typically consists of phonetic transcriptions, so pronunciation would be incorporated. Word frequency I think is much harder to get good statistics on, especially for less well-documented languages. I have a link to some of the corpora (datasets) that are most commonly used for this sort of thing somewhere, but my mail is down – the Dagan paper probably has a link too.

    (NB I am not a linguist; I do algorithms/models for molecular evolution, though, and some of those models have found their way over to linguistic collaborators).

    Thanks for this post, Lucas: I routinely use slides of language trees in my undergrad class, and now I can use this one (and Dagan et al.’s work) to show how a simple tree is never the full story.