Fine-tuning our view of how language changes
DOI: 10.1063/PT.3.3191
Language is dynamic. New words enter, words that have outlived their usefulness leave, and enduring words are used more or less often over time. But is the pace of those shifts constant? Or, for example, did English change more rapidly from 1850 to 1900 than it did from 1900 to 1950? Information theory can quantitatively address such questions with a function called the Jensen–Shannon divergence, D. Start with a database of representative words that shows how frequently each word is used in various years. Feed in those usage probabilities, word by word, for the two years to be compared, and D yields a single number that measures how much the language shifted. But, as Martin Gerlach, then a PhD student at the Max Planck Institute for the Physics of Complex Systems, and his colleagues have noted, D’s overall language assessment is blind to a salient feature: As the figure shows, unusual words, such as “genetic,” rise or fall in frequency more rapidly than do common words, such as “and.” (Note that the blue and green curves are displaced for ease of visibility.) In part to address that deficiency, Gerlach and company generalized the Jensen–Shannon divergence to a one-parameter family Dα. When α is large, Dα is sensitive to changes in the commonly used sector of a language; small α homes in on less frequently used words. So what about the evolution of English over the period 1850–1950? With the help of the Google Ngram database, the researchers found that the rate of change depends on how you ask the question: The overall change as measured by D was greater in the period 1900–1950, but 1850–1900 saw faster shifts in common words. (M. Gerlach, F. Font-Clos, E. G. Altmann, Phys. Rev. X 6, 021009, 2016, doi:10.1103/PhysRevX.6.021009.)
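For readers who want to see the arithmetic, here is a minimal Python sketch of how such a divergence could be computed from two word-frequency lists. It rests on assumptions that are not spelled out above: that the one-parameter family is built by swapping the Shannon entropy for a generalized entropy of order α, H_α(p) = (Σ p_i^α − 1)/(1 − α), which reduces to the Shannon entropy at α = 1. The function names `entropy_alpha` and `js_divergence` are hypothetical, and the four-word frequencies are invented toy numbers, not values from the Google Ngram database.

```python
import math


def entropy_alpha(p, alpha=1.0):
    """Generalized entropy of order alpha (assumed form).

    At alpha = 1 this is the Shannon entropy, measured in nats.
    Zero-probability words are skipped, since they contribute nothing.
    """
    if abs(alpha - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (sum(x ** alpha for x in p if x > 0) - 1.0) / (1.0 - alpha)


def js_divergence(p, q, alpha=1.0):
    """Entropy of the averaged distribution minus the average entropy.

    p and q are usage probabilities over the same word list, each summing to 1.
    alpha = 1 gives the ordinary Jensen-Shannon divergence D.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return (entropy_alpha(m, alpha)
            - 0.5 * entropy_alpha(p, alpha)
            - 0.5 * entropy_alpha(q, alpha))


# Toy example (invented numbers): the common words hold steady
# while the two rare words trade places between the two years.
words = ["and", "the", "genetic", "telegraph"]
year_a = [0.50, 0.40, 0.02, 0.08]
year_b = [0.50, 0.40, 0.08, 0.02]

for alpha in (0.5, 1.0, 2.0):
    print(f"alpha = {alpha}: divergence = {js_divergence(year_a, year_b, alpha):.5f}")
```

Under these assumptions, identical distributions give zero, and two distributions with no words in common give ln 2 at α = 1, the largest value the ordinary divergence can take when measured in nats. Values at different α are on different scales, so the meaningful comparison is between two time periods at the same α, not between α values for one pair of years.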