Sine-wave speech recognition in Mandarin
DOI: 10.1063/PT.5.010162
Last October my friend Karl invited my wife and me to celebrate China’s mid-autumn festival with some of his Chinese friends. The venue was a Chinese restaurant in Arlington, Virginia. After we’d eaten eight delicious courses and drunk (or tentatively sampled) sorghum vodka from Taiwan, Karl challenged his Chinese friends to recite Chao Yuen Ren’s poem:
《施氏食獅史》
石室詩士施氏,嗜獅,誓食十獅。
氏時時適市視獅。
十時,適十獅適市。
是時,適施氏適市。
氏視是十獅,恃矢勢,使是十獅逝世。
氏拾是十獅屍,適石室。
石室濕,氏使侍拭石室。
石室拭,氏始試食是十獅。
食時,始識是十獅屍,實十石獅屍。
試釋是事。
Even if you can’t read Chinese, Chao’s poem looks as though it might be straightforward for native speakers to hear and understand. But Chao, who was a linguist, wrote the poem to demonstrate the futility of transliterating classical Chinese into the Roman alphabet. Every character in the poem is transliterated as “shi.”
Granted, Mandarin uses four tones that help distinguish otherwise identical-sounding words: high level, rising, falling and then rising, and high falling. But adding the corresponding diacritical marks does little to ensure comprehensibility.
As we discovered around the dining table, the poem is also incomprehensible, hilariously so, when recited in modern Mandarin.
Formants and fundamentals
Five months later, I came across an answer to my question in a paper
The starting point for Yin’s work is the idea, originated by Gunnar Fant in 1960, that speech can be passably reproduced by modulating the amplitudes of a small number of sine waves of certain fixed frequencies. Those frequencies do not include the overall pitch of a person’s voice, what you might call its fundamental frequency F0. Rather, the frequencies correspond to the strongest peaks that are present in the frequency spectrum when a person utters a given vowel or consonant.
Fant called those characteristic frequencies formants. Only two formants, f1 and f2, are needed to reproduce vowels. For example, the “oo” in “boot” can be represented with one sine wave with a frequency f1 of 320 Hz and a second, weaker sine wave with a frequency f2 of 800 Hz.
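To make the two-formant idea concrete, here is a minimal Python sketch that builds the "oo" example from two sine waves at 320 Hz and 800 Hz. It is an illustration under stated assumptions, not the method used in the paper: the sample rate, the half-second duration, the 2:1 amplitude ratio (the text says only that the second wave is weaker), and the function names `two_formant_vowel` and `write_wav` are all hypothetical choices.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # samples per second (assumed value)


def two_formant_vowel(f1_hz=320.0, f2_hz=800.0, a1=1.0, a2=0.5,
                      duration_s=0.5, sample_rate=SAMPLE_RATE):
    """Sum two sine waves at f1 and f2 and return samples scaled to [-1, 1].

    The defaults follow the "oo" example in the text; the amplitudes a1 and
    a2 are assumptions that only encode "the second wave is weaker".
    """
    n_samples = int(duration_s * sample_rate)
    peak = a1 + a2  # largest possible magnitude of the sum
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        value = (a1 * math.sin(2 * math.pi * f1_hz * t)
                 + a2 * math.sin(2 * math.pi * f2_hz * t))
        samples.append(value / peak)
    return samples


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write the samples to a mono 16-bit WAV file so they can be played."""
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
```

Calling `write_wav("oo.wav", two_formant_vowel())` would produce a steady tone pair. Note that nothing in the signal carries a fundamental frequency F0: only the two formant frequencies are present.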
Mandarin’s four tones are conveyed by modulating vocal pitch, F0. Because sine-wave speech dispenses with F0, Yin and his colleagues hypothesized that Mandarin speakers would have a tough time understanding sine-wave Mandarin.
To test the hypothesis, Yin and his colleagues asked 41 native speakers of Mandarin to listen to two sets of sine-wave speech. The first set consisted of 10 unconnected monosyllables pronounced with each of the four tones. The second set consisted of 20 short sentences.
Listeners to the unconnected monosyllables could not reliably identify the correct tone. On average, they got the tone right only 33% of the time, which is little better than the 25% they’d score if they just guessed. Listeners did much better with the short sentences. Some listeners understood all the sentences completely. The worst comprehension rate was 78%; the mean was 92%.
Yin speculates that the Mandarin speakers in his study, being familiar with the syntax and semantics of their native language, exploited contextual clues in the sentences to compensate for the lack of tonal information. That speculation is consistent with the result of a linguistic experiment that I’ve conducted: Native speakers of English can converse with each other even when they replace every vowel with “uh.” The conversation may sound odd, but it’s comprehensible. Try it yourself!
For some people learning Chinese, tones constitute an awkward, additional complication. At first glance, Yin and his colleagues’ work might therefore bring some relief. Tones, it seems, aren’t essential to comprehension, at least for sine-wave Mandarin. But the relief is illusory. Reaching the point where you can dispense with tones doubtless requires mastering the whole language, tones and all.