Monday, June 6, 2022

Information Per Minute v. Syllables Per Minute v. Information Per Syllable

The claim, which I don't consider definitive or authoritative but which is suggestive, is that some languages pack less information into each syllable than others, and that speakers of those languages talk faster, so that a similar amount of information is conveyed per minute.

To investigate this puzzle, researchers from the Université de Lyon recruited 59 male and female volunteers who were native speakers of one of seven common languages — English, French, German, Italian, Japanese, Mandarin and Spanish — and one not so common one: Vietnamese. All of them were instructed to read 20 different texts, including the one about the house cat and the locked door, into a recorder. All of the volunteers read all 20 passages in their native languages. Any silences that lasted longer than 150 milliseconds were edited out, but the recordings were left otherwise untouched.

The investigators next counted all of the syllables in each of the recordings and further analyzed how much meaning was packed into each of those syllables. A single-syllable word like bliss, for example, is rich with meaning — signifying not ordinary happiness but a particularly serene and rapturous kind. The single-syllable word to is less information-dense. And a single syllable like the short i sound, as in the word jubilee, has no independent meaning at all.

With this raw data in hand, the investigators crunched the numbers together to arrive at two critical values for each language: the average information density for each of its syllables and the average number of syllables spoken per second in ordinary speech. Vietnamese was used as a reference language for the other seven, with its syllables (which are considered by linguists to be very information-dense) given an arbitrary value of 1.

For all of the other languages, the researchers discovered, the more data-dense the average syllable was, the fewer of those syllables had to be spoken per second — and thus the slower the speech. English, with a high information density of .91, was spoken at an average rate of 6.19 syllables per second. Mandarin, which topped the density list at .94, was the spoken slowpoke at 5.18 syllables per second. Spanish, with a low-density .63, ripped along at a syllable-per-second velocity of 7.82. The true speed demon of the group, however, was Japanese, which edged past Spanish at 7.84, thanks to its low density of .49. Despite those differences, at the end of, say, a minute of speech, all of the languages would have conveyed more or less identical amounts of information.
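
As a rough check on that arithmetic, the sketch below multiplies the two figures quoted above for each of the four languages mentioned. Treating "information per second" as density times syllables per second is my own simplifying assumption, not necessarily how the researchers computed it, and the names `LANGUAGES` and `information_rate` are made up for illustration.

```python
# Figures as quoted in the passage above:
# (information density relative to Vietnamese = 1, syllables per second).
LANGUAGES = {
    "English": (0.91, 6.19),
    "Mandarin": (0.94, 5.18),
    "Spanish": (0.63, 7.82),
    "Japanese": (0.49, 7.84),
}


def information_rate(density, syllables_per_second):
    """Relative information per second, assumed to be density times syllable rate."""
    return density * syllables_per_second


rates = {
    name: information_rate(density, speed)
    for name, (density, speed) in LANGUAGES.items()
}

# Print the languages from highest to lowest relative information rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:9s} {rate:.2f}")
```

On these quoted numbers the products come out at roughly 5.6 for English, 4.9 for Spanish and Mandarin, and 3.8 for Japanese, so the rates are much closer together than the raw syllable speeds, though not literally identical.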

“A tradeoff is operating between a syllable-based average information density and the rate of transmission of syllables,” the researchers wrote. “A dense language will make use of fewer speech chunks than a sparser language for a given amount of semantic information.” In other words, your ears aren’t deceiving you: Spaniards really do sprint and Chinese really do stroll, but they will tell you the same story in the same span of time.

Language Log has a recent post on a similar theme, looking at large written corpuses of U.N. translations, an approach that has more methodological and technical issues. It concludes that information density relative to syllables or word units has a bell curve distribution but does vary somewhat from language to language.

