Ruminating on Language and Word Frequency
Some time ago I came across this in Lydia Davis’s Essays Two:
I read recently that in English, a mere 43 words account for half of all words in common use, and that just nine (and, be, have, it, of, the, to, will, you) account for a quarter in almost any sample of written English (my source is a very entertaining exploration of the English language, The Mother Tongue: English and How It Got That Way, by Bill Bryson).
Frankly, I found that rather astonishing, though obviously I know virtually nothing about linguistics. (People who care about verifying sources might want to know that the quotation comes from page 94 of the hardback copy of Davis’s book.) But it was interesting enough that I noted it, and then set it aside, thinking that some day I’d like to investigate that.
But this morning I came across this in a New York Times article under the headline Humpback whales sing the way humans speak:
The words in heaviest rotation [in writing and conversation] are short and mundane. And they follow a remarkable statistical rule, which is universal across human languages: The most common word, which in English is “the,” is used about twice as frequently as the second most common word (“of,” in English), three times as frequently as the third most common word (“and”), continuing in that pattern.
The article goes on to say that the songs of humpback whales have a similar distribution of linguistic elements. That’s interesting enough in itself, for many other reasons.
It’s worth noting that Davis’s list of common words is not the same as the list in the NYTimes article or the one on Wikipedia. I suppose the particular items on the list and their order depend on which corpus of words is analyzed. But it seems that the general point holds. Like any other regularity, this one is captured in a principle, in this case one known as Zipf’s law. The law has application in fields beyond linguistics. According to the Wikipedia page on Zipf’s law, “it has been found to apply to many other types of data studied in the physical and social sciences.”
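For anyone who wants to see the pattern in numbers, here is a minimal sketch of an idealized Zipf distribution, in which the word at rank r gets a share proportional to 1/r. The function names, the exponent of exactly 1, and the vocabulary size of 10,000 are all assumptions made for illustration; real corpora only approximate this shape, and the coverage figures depend heavily on the vocabulary size chosen.

```python
def zipf_shares(vocab_size, exponent=1.0):
    """Share of all word occurrences held by each rank (rank 1 first)
    under an idealized Zipf distribution: share(r) proportional to 1 / r**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def coverage(shares, top_n):
    """Fraction of all word occurrences covered by the top_n most common words."""
    return sum(shares[:top_n])


# Hypothetical vocabulary size, chosen only to make the example concrete.
shares = zipf_shares(10_000)

# Under the ideal law, rank 1 is twice as common as rank 2
# and three times as common as rank 3.
print(shares[0] / shares[1], shares[0] / shares[2])

# Coverage of the top 9 and top 43 words in this idealized model.
print(coverage(shares, 9), coverage(shares, 43))
```

Changing the vocabulary size or the exponent shifts the coverage numbers, which is one reason different corpora can yield different counts of words needed to reach a half.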
I suppose I have a practical interest in such things: over the past few years I’ve struggled to regain the somewhat limited ability I once had to speak and read German. One item on the (growing) list of disadvantages of living in the United States is that I have relatively few opportunities to practice speaking German, but I could at least find more time to read German texts. Obviously vocabulary development is a big part of my ability to do that, and I’ve often thought that paying attention to word frequency lists might be a way to focus my work on vocabulary. I’m morally certain that I already know the words at the top of the frequency lists for German (I’m even willing to bet that I know the top 43 words on that list; if that’s true, I’m halfway there!), but it would be interesting to see how deep my knowledge goes.
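A rough way to test the "halfway there" idea against an actual text is to count its words and see how many of the most frequent ones are needed to reach half of the running total. The sketch below is only that, a sketch: the function names are made up for this note, the tokenizer simply pulls out runs of letters (so umlauts and ß are kept, but inflected forms such as "der" and "dem" count as separate words), and the sample sentence is a placeholder for a real German text.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase words; a word here is any run of Unicode letters."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


def words_needed_for_coverage(counts, target=0.5):
    """Number of most-frequent words needed to cover `target` of all occurrences."""
    total = sum(counts.values())
    running = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target * total:
            return n
    return len(counts)


# Placeholder text; substitute the contents of any German book or article.
sample = "Der Hund und die Katze und der Vogel"
counts = word_counts(sample)
print(counts.most_common(3))
print(words_needed_for_coverage(counts, target=0.5))
```

Run on a long enough text, the second number gives a corpus-specific answer to the question of how many words make up half of what is on the page.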
Setting aside the practical interest in word frequency, I’m intrigued by the thought that so few words carry such weight in any language. Not intrigued enough (yet) to do a formal investigation of word frequency. But intrigued enough to put this note here.