This week the director of research at Google, a fellow named Peter Norvig, published on his blog the results of a fascinating deep-dive into the English language. If you're at all interested in letter frequency or word frequency, you owe it to yourself to at least skim Norvig's post. Essentially, the researcher who wrote the seminal work on the topic in the '60s suggested that Norvig, given the ocean of words Google can pore through, might want to update the original study, which measured letter frequency in 20,000 English words using the computing techniques of 50 years ago.
Norvig bit. He gathered and searched nearly 100,000 words with more than 743 billion mentions, and ranked both the frequency of words and of individual letters. ("The" crushed second-place "of" among words; "and," "to," "in," "a," "is" and "that" also all account for at least 1 percent of word usage.) For years, Norvig writes, typesetters knew the mnemonic ETAOIN SHRDLU represented the order of the most-frequent letters. But his results tilt that slightly. E, T, A, O, I, N and S are still the top seven. But he found that H and R swapped places, as did L and D, and that C comes before U. Not exactly earth-shattering, but for quantitative nerds who also like to read and talk, it's fun enough.
When I noticed this study, I happened to be in the throes of a terrible Scrabble binge, playing promiscuously with friends and strangers online. Once a fever catches you in Scrabble, it's hard to shake. You become misanthropic, savage. Opponents are not merely random Web-based jerks; they are mortal enemies, and the best in life is to crush them, to see them driven before you, and to hear the lamentation of their women. Of course I had to turn this beneficent Norvig survey to my dark advantage. So I did, and so too can you. Here goes.
In Scrabble the frequency of a letter is virtually synonymous with its worth in points. Logically, rarer letters are worth more points, as they're harder to play. Thus you assume the point values of letters rise in tandem, or nearly, with a letter's obscurity. So if you're a Scrabble player, this leaps out at you from Norvig's analysis: While ETAOINSR are all, appropriately, 1-point letters, the rest of Norvig's list doesn't align with Scrabble's point values. HLDCUMFPGWYBV are Norvig's next block; Scrabble's values for those letters go 4, 1, 2, 3, 1, 3, 4, 3, 2, 4, 4, 3, 4, clearly more erratic. The upper-value letters then return to harmony with Norvig's findings. Scrabble scores KXJQZ as 5, 8, 8, 10 and 10 points.
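If you'd rather check that comparison than take my word for it, here is a minimal Python sketch. It assumes the standard English tile values; the helper name `points_for` is made up for illustration. It walks Norvig's letter order in three blocks and looks up what Scrabble pays for each letter.

```python
# Standard English Scrabble point values.
SCRABBLE_POINTS = {
    "A": 1, "B": 3, "C": 3, "D": 2, "E": 1, "F": 4, "G": 2, "H": 4, "I": 1,
    "J": 8, "K": 5, "L": 1, "M": 3, "N": 1, "O": 1, "P": 3, "Q": 10, "R": 1,
    "S": 1, "T": 1, "U": 1, "V": 4, "W": 4, "X": 8, "Y": 4, "Z": 10,
}

def points_for(letters):
    """Return the Scrabble value of each letter, in the order given."""
    return [SCRABBLE_POINTS[ch] for ch in letters.upper()]

# Norvig's frequency order, most common first, split into the three blocks.
common = points_for("ETAOINSR")
middle = points_for("HLDCUMFPGWYBV")
rare = points_for("KXJQZ")

print(common)  # [1, 1, 1, 1, 1, 1, 1, 1]
print(middle)  # [4, 1, 2, 3, 1, 3, 4, 3, 2, 4, 4, 3, 4]
print(rare)    # [5, 8, 8, 10, 10]

# A block is "in harmony" if its points never drop as letters get rarer.
print(middle == sorted(middle))  # False
print(rare == sorted(rare))      # True
```

The middle block is the only one where the points bounce around instead of climbing.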
This potentially opens a whole new system of weighing the value of your letters. The values of Scrabble letters haven't changed in the 75 years since its inventor, Alfred Butts, gleaned what he thought were fair points by teasing out letter frequencies from reading newspapers. But H, which appeared as 5.1 percent of the letters used in Norvig's survey, is worth 4 points in Scrabble, quadruple what the game assigns to the R (6.3 percent) and the L (4.1 percent) even though they're all used with similar frequency. And U, which is worth a single point, was 2.7 percent of the uses—about one-fifth of E, at 12.5 percent, but worth the same score. This confirms what every Scrabble player intuitively knows: unless you need it to unload a Q, your U is a bore and a dullard and should be shunned.
Yet it's not quite so simple. Norvig's letter frequency survey combed all the words, even repeats. T and H and E get a major boost in letter frequency for appearing in a word that accounts for 7.1 percent of all usage, but you're not going to play "THE" one out of every 14 turns in Scrabble. What we need is to run a Norvig-style count of letter frequency not as books use words but as Scrabble uses words: as entirely grammar-independent units. For an apples-to-apples comparison, we need a letter-frequency survey of every individual word in the Scrabble dictionary.
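To see how different the two kinds of counting are, here is a toy sketch in Python. The word counts are invented for illustration (they are not Norvig's figures), and `letter_shares` is a hypothetical helper name. The same function counts letters once with each word weighted by how often it is used, and once with every word counted a single time, the way a dictionary lists it.

```python
from collections import Counter

def letter_shares(word_counts):
    """Return each letter's share of all letters, weighting each word by its count."""
    totals = Counter()
    for word, count in word_counts.items():
        for ch in word.upper():
            if ch.isalpha():
                totals[ch] += count
    grand_total = sum(totals.values())
    return {ch: n / grand_total for ch, n in totals.items()}

# Made-up usage counts: "the" shows up far more often than the others.
usage = {"the": 14, "quiz": 1, "jo": 1}

# As books use words: every appearance of a word counts.
as_books_use_words = letter_shares(usage)

# As Scrabble uses words: each legal word counts exactly once.
as_scrabble_uses_words = letter_shares({word: 1 for word in usage})

print(round(as_books_use_words["T"], 3))
print(round(as_scrabble_uses_words["T"], 3))
```

In the toy data, T's share drops sharply once "the" stops being counted over and over, which is the distortion we wanted to strip out.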
I enlisted a crafty software developer friend of mine named Kyle Rimkus, who in short order managed to pry a Scrabble word list out of the Web and then ground out a letter-frequency count within that sample. Turns out you can build every legal Scrabble word using 1,584,476 individual letters. Kyle then weighted all those 1.58 million letters within the 98-letter game (excluding the two blank tiles, in other words) and assessed how overvalued or undervalued each letter is compared with its existing points.
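I won't pretend to reproduce Kyle's exact arithmetic here. What follows is only a rough Python sketch of one way such a weighting could work, and every modeling choice in it is my assumption: a letter's raw worth is taken as the inverse of its frequency in the word list, and the results are rescaled so the tiles in play keep the same combined value they have today. The function name `fair_points` and the six-word sample list are hypothetical stand-ins; a real run would feed in the full Scrabble word list.

```python
from collections import Counter

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Standard English tile counts (98 lettered tiles; the two blanks are excluded).
TILES = dict(zip(LETTERS, [9, 2, 2, 4, 12, 2, 3, 2, 9, 1, 1, 4, 2,
                           6, 8, 2, 1, 6, 4, 6, 4, 2, 2, 1, 2, 1]))
# Standard English point values.
POINTS = dict(zip(LETTERS, [1, 3, 3, 2, 1, 4, 2, 4, 1, 8, 5, 1, 3,
                            1, 1, 3, 10, 1, 1, 1, 1, 4, 4, 8, 4, 10]))

def fair_points(words):
    """Assumed model: value is proportional to 1/frequency, rescaled to today's total."""
    counts = Counter(ch for word in words for ch in word.upper() if ch in TILES)
    total = sum(counts.values())
    # Rarer letters get a larger raw worth. Letters absent from the list are skipped.
    raw = {ch: total / n for ch, n in counts.items()}
    # Rescale so the tiles for these letters keep their current combined value.
    current_value = sum(TILES[ch] * POINTS[ch] for ch in raw)
    scale = current_value / sum(TILES[ch] * raw[ch] for ch in raw)
    return {ch: raw[ch] * scale for ch in raw}

# Tiny stand-in word list, for illustration only.
sample_words = ["quiz", "jo", "qi", "the", "raze", "amaze"]
ideal = fair_points(sample_words)

# Assumed sign convention: negative means the tile pays more than it "should".
difference = {ch: ideal[ch] - POINTS[ch] for ch in ideal}

for ch in sorted(ideal):
    print(ch, POINTS[ch], round(ideal[ch], 1), round(difference[ch], 1))
```

With a six-word list the numbers are meaningless, of course. The point is only the shape of the calculation: count, invert, rescale, compare.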
We were hoping to figure, in short, what point values a mathematically fair game of Scrabble would assign to its letters. Here are the results:
|Letter|# Of Tiles In Scrabble|Frequency In All Scrabble Words|Points In Scrabble|Ideal Points In Scrabble|Difference|
A couple of things stand out about this list. The first is, the current system is overwhelmingly spot-on, or pretty close. Nearly half the alphabet is already at its ideal points total, including that big, swinging X. The next is, Scrabble weights its mid-range letters too heavily (B, C, F, H, K, M, P, V, and Y); those letters are proportionally more valuable than they are difficult to play. But in a sense, that stands to reason, if only in that a perfectly fair Scrabble, by points distribution, would have a 14-point J and a 14-point Q. If that were indeed the case, letters that valuable could throw the game completely off its axis, especially given all the multipliers the game incorporates. (Tee off with a double letter score and a triple word score and a single Q or J would be worth 84 points.)
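The 84-point figure is just multiplication, sketched here in Python. The function name `tile_contribution` is made up, and it only models what one tile adds to the word score when it sits on a letter multiplier inside a word that also covers a word multiplier.

```python
def tile_contribution(letter_points, letter_multiplier=1, word_multiplier=1):
    """Points one tile adds to the word score after both multipliers apply."""
    return letter_points * letter_multiplier * word_multiplier

# A hypothetical 14-point J or Q on a double letter score, in a triple-word play.
print(tile_contribution(14, letter_multiplier=2, word_multiplier=3))  # 84

# The same placement with today's 10-point Q and 8-point J.
print(tile_contribution(10, letter_multiplier=2, word_multiplier=3))  # 60
print(tile_contribution(8, letter_multiplier=2, word_multiplier=3))   # 48
```

So the "fair" J would nearly double what the current J earns in the same spot, from a single tile.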
If we cap a letter's value at its present ceiling of 10 in our ideal Scrabble, then the J is rightly a 9- or 10-pointer instead of its current 8. The Z should be worth 6 or 7. Those 4-point Y's and H's should be worth no more than 3 apiece. Make the single-point L's and U's worth 2 points apiece.
But since Scrabble isn't going to issue a game with those point totals any day soon, you should incorporate this knowledge into your overall approach. Get rid of your J and your Q as quickly as possible, because they're just damn hard to play and will clog your rack. The Q, in fact, is the worst offender. Kyle found that in words with between one and five letters, which are the words you're most likely to play, the fair value of a Q should be 18 points. The J gets a bit easier to play, and ideally would be valued at 7 points among short words. God bless "jo" and "qi."
Also, you should relish the letters above that carry negative values in the "difference" column, especially the H, the Y and the Z, which relative to the rest of the tiles pay off handsomely for how easy they are to deploy. The language contains more Z-words than Alfred Butts apparently realized, and you can exploit this knowledge. Amaze your friends, raze your enemies.