Like just about everyone else, I’ve gotten sucked into playing Wordle. I typically don’t go in for games with such straightforward mechanics, but for a couple of reasons it hits a sweet spot for me. First, it is time-boxed: you can only play one round a day and each round is relatively short, so there’s no risk of it sucking up big chunks of time. Second, on its surface it is a word game (and I’m a bit of a word nerd), but underneath it is a game about probability.
Why do I say it’s a game of probability? Because there are really only two strategies for choosing the first word, and both are based on the frequency of letters in the word list:
- Select the word that (on average) maximizes the overlap with the set of letters in the answer word (i.e. the number of yellow tiles)
- Select the word that (on average) maximizes the number of correctly placed letters (i.e. the number of green tiles)
I tend to prefer the first strategy, but I do not have any concrete analysis to show that it is better than the second. Whichever strategy you choose, you need to decide what your first word will be, and you can narrow it down to a relatively small number of choices using some simple analysis of letter frequencies. For fun, I will do this using just some unix command line tools and a bit of R.
## The word list
First we need a list of words. While Wordle does not publish its word list, we can probably get a good approximation of the letter frequencies in that list using any large list of 5-letter words. Fortunately, linux-based operating systems provide a list of words ready for use:
Since this file contains words of varying lengths, we need to filter it to keep only the 5-letter words. Also, since it appears the list contains words with a mix of upper- and lower-case, and since Wordle is not case-sensitive, we need to normalize the letter cases. This can be accomplished with `awk` and its `tolower` function. Finally, even though this list is probably already sorted and does not contain duplicates, it’s still good practice to make sure. We can do that with `sort -u`.
Here is the full pipeline. I’ve piped the result to head so we can just see an example of what we’re working with:
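A sketch of that pipeline, with a tiny inline list standing in for `/usr/share/dict/words` so the example is self-contained:

```shell
# Keep 5-letter words, lowercase them, then sort and de-duplicate.
printf 'Apple\nbanana\nraise\nArise\nraise\n' |
  awk 'length($0) == 5 { print tolower($0) }' |  # filter to 5 letters, lowercase
  sort -u |                                      # sort and drop duplicates
  head                                           # peek at the result
```

In practice the first stage would read the system dictionary instead of the `printf` sample.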
## Strategy 1: Maximize Yellow Tiles
Maximizing yellow tiles simply means choosing the word whose letters have the highest frequency in the dictionary, ideally without repeating any letters (since a repeated letter doesn’t give us any additional information). To get a frequency table, we can use `uniq -c`, which collapses repeated lines into a single line and a count. But first we need to transform our list of words into a list of characters. For that, we can use `fold -w1`, which wraps each line to a length of 1. After folding, we need to sort, then count, then sort again to get our final sorted frequency table.
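The steps above can be sketched like this (inline sample words standing in for the cleaned dictionary):

```shell
# One character per line, grouped, counted, then ordered by count.
printf 'raise\narose\nstale\n' |
  fold -w1 |   # one character per line
  sort |       # group identical characters together
  uniq -c |    # collapse each run into "count char"
  sort -rn     # most frequent characters first
```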
It turns out that `a`, `e`, `r`, `i`, `o` are the most frequent characters in 5-letter words in the dictionary, so our ideal first word would contain these characters. We can use `grep` to check whether there are any such words.
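The check can be done by chaining one `grep` per required letter. The inline input here is a toy stand-in for the word list (`aerio` is just a placeholder string containing all five letters, not a real word):

```shell
# Keep only lines containing every one of a, e, r, i, o.
printf 'aerio\nratio\nirate\noiler\n' |
  grep a | grep e | grep r | grep i | grep o
```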
There are a few, but unfortunately they all have at least one repeated character. Since `s` is the sixth most frequent character, I replaced `o` with `s` and, after removing any words with duplicated letters, ended up with:
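The duplicate-letter filter can be sketched with a `grep` backreference. The inline list is illustrative, with `sassy` included just to show a repeat being removed:

```shell
# \(.\).*\1 matches "some character, anything, that same character again",
# so grep -v drops every word containing a repeated letter.
printf 'arise\nraise\nserai\nsassy\n' | grep -v '\(.\).*\1'
```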
After playing Wordle for a few weeks I have not seen any uncommon words, so my assumption is that the word list has been filtered to contain only words that most people will know. I think `serai` would be much less likely than the other two, so I would choose `raise` as my first word to maximize my chance of getting a hit on my first guess.
## Strategy 2: Maximize Green Tiles
To maximize green tiles, we are not just interested in overall letter frequency, but specifically positional letter frequency, i.e. which letter is most likely in each of the 5 positions. Analyzing positional letter frequency is (I think) more than we can do with linux command line tools, so instead I’ll load the data into R and do my analysis there. We can still stick with the command line thanks to the `Rscript` command. I’ll be using the pipe operator (`%>%`), which requires installing the `magrittr` package.
The first thing we have to do is read the word list from `stdin`. There are many ways to do this, but I’ll use base R wherever possible to minimize the number of libraries I need to install.
From here, we want to get to a matrix of words (rows) by letters (columns), i.e. an N×5 matrix. For that, we can use `strsplit` followed by `unlist` to get an array of characters, then fold them into a matrix using `matrix`.
As a first approach to analyzing positional frequency, we can simply count the frequency of each letter in each column using `table` (for brevity I’ll just focus on the R commands from now on):
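As a rough sanity check, the per-column counts can also be approximated on the command line with `cut` (the R `table` approach is far more convenient; the inline sample list below is just for illustration):

```shell
# For each of the 5 positions, count the letters appearing there
# and print the most common one.
words='slate
stale
stone
aorta
amber'
for i in 1 2 3 4 5; do
  printf 'position %d: ' "$i"
  echo "$words" | cut -c"$i" | sort | uniq -c | sort -rn | head -1
done
```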
Here we see that `a` is most common in positions 1 and 3, `e` is most common at position 2, `s` is most common at position 4, and `t` is most common at position 5, but unfortunately `aeast` is not a word.
Another approach is to look at how over-represented each letter is in each position relative to the other letters, e.g. by normalizing by the median frequency:
This tells us, for example, that `a` in position 1 is more important than it is in position 3. With this information, we can create a regular expression that we can use to `grep` the word list for candidates where positional letter frequency is maximized. For example, here I’m including every letter with a score > 2.25.
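Such a positional regex has one character class per position. The classes below are hypothetical stand-ins, not the actual set of letters scoring above 2.25, and the inline input stands in for the word list:

```shell
# Anchor the match and give each of the 5 positions its own class of
# acceptable letters; only words fitting every position survive.
printf 'beast\ncanoe\ncease\nslate\n' | grep -E '^[bc][ae][an][os][te]$'
```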
`beast`, `canoe`, and `cease` are fairly common, and of these `beast` has the highest score (the sum of the value of each letter in its column).
## What to do next?
Assuming this wasn’t the one day in every ~30 years where your first guess was correct (assuming a word list of ~10,000 words), you need to decide between two strategies for choosing your second guess. Unlike the first guess, which can be the same every time, the second guess depends on how many yellow and green tiles your first guess turned up. If you think you have enough tiles, you can start trying to guess the correct word. If not, you are better off guessing a second word designed to maximize the number of yellow or green tiles, meaning you can follow the same general approach you used for choosing your first word.
For example, I use `arise` as my first word. Excluding those letters, the next most frequent letters are:

Excluding words with repeated letters and less common words, `clout`, `count`, and `uncut` seem like the best bets. I use `count` for my second word when `arise` does not turn up enough tiles.
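The "next most frequent letters" step can be reproduced by stripping the first guess's letters before counting (inline sample list standing in for the dictionary):

```shell
# Count letter frequencies after removing the letters of "arise".
printf 'count\nclout\nuncut\nstale\n' |
  fold -w1 |           # one character per line
  grep -v '[arise]' |  # drop letters the first guess already covers
  sort | uniq -c | sort -rn
```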
After two guesses, you will have at least one vowel. If you still don’t have enough tiles to make a guess at the word, then you can use your knowledge of letter frequencies to come up with a third guess that will maximize the number of yellow or green tiles. For example, if I just have an
`moldy` is a good third guess.
The question of how many tiles you need before trying to guess the word is an interesting one, but I do not have a good intuition for how to answer it objectively. My personal rule of thumb is that if I have at least 3 tiles, I will start tailoring my guesses to uncover the correct positions of the yellow tiles. For example, if I guessed
`r`, `t`, and `e` are yellow, then my next guess might be `tires` - new positions for each of the yellow tiles, and two new letters with high frequency.