I recently stumbled across a nice word game called “Doublets”, invented by Lewis Carroll. It goes like this. Player 1 writes two words with the same number of letters, say “cone” and “hoof”. Player 2 must then find a chain of words, each of them differing by one letter, which starts at the first word and ends with the second. In our case, this is not so hard:

After Player 2 finds the chain, the roles are reversed.

As a student of mathematics, I am also obligated to write: “The generalization to players is easy and is left as an exercise for the reader”.

Who wins? Well, in life there are no winners, in the long run. But if you insist, and if there are more than two players, you can give them a limited amount of time to find a chain, and the winner(s) is the person who finds the shortest one. This is not the best victory condition, though, because as we’ll see, it can be very easily automatized.

But first, a preliminary: **What is a word, anyway?** Anyone who has played scrabble has often wrestled with this question, often till bloodshed. Is “LOL” a word? Are “color” and “colour” different words? Is “hot-dog” a word, or a phrase? Is “taekwondo” – approximately translated to “way of the kick and fist” from Korean – a word in English? If so, are each individual kicks – “yop chagi”, “ti chagi”, … – also words? If “carbon” is word, what about “sucrose”? What about “dichlorodiphenyltrichloroethane”? What about words and phrases in Latin?

These are wonderful questions, and the most wonderful thing about them is that they have no definitive answer, and so forever will scrabblers quibble. For our purposes, a word is anything which appears on this list, which is the official European English-language scrabble wordarium. Interestingly, “wordarium” has not yet found its place there.

Onwards to the automization! Doublets is a fun game to play with your friends, but not-so-fun to play against the computer. All the computer has to do is load up the words, and for each word-length, create a graph. The vertices will be the words, and two words will be connected if they differ by just one letter. Here is the graph for two-lettered words, for example:

The more letters, the larger the mess. Here are the main hairballs for three- and four-lettered words (no labels this time – there are thousands of vertices):

Finding out a shortest chain between two words is then equivalent to computing the shortest path between two vertices in the graph, a task that computer scientists have already known how to do for centuries, if not millennia.

Thus the computer will beat you every time. It will even beat me! And it will be ruthless. It won’t take prisoners. It will be optimal and efficient. It will use words you did not know existed, because it knows all 276,663 words in the list and how could you have known that “xebec” was an actual thing? And when it beats you, it will not flinch. It will not care. It’s a computer.

That’s why, as long as you’re human, it might perhaps be better to take the artists’ view of life: The player who offered the two initial words must choose, from the chains that the other players came up with, her favorite solution. On what criteria? Well, it might be elegant; it might be touching; it might by ironic; it might tell a story, moving in a way that only a one-sided chain of distance-1 words can be moving. But most importantly, it will be hard for a computer program to generate.

For the next couple of months, anyway.

But if we’ve already loaded up all the words into a graph, the least we can do is use the computer to learn something about language, and analyse the connectivity structure. There are a myriad of questions you can ask (all good projects for prospective computational linguistics students). Here is a small sample:

**Connected components**: Sometimes it’s impossible to get from one word to another. The most common case are isolated words – words which do not connect to any other, like “zythum”. But it can also happen that the graph is divided into connected components of larger sizes. How do the connected components distribute? What is the average size of a connected component (in the sense that if you pick a word at random, what is the size of the component it belongs to)?

The short answer is that as the word-length increases, it becomes harder and harder for words to stay in touch. I have some thoughts on why this is (a simple reason might be because words are more spread out in the exponentially large space of all possible letter combinations), but they’re not based on any actual analysis. Understanding language is hard work! Anyway, here are graphs showing the size and fraction of the average connected component:

These graphs (and some other calculations not shown here) suggest that the game is best to play for words of length 5 or 6: In these cases, most of the graph is made up of a single connected component (so you can almost always get from one word to another), and these components are very large (so you have a lot of raw material to work with).

**Distances**: What is the average distance between two words? What pairs of words are particularly hard to connect? Here is the graph of the average distances:

The reason that for the giant dip in large word-lengths is that the graph becomes highly disconnected, and the average you see above ignores pairs which cannot be reached.

You will probably also be very excited to know that the winners of the dubious “longest-path award” are the seven-lettered pair “gainest” and “injects”, with a whopping 57 letter switches. I’ll let you find out the optimal path from one to the other on your own. Meanwhile, here is a largest path (44 transitions) for six-lettered words:

The rude nude is emboiling indeed. **Challenge**: Write a short story using all these words.

**Clusters**: Already in the graphs for three- and four-letter words you can see that there are “clusters”, regions of the graph which are more heavily connected to themselves than to other parts of the graph. How do these clusters behave? Do they have any relation to prefixes and suffixes such as “un-” and “-ing”? An excellent question, I say! And an exercise for the reader, too.

For the interested, here is the highly-non-optimized, quickly-written, uncommented, undocumented, inefficient, impudent and generally shameful code I used to generate the statistics. It might take an hour or two to run.

import networkx as nx import matplotlib.pyplot as plt import numpy as np def hamming_distance(s1, s2): """ Assumes strings are of the same length """ return sum(c1 != c2 for (c1,c2) in zip(s1,s2)) def longest_path(G): the_longest_path = [] for (source, paths) in nx.all_pairs_shortest_path(G): the_longest_path = max(the_longest_path, max(paths.values(),key=len),key=len) return the_longest_path words_filename = "sowpods.txt" with open(words_filename) as f: all_words = f.read().split() lengths = [len(w) for w in all_words] words_by_length = {} for n in range(min(lengths), max(lengths)+1): words_of_length_n = [w for w in all_words if len(w) == n] words_by_length[n] = words_of_length_n graphs = [] connected_component_sizes = [] fraction_of_connected_components = [] average_size_of_connected_components = [] average_fraction_of_connected_components = [] number_of_connected_components = [] average_degrees = [] average_path_length = [] maximum_path_length = [] for word_length in range(min(lengths),max(lengths)): print(word_length) G = nx.Graph() for w1 in words_by_length[word_length]: G.add_node(w1) for w2 in words_by_length[word_length]: if hamming_distance(w1, w2) == 1: G.add_edge(w1, w2) graphs.append(G) cc_fractions = [] cc_sizes = [] for cc in nx.connected_components(G): cc_fractions.append(len(cc) / len(G)) cc_sizes.append(len(cc)) connected_component_sizes.append(cc_sizes) fraction_of_connected_components.append(cc_fractions) temp = np.array(cc_fractions) average_fraction_of_connected_components.append(sum(temp*temp)) average_size_of_connected_components.append(sum(temp*np.array(cc_sizes))) number_of_connected_components.append(len(cc_sizes)) average_degrees.append(np.mean(list(dict(G.degree).values()))) max_distance = 0 num_of_dists = 0 total_dist = 0 for (source, dests) in nx.all_pairs_shortest_path_length(G): max_distance = max(max_distance, max(list(dests.values()))) total_dist += sum(list(dests.values())) num_of_dists += len(dests) if num_of_dists != 0: average_path_length.append(total_dist / num_of_dists) maximum_path_length.append(max_distance) else: average_path_length.append(0) maximum_path_length.append(0) ## Just an example: Draw the graph of two-letter words. # nx.draw_spring(graphs[1], with_labels=True, font_weight='bold') # plt.show()