Not long ago, I started learning Lisp, the dreaded programming language notorious for its (multiple (parentheses)). Specifically, Common Lisp, which is one of the major Lisp dialects (yes, lispers consider Lisp a language by all means). There are several good books on the topic, amongst them Peter Seibel’s Practical Common Lisp. In this book he suggests to use the “emacs” editor and an emacs Lisp mode called SLIME, which lets the user run Lisp directly in the editor, as well as offering other IDE features. I followed his advice, seeing it as an opportunity to finally discover what emacs is all about (it’s both amazingly wonderful and horribly despicable at the same time. Those of you who program should definitely try it, although the learning curve is a tad high).
SLIME has some nifty features, but the really interesting thing about this editor, is that every time you open the Lisp interpreter, it gives you some words of encouragement.
Why, yes, I suppose it could indeed be the start of a beautiful program! This is actually a very nice way to start your programming session, much better than the “startup tips” that many integrated development environments use.
But the really interesting thing about these words of encouragement, is that every time you open an interpreter, a new phrase is displayed. Opening a second instance, I now get:
Of course, it’s obvious that the sentences are built into the program – they are not generated with a random algorithm (say, via a Markovian process) but are just displayed at random. The most probable implementation is that there is a collection of such phrases, and at startup the program chooses one at random and prints it on the screen.
An interesting question to ask is: without checking the source code, how can I know how many such phrases there are in total? An experimental approach would be to open the program again and again and again, each time writing down the encouragement displayed. If there is a fixed number of phrases, then at some point no new phrases will be introduced. But how do I know when to stop looking? When do I finally decide that I have seen all the phrases that exist in the “words-of-encouragement-repository”? What if, by sheer chance, there was one poor sentence which was never selected? If I stop early, I will leave that one out.
Luckily, some basic probability can help us here and give us an answer which is accurate to within whatever error threshold we see fit.
Let’s start with a clean state. We’ll imagine a case in which we have no clue as to how many different sentences there are. There could be only one, there could be two, there could be 200. Call this number N.
The first time we run SLIME, we would get a sentence which we have never before seen, since we haven’t seen any sentences yet. The repository size is at least 1.
The second time we run SLIME, we might get the same sentence, or we might get a new one. The more sentences there are in the repository, the higher the chances of getting a new one: if there are N sentences overall, then there is only a chance of getting the same first sentence.
If we got a new one, great! we know that the repository size is at least 2. But if not, we can ask for a third sentence. If that one is the same as the first, we still don’t know if there is more than one sentence; but, assuming there are N overall choices, the chances of getting the same sentence twice in a row is . So we take a fourth one, and if THAT is the same, well, the chances of that happening are .
We see that if we poll again and again and keep getting the same sentence, then it isn’t very likely that N is very large, because high powers of will go to zero very quickly for large N. In fact, if we were to keep getting only the same original sentence, we would very soon be convinced that there is only one sentence in the repository.
Let us generalize. Suppose that we have asked for sentences several times already, and we have already seen k different ones. We would like to know what the chances are that “this is it, there are only k sentences in the whole words-of-encouragement-repository, and no more”.
If there are indeed only k sentences, then no matter how many more times we ask, we will not get any new encouragement. If, however, the actual repository size is larger than k, then eventually, by chance, we might run across one that we haven’t seen before.
So our strategy is this: we keep asking for more sentences; if, after a long time, we don’t get any new ones, we can say that there are probably only k sentences. But, what is “long time”, and what is “probably”? These can be quantified exactly.
We have already seen k different sentences. That means that the repository size is at least k. But what if it were exactly k+1? That means that every time we ask for a sentence, there is a chance of getting that extra one which we haven’t seen. This means that there is a chance of getting one that we have already seen.
So if there are k+1, and not k, different phrases, then the chance of not seeing a new encouragement phrase s times in a row is . But is a fraction smaller than 1. This means that if we keep asking for more sentences, but only get ones we already saw, the probability that there are actually k+1 different items in the repository drops lower and lower.
We can now define a threshold which says how confident we are that the repository indeed contains only k items. If we choose, for example, 95%, then we ask for more and more sentences until . Then, after s polls, we can declare “There is a 95% chance that there are only k, and not k+1, different phrases!” This makes clear what we mean by “long time” and “probably” – we choose our level of confidence, and wait for a streak until that confidence is reached. This might not be exactly the confidence that there are k sentences (and not any other number), but it suites our cause well.
Of course, it might be that during our search, a new item will indeed come up. Wonderful! We increased our knowledge about the repository. In this case, we need to update our k value, and reset the s count.
So I wrote a program (in Lisp, of course!) which does the following. It starts in a clean state, not knowing anything about the number of sentences in the words-of-encouragement-repository. It is then fed the yielded sentences, one by one. At all times, if the number of different sentences it has seen is k, it looks at the current streak s of non-new sentences it received, and asks itself how likely this streak is, assuming that the actual repository size is k+1. When the probability is less than a given threshold, it stops, proclaiming that within the requested confidence, there are k different sentences in the repository.
Note that if the repository size is finite, then this algorithm is bound to terminate – each new poll either increases the known repository size, or decreases the chance of the repository being bigger. Further, we only check k+1, and not all possible numbers larger than k, because drops to zero faster for larger z’s, so if we are confident that the repository is smaller than k+1, we are even more confident that it’s smaller than k+z.
The result? I gave the program a threshold of 0.01, which means 99% confidence. It terminated after processing 55 sentences, stating that there were 8 different phrases with a 0.00899 chance of error.
How does that compare with SLIME’s source code?
(defvar slime-words-of-encouragement '("Let the hacking commence!" ; 1 "Hacks and glory await!" ; 2 "Hack and be merry!" ; 3 "Your hacking starts... NOW!" ; 4 "May the source be with you!" ; 5 "Take this REPL, brother, and may it serve you well." ; 6 "Lemonodor-fame is but a hack away!" ; 7 ,(format "%s, this could be the start of a beautiful program." ; 8 (slime-user-first-name))) "Scientifically-proven optimal words of hackerish encouragement.") ; Documentation
Victory for probability and Lisp.
Footnote too cool not to tell about:
A nice related question to ask is, “if there are N different sentences overall, then on average, how many times will it take to see all of them, if each time I select one at random?”
The best case, of course, is N times – if, by miraculous stroke of luck, each time you pick a sentence at random, you pick a different one. The worst case, potentially, is sort of never – there might be one which would, by a foul stroke of misfortune, never get selected (although the chances of this happening are practically 0). So, what is the average case?
Suppose that you have already seen k out of N sentences. Then the chance of seeing a new one is . So on average, you would need to attempts. So the total expected value is
In words, this is N, our repository size, times the harmonic series, . This series is divergent, meaning that as N goes to infinity, so does H. However, while it grows relatively quickly for small N, for larger ones additional elements barely have any effect. For example, when N = 11, H is about 3. If we want to double that, H = 6, then N = 227. If we want to double that, H = 12, then N = 91,380. So while the harmonic series’ effect on the expected value is not very large, it ensures that the average case will need more than N attempts.