Project Gutenberg is a very neat site: it offers free electronic books that are in the public domain in the United States (for the most part, books published before 1923). For the average user this means mostly “the classics” (for whatever form of “classics” you prefer; archaic and incomprehensible English not included).
Some books are more popular than others. The site’s “bestseller” is currently “Pride and Prejudice”, with about 25000 downloads in the last month. The majority of the most-downloaded books are indeed very well-known classics, although there are some exceptions (including #19, “The Romance of Lust: A Classic Victorian erotic novel”, which only narrowly exceeds the download rate of “The Picture of Dorian Gray”).
The natural question is, “how does book popularity fall as a function of rank?” Meaning, how much more popular is the most popular book compared to the second? The second to the third? And so on. Longtime readers of this blog (if they exist) already foresee me writing a Python script to go over all the books, but alas, the source code of the downloads page explicitly prohibits this:
DON'T USE THIS PAGE FOR SCRAPING.
Seriously. You'll only get your IP blocked.
Download http://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.
This is both great and a bit of a downer at the same time.
The compressed data is about 8 megabytes, but it unpacks into a whopping 250MB XML file. Without messing too much with it, I managed to extract the data of about 45,000 books. The popularity decays thusly:
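For the extraction step, here is a minimal sketch of how the download counts could be pulled out without loading the whole file into memory. It is not the script used for this post. The element names `etext` (one record per book) and `downloads` (the counter inside it) are assumptions about the catalog's layout, so check them against the actual file before relying on this.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_download_counts(stream, record_tag="etext", count_tag="downloads"):
    """Return download counts, one per book record, sorted high to low.

    record_tag and count_tag are assumed names, not verified against
    the real catalog file.
    """
    counts = []
    # iterparse streams the document, so the 250MB file is never held whole.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != record_tag:
            continue
        for child in elem.iter():
            if local_name(child.tag) == count_tag:
                # The number may sit in a nested element, so gather all text.
                digits = "".join(child.itertext()).strip()
                if digits.isdigit():
                    counts.append(int(digits))
                break
        elem.clear()  # drop the finished record's children to limit memory use
    return sorted(counts, reverse=True)
```

With the file from the notice above saved locally, usage would look like `with bz2.open("catalog.rdf.bz2", "rb") as f: counts = read_download_counts(f)`. Sorting in descending order means the position in the list (plus one) is the book's rank.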
Ok, that’s not very informative. Let’s try zooming in a bit:
That’s better. Here is something nice: 1) You can see that the first ~25 points have quite a bit of noise and are spread very far apart, while everything from ~25 onwards is much smoother. 2) The “most downloaded books” page in the Project Gutenberg site shows the first 25 most downloaded books. Coincidence?
As for the general distribution: like all things in life, I suspect a power law, meaning something like $d = a r^b$, where $d$ is the download count, $r$ is the rank, and $b$ is some negative number. The easiest way to see if this is true is to take the log of both sides, giving us a linear relationship: $\log d = \log a + b \log r$.
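A minimal sketch of that check, assuming the model downloads ≈ a · rank^b: take the log of both rank and download count, then fit an ordinary least-squares line, whose slope is b and whose intercept is log a. The function name `fit_power_law` is made up for this illustration, and it uses plain Python rather than whatever tool produced the plots here.

```python
import math


def fit_power_law(downloads):
    """Least-squares line through (log rank, log downloads).

    downloads must be sorted from most to least downloaded; rank starts at 1.
    Returns (a, b) for the model downloads ~ a * rank**b.
    """
    points = [(math.log(rank), math.log(count))
              for rank, count in enumerate(downloads, start=1)
              if count > 0]  # log(0) is undefined, so skip empty entries
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx                       # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)   # intercept is log(a)
    return a, b
```

If the data really follow a power law, the points fall on a straight line in log-log space and the returned `b` is negative.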
The initial results aren’t that swell though:
It is quite evident that the lowest download rates heavily skew our otherwise close-to-linear relationship. We’ll do well to ignore them. This amounts to taking only about the first 10000 books:
We can already get an OK linear fit, but now the beginning is a bit off. This can easily be remedied by assuming that the rank does not start with 1, but with some larger number. Manually checking gives 6 as a good result:
Converting this back to the original plot:
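The starting rank was picked by hand here, but the same idea can be sketched as a small search: try each candidate starting rank, fit a line in log-log space, and keep the candidate with the smallest squared residual. The helper names (`log_log_fit`, `best_first_rank`, `predict`) and the candidate range of 1 to 20 are assumptions made for this illustration. The last function converts the fitted line back into a plain downloads-versus-rank curve.

```python
import math


def log_log_fit(ranks, downloads):
    """Return (slope, intercept, sum of squared residuals) in log-log space."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(d) for d in downloads]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, residual


def best_first_rank(downloads, candidates=range(1, 21)):
    """Try each starting rank and keep the one with the smallest residual.

    downloads must be positive and sorted from most to least downloaded.
    Returns (first_rank, slope, intercept, residual).
    """
    best = None
    for first in candidates:
        ranks = range(first, first + len(downloads))
        slope, intercept, residual = log_log_fit(ranks, downloads)
        if best is None or residual < best[3]:
            best = (first, slope, intercept, residual)
    return best


def predict(rank, first, slope, intercept):
    """Map an ordinary 1-based rank back to a predicted download count."""
    return math.exp(intercept) * (rank + first - 1) ** slope
```

A starting rank of 6 means the most popular book is treated as if it sat at position 6, so `predict(1, 6, slope, intercept)` evaluates the power law at 6.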