Gutenberg book popularity distribution

Project Gutenberg is a very neat site: it offers free electronic books that are in the public domain in the United States (for the most part, books published before 1923). For the average user this means mostly “the classics” (for whatever form of “classics” you prefer; archaic and incomprehensible English not included).
Some books are more popular than others. The site’s “bestseller” is currently “Pride and Prejudice”, with about 25000 downloads in the last month. The majority of the most-downloaded books are indeed very well-known classics, although there are some exceptions (including #19, “The Romance of Lust: A Classic Victorian erotic novel”, which only narrowly exceeds the download rate of “The Picture of Dorian Gray”).
The natural question is, “how does book popularity fall as a function of rank?” That is, how much more popular is the most popular book than the second? The second than the third? And so on. Longtime readers of this blog (if they exist) already foresee me writing a Python script to go over all the books, but alas, the source code of the downloads page explicitly prohibits this:

--
DON'T USE THIS PAGE FOR SCRAPING.

Seriously. You'll only get your IP blocked.

Download http://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.
--

This is both great and a bit of a downer at the same time.
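For the curious, here is a minimal sketch of how the download counts could be pulled out of that catalog file without unpacking it to disk first. It is not the script I actually used. The function name `read_download_counts` is made up for illustration, and the idea that the counts live in elements whose local name is `downloads` is an assumption about the file's schema that you should check against the real data.

```python
import bz2
import xml.etree.ElementTree as ET


def read_download_counts(source, tag="downloads"):
    """Stream a bz2-compressed XML catalog and return download counts, largest first.

    source: a path or a binary file object holding the compressed catalog.
    ASSUMPTION: each count is stored in an element whose local name is `tag`,
    either directly as text or inside nested child elements.
    """
    counts = []
    depth = 0
    with bz2.open(source, "rb") as fh:
        for event, elem in ET.iterparse(fh, events=("start", "end")):
            if event == "start":
                depth += 1
                continue
            depth -= 1
            # Compare the tag name with its "{namespace}" prefix removed.
            if elem.tag.rsplit("}", 1)[-1] == tag:
                text = "".join(elem.itertext()).strip()
                if text.isdigit():
                    counts.append(int(text))
            if depth == 1:
                # A top-level record just ended; drop its contents to save memory.
                elem.clear()
    counts.sort(reverse=True)
    return counts
```

Sorting in descending order means the position in the returned list (plus one) is the rank, which is all the plots need.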

The compressed file is about 8 megabytes, but it unpacks into a whopping 250MB XML file. Without messing with it too much, I managed to extract the data for about 45,000 books. The popularity decays thusly:

rank1

Ok, that’s not very informative. Let’s try zooming in a bit:

rank2

That’s better. Here is something nice: 1) You can see that the first ~25 points have quite a bit of noise and are spread very far apart, while everything from ~25 onwards is much smoother. 2) The “most downloaded books” page on the Project Gutenberg site shows the 25 most downloaded books. Coincidence?

As for the general distribution: like all things in life, I suspect a power law, meaning something like y = ax^b, where x is the rank, y is the number of downloads, and b is some negative number. The easiest way to see if this is true is to take the log of both sides, giving us a linear relationship between \log y and \log x:

\log y = \log (a x^b) = \log a + b \cdot \log x
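In code, checking for a power law this way boils down to an ordinary least-squares line through the points (\log x, \log y). Below is a small sketch in plain Python, assuming the counts are already sorted from most to least downloaded and are all positive. The name `fit_power_law` is just for illustration.

```python
import math


def fit_power_law(counts):
    """Fit y = a * x**b by least squares on (log x, log y).

    counts: download counts sorted in descending order, all > 0.
    The rank x is the 1-based position in the list.
    Returns (a, b).
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope of the line = exponent of the power law
    log_a = mean_y - b * mean_x   # intercept of the line = log of the prefactor
    return math.exp(log_a), b
```

If the data really follow a power law, the returned b is the exponent and the log-log plot is a straight line. Any curvature in that plot means the simple model is off somewhere.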

The initial results aren’t that swell though:

loglog1

It is quite evident that the lower download rates (those below e^4 \approx 55) heavily skew our otherwise nearly linear relationship. We’ll do well to ignore them. This amounts to taking only about the first 10000 books:

loglog2

We can already get an OK linear fit, but now the beginning is a bit off. This can easily be remedied by assuming that the rank does not start at 1, but at some larger number, i.e. by shifting every rank by a constant. Manually checking gives a shift of 6 as a good result:

loglog3
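I picked the shift by hand, but the search is easy to automate. The sketch below is an assumption about how one might do it, not what produced the plot above: it tries every integer shift up to a hypothetical `max_offset` and keeps the one whose log-log line has the smallest sum of squared residuals. The helper names are made up for illustration.

```python
import math


def line_fit_error(xs, ys):
    """Return (slope, intercept, sum of squared residuals) of a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_rank_offset(counts, max_offset=20):
    """Scan integer shifts c in y = a * (x + c)**b; return (c, a, b) for the best fit.

    counts: download counts sorted in descending order, all > 0.
    "Best" here means the smallest squared error in log-log space.
    """
    ys = [math.log(c) for c in counts]
    best = None
    for offset in range(max_offset + 1):
        xs = [math.log(rank + offset) for rank in range(1, len(counts) + 1)]
        slope, intercept, sse = line_fit_error(xs, ys)
        if best is None or sse < best[3]:
            best = (offset, math.exp(intercept), slope, sse)
    offset, a, b, _ = best
    return offset, a, b
```

On real data the noisy top ranks will pull the result around a bit, so it is worth eyeballing the plot for the chosen shift rather than trusting the number blindly.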

Converting this back to the original plot:

rank3

Success! y = 1.823 \times 10^5 \cdot (x+6)^{-0.841}.
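With the fit in hand, the question from the beginning (how much more popular is each book than the next one down?) can be answered from the model. The sketch below just evaluates the fitted formula. The function names are made up, and the default parameters are the fitted values quoted above. Keep in mind that the fit only used roughly the first 10000 books and that the top ~25 ranks are noisy, so the model is only a rough guide at the very top.

```python
def predicted_downloads(rank, a=1.823e5, shift=6, b=-0.841):
    """Downloads predicted by the fitted model y = a * (rank + shift)**b."""
    return a * (rank + shift) ** b


def popularity_ratio(rank):
    """How many times more downloads `rank` gets than `rank + 1`, per the model."""
    return predicted_downloads(rank) / predicted_downloads(rank + 1)


if __name__ == "__main__":
    for rank in (1, 2, 10, 100, 1000):
        print(rank, round(predicted_downloads(rank)), round(popularity_ratio(rank), 4))
```

The prefactor cancels in the ratio, which works out to ((rank + 7) / (rank + 6))^{0.841}: always a bit above 1 and creeping towards 1 as the rank grows.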


One comment

  1. I bet there’s room to get to the top and beat Jane Austen just by using some more click-baity titles. One only needs to travel back in time and publish “10 ways to ensure peasants don’t rebel and chop off your head” or “You will never guess what happened after that lord exited his carriage” that would just kill the top lists 🙂
