Bash Distribution

Bash.org, the “Quote Database”, hosts a collection of user-submitted quotes from IRC. Users can submit excerpts from chat sessions they took part in on IRC channels, and if the excerpts are approved, they are uploaded to the site for everyone to see. Viewers can then browse the site and rank the quotes, giving each one either a +1 or a -1. Some of the popular ones have a score of over 30,000. Of course, in the days of YouTube and Facebook, this concept isn’t new to us, but the site has been around since 1999, when I could use either my telephone or the Internet but not both, and the modem made sure that everyone in the house knew that you were connected.
The site is fun to browse, but I think the distribution of quote rankings is interesting in its own right. How many high-ranking quotes are there? How many are there with negative scores? More generally, can we find a function that describes how many quotes have a certain score x? The first thing that comes to mind, of course, is the traditional Gaussian distribution, which appears in many places in nature. But we must remember that scores are not the sum of many independent random variables: it’s not as if, for every quote, a viewer votes +1 or -1 at random. There are higher quality quotes and lower quality ones, and we expect that this will somehow be reflected in the distribution. In fact, history and nature have taught us that these distributions often come in the form of a power law, meaning that the number of quotes having a score of x is a function of the following form, with constants \alpha and c:

f(x) = \alpha x^{c}

Well, you can’t know for sure until you find out. So, faithful to my tradition, I wrote a small Python script which goes over all the quotes on Bash.org and saves their ratings. There are 22,083 quotes currently stashed in their DB, and here is their distribution:

Notice the very long tail of the positive ratings, and the relatively shorter tail of the negative ones. Also note that the curve is asymmetric, and falls off much more slowly to the right than to the left. The majority of the quotes are positively ranked. Indeed, this does not look like any Gaussian. Here again is the distribution, but with the scores divided into bins of size 10 to give a smoother line. In red is the Gaussian which has the same mean and the same standard deviation as the quote distribution.
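For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the binning and of the matching Gaussian. It is not the script I used for the plots; the function names bin_scores and matching_gaussian are made up for illustration, and the scores in the example are invented, not real Bash.org data.

```python
import math
from collections import Counter


def bin_scores(scores, width=10):
    """Count how many scores fall into each bin [start, start + width)."""
    counts = Counter((s // width) * width for s in scores)
    return dict(sorted(counts.items()))


def matching_gaussian(scores, width=10):
    """Return a function giving the expected number of quotes per bin
    under a Gaussian with the same mean and standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)

    def expected_count(bin_start):
        center = bin_start + width / 2
        density = math.exp(-((center - mean) ** 2) / (2 * std ** 2))
        density /= std * math.sqrt(2 * math.pi)
        # Density times bin width approximates the probability of the bin.
        return n * width * density

    return expected_count


# Invented scores, purely for illustration (not real Bash.org data).
scores = [-25, -3, 0, 4, 7, 12, 15, 18, 33, 250]
binned = bin_scores(scores)
gauss = matching_gaussian(scores)
for start, count in binned.items():
    print(start, count, round(gauss(start), 2))
```

Plotting the binned counts next to the Gaussian values for the same bins gives the kind of picture shown here.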

The quote score distribution indeed follows a power law – two power laws, actually: one for the ascending left side, and one for the descending right side.
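One simple way to estimate \alpha and c for one side of the curve is a straight-line fit on a log-log scale, since f(x) = \alpha x^{c} becomes log f = log \alpha + c log x. The sketch below is only an illustration of that idea, not necessarily how the curves above were obtained; fit_power_law is a hypothetical helper, and the data is synthetic. Log-log least squares is known to be a rough estimator, so treat the result as a first look rather than a rigorous fit. For the left side, one could fit against the absolute value of the score.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(alpha) + c * log(x).

    Points with non-positive x or y are skipped, since they have no logarithm.
    Returns the pair (alpha, c).
    """
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pts)
    mean_x = sum(p[0] for p in pts) / n
    mean_y = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    c = sxy / sxx
    alpha = math.exp(mean_y - c * mean_x)
    return alpha, c


# Synthetic data generated from a known power law, for illustration only.
xs = [10, 20, 40, 80, 160]
ys = [1000.0 * x ** -1.5 for x in xs]
alpha, c = fit_power_law(xs, ys)
print(alpha, c)
```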
We could have stopped our analysis here and have been happy about it, but two weeks ago I read Mark Buchanan’s “The Social Atom”, which talks about how complex social phenomena can be explained by simple atomistic human behavior. For example, the distribution of wealth across a market obeys a power law, and can be generated by a simple money-trading model. The book itself is a rather light read and is very interesting; I recommend it to anyone interested in economics, sociology, physics, or computer modeling.
Inspired by the book, I created a simple model for ranking the quotes. In our model, there are people, and there are bash quotes. Each quote has a quality value, uniformly distributed, and each person has a “standard”, also uniformly distributed. A person likes a quote if its quality is above his standard. Thus, there are some people with high standards who will like only a small number of quotes, and some people with lower standards who will be entertained by most anything thrown at them. If a person dislikes a quote, he will lower its score by 1. If a person likes a quote, he will give it a +1 rating, and also share it with his friends. Thus, low quality quotes remain unheard of by most of the population, and high quality ones spread throughout the population.
Initially, these quotes are all on the web, and each person has a certain chance of “stumbling upon” a certain quote: running across it on the Internet in a search, reaching it through some external site, or just visiting Bash.org. The simulation is then as follows: for each quote, see who stumbles upon it, and let them tell their friends, until it stops spreading through the network.
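The description above can be sketched roughly as follows. This is a reconstruction from the prose, not the original simulation code, and it makes several assumptions of its own: friendships are random and one-directional, each person sees and votes on a quote at most once, and the names build_network, score_quote and simulate, as well as the default parameter values, are invented for illustration.

```python
import random
from collections import deque


def build_network(n_people, n_friends, rng):
    """Give each person n_friends distinct random friends (never himself)."""
    friends = []
    for i in range(n_people):
        picks = rng.sample(range(n_people - 1), n_friends)
        # Shift indices at or above i by one so that i is skipped.
        friends.append([j if j < i else j + 1 for j in picks])
    return friends


def score_quote(quality, standards, friends, p_stumble, rng):
    """Spread one quote through the network and return its final score."""
    # People who stumble upon the quote on their own.
    seen = {i for i in range(len(standards)) if rng.random() < p_stumble}
    queue = deque(seen)
    score = 0
    while queue:
        person = queue.popleft()
        if quality > standards[person]:
            score += 1  # likes it: votes +1 and tells his friends
            for friend in friends[person]:
                if friend not in seen:
                    seen.add(friend)
                    queue.append(friend)
        else:
            score -= 1  # dislikes it: votes -1 and keeps it to himself
    return score


def simulate(n_people=500, n_quotes=200, n_friends=3, p_stumble=0.01, seed=0):
    """Return the list of final scores, one per quote."""
    rng = random.Random(seed)
    standards = [rng.random() for _ in range(n_people)]
    friends = build_network(n_people, n_friends, rng)
    return [
        score_quote(rng.random(), standards, friends, p_stumble, rng)
        for _ in range(n_quotes)
    ]


scores = simulate()
print(min(scores), max(scores))
```

The interesting knobs are n_friends and p_stumble: changing them changes how far a well-liked quote can travel before it runs out of new people.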
This is obviously a very simple model, and there are a lot of factors which aren’t taken into consideration. For example, there are users who run across bash quotes and spread them to their friends, but do not rank them. There are also people who rank them, but do not spread them. And of course, the way we decide whether a person likes a quote or not seems quite arbitrary.
Nonetheless, the model, when run with the correct parameters, shows some features of the real distribution which we saw above. Here are the results from a run, where each agent has three friends with whom he shares the quotes. Again, we divide into bins of 10 for a smooth line.

The entire power curve is shifted to the left into the negative region, and there is no sign whatsoever of the negative tail. However, despite these setbacks, we can be quite happy that we managed to capture one of the main features of the original distribution: the power-law fall-off of the high-ranking quotes, and the long tail. I’m quite sure that not many modifications are needed in order to bring the left side of the graph into the model – some feature which allows negatively ranked quotes to propagate even though the users have no reason to annoy their friends with low quality content.
Using this model, we can try to discover other phenomena which are not present in the original Bash site. For example, what happens if the site is “heavily moderated”, and initially only one person discovers each quote? In this case, he alone decides whether the quote will be distributed into the network, so the shape of the graph also depends greatly on how low or high his standard is.
Further, the results described above were achieved by setting the number of friends with whom each person shares quotes to a relatively low number – three or four. This may fit the Bash.org model, since it is an old and rather niche site. However, popularity contests on sites like YouTube should look vastly different, since their content is often shared on social networks such as Facebook, and the “number of friends” parameter can skyrocket to several hundred. This parameter has a significant influence on the shape of the curve.
Overall, we saw how we can explain the popularity or rating distribution using a model based on a social network of people who share content with each other. There are still many things we can do with such a model, and the specific one I gave certainly does not fit every site on the net that lets users rank its items. The beauty is that once we have built a model that fits our initial data, we can test it using other parameters in order to make predictions, or to say how the world would have behaved if things were a little different.
