A while ago a couple of friends and I went on a small hiking trip. We live in the city, so the only way to get to the land of pastoral meadows and magical rivers is by car, and the more carpooling, the merrier. I was to be picked up at 9:15. But on that day, cometh 9:15, and lo and behold! There is no ride in sight. He only arrived several minutes later. But how can this be? Both my friend and I are very punctual. It is not like him to be late. Why didn’t we meet at the same time that we both had agreed upon earlier?
There are lots of potential outside reasons, such as heavy traffic, forgetting the keys to the car, or an elephant going into a frenzied rampage on the road. But even if no paranormal activity occurs, and everything goes according to plan, it’s still possible to arrive uncoordinated. Remember how in bank robbery movies, the gang members always synchronize their watches prior to the heist? This is done frequently in movies, but very rarely in everyday life. Generally, we tend not to adjust our watches if they deviate a little from our friends’; what’s more, each friend’s watch shows a different time, so we couldn’t set ours according to just one. My friend and I could both arrive at 9:15 at the rendezvous location, but still not arrive at the same time – each going according to his own watch.
An interesting thing to find out would be the distribution of time in your everyday environment. Not the real physical time, which is actually distributed according to general relativity, but the time indicated by watches and clocks all around you. You might think that within the same household, there isn’t much room for variance, that all clocks would point at roughly the same time, say, to within a minute or two. This certainly wasn’t the case in my house, nor in other places that I have visited.
So with this goal in mind – to find out the time distribution of my environment – I set out and questioned every timekeeping device in my house, and some of those in my workplace. You would be surprised at how many such devices can be found in just one home. There are obvious clocks, like the ones you put on walls to actually know what the time is, as well as your cellphones and wristwatches. But there are also microwave ovens, VCRs, stereo sets, digital cameras, and even regular household phones.
I created a list of differences between the time my own watch showed and the time each other device showed. Some clocks obviously hadn’t been set in a long while, and differed from mine by as many as 10 minutes, in both directions – one running early, the other late. However, these were the extremes, and most clocks agreed with mine to within about a minute and a half. Interestingly, more clocks ran ahead of mine, showing a later time, than ran behind. This means that on average, if everyone acted precisely according to their own clock, I would be the late one more often than not.
What distribution should I expect? A Gaussian? A power law? The obvious next step would be to plot the results and see what they look like. What should I plot, though? All I have is a list of time differences – it’s not something that can be plotted x vs. y. The proper thing to do would be to make a histogram. If my bin size is, say, 20 seconds, I would see how many clocks were ahead by up to 20 seconds, how many by 20-40 seconds, how many by 40-60, and so on. I started to do this, too, but then hit a barrier – how do I know which bin size to use?
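To make the binning concrete, here is a minimal Python sketch of counting offsets into fixed-width bins. The `offsets` list is made up for illustration (it is not my actual measurements), and the function name `histogram` is just a hypothetical helper.

```python
import math
from collections import Counter

def histogram(offsets, bin_size):
    """Count offsets per bin; bin k covers [k*bin_size, (k+1)*bin_size)."""
    counts = Counter(math.floor(x / bin_size) for x in offsets)
    # Key each bin by its left edge, in ascending order.
    return {k * bin_size: counts[k] for k in sorted(counts)}

# Made-up offsets in seconds (positive = the clock is ahead of my watch).
offsets = [-95, -30, -12, 5, 8, 18, 22, 24, 41, 70, 130, 600]

for size in (20, 25):
    print(size, histogram(offsets, size))
```

Even on this toy list, moving from 20-second to 25-second bins shifts several clocks into the central bin, which is exactly the sensitivity the question is about.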
Here is an example that demonstrates the difference. The top picture has a bin size of 20 seconds, while the bottom has a bin size of 25 seconds.
While the differences aren’t world shattering – the general shape stays the same – we can see that the bin size does influence the shape of the sharp decline near the center of the histogram. Of course, bigger differences in bin size lead to bigger differences in histogram shape.
Histograms are very important statistical tools, and there are suitable theories for predicting the optimal bin size for histograms. Given an estimate of the deviation and the number of samples, the number of bins which minimizes errors can be obtained (see Wikipedia). Most of these make strong assumptions about the data – for example, that they are normally distributed.
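For reference, two commonly cited bin-width rules can be sketched in a few lines of Python. These are standard textbook rules, not necessarily the ones I had in mind above: Scott's rule assumes roughly normal data, while the Freedman-Diaconis rule uses the interquartile range instead of the standard deviation. The function names and the sample data are made up for illustration.

```python
import statistics

def scott_bin_width(data):
    """Scott's normal reference rule: h = 3.49 * sigma * n^(-1/3)."""
    n = len(data)
    return 3.49 * statistics.stdev(data) * n ** (-1 / 3)

def freedman_diaconis_bin_width(data):
    """Freedman-Diaconis rule: h = 2 * IQR * n^(-1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return 2 * (q3 - q1) * len(data) ** (-1 / 3)

# Made-up offsets in seconds, including one clock that is 10 minutes off.
offsets = [-95, -30, -12, 5, 8, 18, 22, 24, 41, 70, 130, 600]

print("Scott:", scott_bin_width(offsets))
print("Freedman-Diaconis:", freedman_diaconis_bin_width(offsets))
```

With a far-off outlier in the list, the standard deviation is inflated and Scott's rule suggests a much wider bin than Freedman-Diaconis does, which hints at why normality assumptions are a poor fit for this kind of data.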
Instead of using those methods and just choosing a bin size, I decided to smooth things out, and create an average histogram. The technique involves using what is called “kernel smoothing”. Suppose I sampled a physical variable, and got a series of data points. I want to draw a continuous graph representing that variable, but alas, I only have a finite number of points. To top things off, each measured point has errors, so it only roughly represents the measured variable. Here is a nice example, adapted from Wikipedia:
Kernel smoothing tries to reconstruct the original variable by looking at the weighted average of adjacent points. We go along the X axis in small steps, and at each step we look at the data points located a short horizontal distance away. The values are then smoothed using a “kernel” function. This function should give greater weight to nearby points, and lesser weight to points farther away. So the value of the continuous physical variable at any point X0 is determined strongly by the close data points, and weakly by remote data points. Wikipedia’s example boasts not-too-bad results:
Bell-shaped curves, such as Gaussians or sixth-degree polynomials, are frequently used as kernels in this technique. This smoothing method provides a good approximation, and is relatively simple. For this reason it is also used in SPH – Smoothed Particle Hydrodynamics – which is a technique to simulate the dynamics and flow of fluids.
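The weighted-average idea can be sketched as follows. This is a minimal illustration using a Gaussian kernel over all data points (a form often called a Nadaraya-Watson smoother); the function names, the `bandwidth` parameter and the sample points are assumptions made for the example, not the exact code I used.

```python
import math

def gaussian_kernel(u):
    """Bell-shaped weight: 1 at u = 0, falling off quickly as |u| grows."""
    return math.exp(-0.5 * u * u)

def kernel_smooth(xs, ys, x0, bandwidth):
    """Weighted average of ys, with weights decaying with distance from x0."""
    weights = [gaussian_kernel((x - x0) / bandwidth) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Made-up noisy samples of some quantity.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

# Walk along the X axis in small steps and evaluate the smoothed curve.
for i in range(9):
    x0 = i * 0.5
    print(x0, round(kernel_smooth(xs, ys, x0, bandwidth=0.7), 3))
```

A larger `bandwidth` averages over more neighbors and gives a smoother but flatter curve; a smaller one follows the noisy points more closely.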
So how did I create “the average histogram”? I iterated over a large number of bin sizes, and computed the histogram each time. I assumed that the value of each bar in the histogram is related to the center of the histogram bin. That means that if the bin “20-40 seconds” had 15 hits, I would treat it as if there were 15 clocks which differed from mine by 30 seconds. Putting all of the data in one graph, you get a messy graph with oodles of points:
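The pooling step can be sketched roughly like this. It follows the description above (one point per non-empty bin, placed at the bin center, with the raw count as its height); the function names, the range of bin sizes and the sample offsets are made up for illustration.

```python
import math
from collections import Counter

def histogram_points(offsets, bin_size):
    """Return one (bin center, count) point per non-empty bin."""
    counts = Counter(math.floor(x / bin_size) for x in offsets)
    return [((k + 0.5) * bin_size, c) for k, c in sorted(counts.items())]

def pooled_points(offsets, bin_sizes):
    """Concatenate the histogram points obtained for every bin size."""
    points = []
    for size in bin_sizes:
        points.extend(histogram_points(offsets, size))
    return points

# Made-up offsets in seconds (positive = the clock is ahead of my watch).
offsets = [-95, -30, -12, 5, 8, 18, 22, 24, 41, 70, 130, 600]

cloud = pooled_points(offsets, range(10, 61, 5))  # bin sizes 10, 15, ..., 60
print(len(cloud), "points, e.g.", cloud[:3])
```

One thing to keep in mind with raw counts is that wider bins naturally collect more hits, so points coming from different bin sizes are not strictly on the same scale; dividing each count by its bin size would be one possible way to compensate.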
There are some duplicate points here, but that’s alright – everything will get smoothed away. Applying the kernel smoother, we get a much nicer graph:
Notice the hole near the right-hand side – this is because the smoother doesn’t look at a fixed number of neighbors, but rather at all neighbors within a fixed distance. If there is a lack of data, there might be holes or inaccuracies in the resulting graph.
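Here is a small sketch of why a fixed-distance smoother can leave holes. It assumes a kernel that is exactly zero beyond a given radius (an Epanechnikov kernel is used here purely as an example; it is not necessarily the one behind the graph above), and the names and sample points are hypothetical.

```python
def epanechnikov(u):
    """Compact kernel: positive for |u| < 1, exactly zero outside."""
    return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0

def smooth_within_radius(points, x0, radius):
    """Weighted average of the points within `radius` of x0, or None if none."""
    num = den = 0.0
    for x, y in points:
        w = epanechnikov((x - x0) / radius)
        num += w * y
        den += w
    return num / den if den > 0 else None

# Made-up (x, y) points with a wide empty stretch between x = 10 and x = 100.
points = [(0, 1), (10, 3), (100, 2)]

print(smooth_within_radius(points, 5, radius=10))   # two neighbors in range
print(smooth_within_radius(points, 50, radius=20))  # no neighbors: a hole
```

Returning `None` where no neighbors are in range makes the gap explicit instead of silently dividing by zero or inventing a value.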
The statistical value of this graph is somewhat dubious. It’s not inherently evident that smoothing a vast range of histograms gives an accurate description of the real distribution. Still, I think this is a nice result, and it helps us understand somewhat what the distribution is at least supposed to look like: we know that it should be similar to a single, regular histogram, with a fixed bin size. As to what sort of graph this is – well, try as we might, it couldn’t be a Gaussian. The middle part falls very quickly, but there are still relatively many values at the faraway points – remember that there were clocks which differed from mine by up to 10 minutes. Some sort of power law is much more probable, although it may be that no simple equation describes the real distribution of time.