Busted by Google

I am currently reading Jules Verne’s “Twenty Thousand Leagues Under the Sea”. It’s a rather spectacular book, and I encourage all scientists among you to read it, if you haven’t already. It has awakened in me a sort of awe towards discovery, and the naivety of curiosity. Of course, non-scientists are welcome to read it as well, perhaps even more so; it may spark within them an urge to go forth and explore.
The book itself is rather famous, and while I suppose that slightly more people have read Harry Potter, it does ring a bell to most of the Western world. However, how many variations are there on the title’s number? Surely, most people would know what “20,000 Leagues Under the Sea” is. But what about “10,000 Leagues Under the Sea”? Two leagues? Are there any references to such titles? Are there spoofs or spin-offs? Do such things even exist? I know that it is probably much more common to alter something other than the number in the title (for example, Spongebob’s “20,000 Patties Under the Sea”), but still, the question lingered.
Remembering XKCD’s wonderful tactic for solving problems and finding the popularity of virtually all things, I decided to go to Google and see what the global community has to say on the matter (btw – do NOT look at the video that the comic talks about). Sure enough, “20,000 Leagues Under the Sea” has about 1,140,000 results in Google. Changing it to “2 leagues” yields only about 3,760 results. “19,997 leagues”, however, returns nothing.
Our method is therefore obvious: iterate through all integers from 1 to 20,000, and plot a graph of hits vs. number of leagues. Prior to starting this task, I hypothesized that a straightforward plot would not do, for a number as high as a million would dwarf any result that is not itself very large. The graph would be rather dull. It would hence make sense to apply a mathematical function to the data first, such as taking the logarithm of the number of results (excluding 0, of course, which might as well remain 0).
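As a minimal sketch of that transformation (written in Python 3, with a hypothetical helper name `log_hits` that is not part of the script below), the logarithm can be applied while leaving zero counts at zero. The sample values are the hit counts quoted above.

```python
import math

def log_hits(hits):
    """Return log10 of a hit count, keeping 0 as 0 (the log of 0 is undefined)."""
    return math.log10(hits) if hits > 0 else 0.0

# Hit counts quoted in the text for 20,000, 2 and 19,997 leagues.
for hits in (1140000, 3760, 0):
    print(hits, round(log_hits(hits), 2))
```

One caveat of this convention: a title with exactly one hit also maps to 0, so it becomes indistinguishable from a title with no hits at all.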
With this in mind, I set out to write the Python script to carry out the job. urllib2 was used to fetch the search page, and the parsing was written in a horrible, non-extensible manner, because I was too lazy to work with a proper parsing library. The code looks something like this:

import urllib
import urllib2

base_url = """http://www.google.com/search?hl=en&source=hp&q="%s+Leagues+Under+The+Sea"&btnG=Google+Search"""
user_agent = "Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)"
headers = { "User-Agent" : user_agent }

k = 1
n = 20000
about_marker = "about <b>"
end_marker = "</b>"

file_to_write = open(r"D:\results.txt", "wb")

for i in xrange(k, n+1):
    print "Processing request number %s" % i
    url = base_url % i
    req = urllib2.Request(url, headers=headers)
    try:
        response = urllib2.urlopen(req)
        data = response.read()
        if data.find("No results found for") != -1:
            # Google reports no exact matches for this title
            number = 0
        else:
            # The hit count sits between "about <b>" and the next "</b>"
            about_index = data.find(about_marker)
            beginning_of_num_buffer = data[about_index + len(about_marker):]
            end_index = beginning_of_num_buffer.find(end_marker)
            string_number = beginning_of_num_buffer[:end_index]
            number = int(string_number.replace(",", ""))
        file_to_write.write("%s\t%s\n" % (i, number))
    except Exception, e:
        # Network errors and unexpected page layouts both end up here
        print "Oh no! %s" % repr(e)
file_to_write.close()
print "All done!"

In the end, we get a generated file, “results.txt”, which contains a tab-separated list mapping each number of leagues to the number of matching pages. This can easily be imported into Excel or OpenOffice Calc by simple copy-paste and, after a little processing as mentioned above, plotted.
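For those who would rather stay in Python than use a spreadsheet, here is a rough sketch (Python 3) of reading that tab-separated format back in. The function name `load_results` and the sample text are made up for illustration; the sample merely reuses the three counts quoted earlier.

```python
def load_results(text):
    """Parse 'leagues<TAB>hits' lines into a list of (leagues, hits) tuples."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        leagues, hits = line.split("\t")
        pairs.append((int(leagues), int(hits)))
    return pairs

# Illustrative sample in the same format the script writes.
sample = "2\t3760\n19997\t0\n20000\t1140000\n"
for leagues, hits in load_results(sample):
    print(leagues, hits)
```

The resulting list of pairs can then be handed to whatever plotting tool is at hand.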
Unfortunately, Google outsmarted me, and I have failed in my task. Even though my own wishes were innocent, and my cause pure, there are others on the internet with much more insidious desires, whose only craving is destruction (or more formally: DoS bots and other pests). While things started out rather smoothly, after about a hundred GET requests I noticed that I kept getting exceptions stating that the site was unavailable. When I opened Google in my browser, my heart sank:

Ouch. They were on to me. But I didn’t mean anything bad! I only wanted information…
The next logical move would be to try to somehow avoid their automated detection. My current proposal is to wait a while between requests. For example, adding “time.sleep(random.randint(0,7) + 3)” at the end of the loop will wait 3-10 seconds before the next request. Hopefully, this will prevent me from being caught. On the downside, it makes the whole process much more time consuming: if every request takes about two seconds, then the average time per request is about 9 seconds. Twenty thousand such requests take 50 hours, just over two days. Still, it might be worthwhile to try it out, if only for the slight possibility that Google’s automatic detection can be avoided.
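The arithmetic behind that estimate can be sketched as follows (Python 3). The helper names `polite_delay` and `estimated_hours` are hypothetical, and the two-second request time is the assumption stated above, not a measurement.

```python
import random

def polite_delay(low=3, high=10):
    """Pick a random whole number of seconds between low and high, inclusive."""
    return random.randint(0, high - low) + low

def estimated_hours(requests, request_seconds=2.0, low=3, high=10):
    """Estimate total run time, assuming each request is followed by one delay."""
    average_delay = (low + high) / 2.0  # mean of a uniform integer delay
    return requests * (request_seconds + average_delay) / 3600.0

print(polite_delay())
print(round(estimated_hours(20000), 1))

# Inside the request loop one would call time.sleep(polite_delay()).
```

With the unrounded average of 8.5 seconds per request, the total comes to roughly 47 hours, slightly under the 50-hour figure obtained by rounding up to 9 seconds.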

(Note: there is no guarantee that this script will work for everyone all the time; it is based on the reply I got from Google, from my own personal computer, at a certain time)

