Twitter github

Searching the GitHub Archive

Since it’s getting close to the end of the year, I’ve been working on generating some “year in review” reports for company related open source activity. As part of this effort, I stumbled across the GitHub Archive project by @igrigorik.

Essentially, it allows you to query the GitHub timeline (note: timeline data is available starting February 12, 2011). As an added convenience, the data is also available on Google BigQuery via a public dataset to make it easy to try out a few queries before downloading all the archive data locally. A couple downsides to Google BigQuery are that  the user interface is a bit clunky and there are query quotas in place (unless you sign up for the service), other than that it’s pretty straightforward to use.

Of course I got distracted from my initial task and started to search for profanity usage across the GitHub timeline in 2012. I mean, who isn’t interested in seeing how many instances of the seven dirty words they can find across the GitHub timeline? It all starts with a simple query (see the gist).

How many instances did I come across?

37,674 as of today (also posted a CSV online).

Here are some of my favorites:

After this distraction, I was happy add the new word “refuckulate” to my lexicon.

Enjoy grokking the GitHub Archive, I highly recommend it!