Ruby Meets BigQuery: Part 2
In my previous post I showed how I used BigQuery to figure out which gems are the most popular based on downloads. I also showed how to figure out which versions of gems were the most popular. But using download data had some drawbacks. One big company with a lot of servers that they frequently update (assuming they don’t vendor gems or use a machine image) could easily skew the numbers. Luckily another source of data is available.
GitHub data in BigQuery
Recently, Google Cloud Platform and GitHub made data from nearly three million open source repositories available on BigQuery. This data provides another way to measure gem popularity. I was excited about this because this gave me another way to measure the popularity of a gem. Instead of just looking at the raw number of times it had been downloaded I could also see how many projects included it in their Gemfile (assuming they had one).
The GitHub Dataset is large (more than 3TB). BigQuery can query the whole dataset quickly. Most of my queries took less than 30 seconds. But with every query I’m examining millions of rows that don’t contain Ruby files unnecessarily. Also, querying the whole dataset can get expensive.
To make my queries slightly faster and significantly cheaper I need to restrict my queries just look at the Gemfiles, Rakefiles, and .rb files. An easy way to do that is to extract just those rows into their own dataset.
Here’s the query that extracts all files named Gemfile, Gemfile.lock, or Rakefile. path
I put these in one table in my new dataset.
And this query extracts all the .rb files. I’m using the RIGHT
command to compare the last three characters of the file path to the string ‘.rb’.
These went into another table. I separated the Gemfiles and Rakefiles from the .rb files because their formats are different, so thoy require different queries to analyze.
Querying the dataset
Let’s see how many .rb files were extracted.
Count |
---|
19861839 |
So there were 19,861,839 files in the GitHub data set that ended in .rb. Gemfiles are occasionally used by non-ruby projects (for example with Cocoapods) so I expect a sizable number of Gemfiles and Rakefiles as well.
Count |
---|
251420 |
The counts aren’t particularly interesting. I wanted to know what gems were most common. One way to do that is to parse the Gemfiles and extract the gem names. To do this first I need to split the contents of the Gemfile into lines, which can be done with the SPLIT
function. After I’ve split the lines I can use REGEXP_EXTRACT
to extract the gem names. Finally, because I want to include dependencies as well as specified gems I’m going to run this query on only the Gemfile.lock.
gem |
---|
sinatra |
mongoid |
thin |
rake |
bcrypt-ruby |
rspec |
rack-test |
mongoid-rspec |
simplecov |
fivemat |
factory_girl |
timecop |
This gives us the gem names but it doesn’t give us any sort of ranking. It just lists all the gem names as it came to them. To figure out which gems are the most popular I need to do some grouping.
gem | n |
---|---|
rails | 33227 |
jquery-rails | 19709 |
uglifier | 18369 |
sass-rails | 17857 |
coffee-rails | 16145 |
rspec | 15829 |
rake | 15683 |
therubyracer | 14081 |
pg | 14029 |
unicorn | 13696 |
Conclusion
Using the Rubygems.org download data the most popular gems were rake, rack, multi_json, json, and bundler.
name | count |
---|---|
rake | 214,152,212 |
rack | 201,911,759 |
multi_json | 200,342,260 |
json | 191,430,173 |
bundler | 186,172,479 |
Using the GitHub data the most popular gems are rails, jquery-rails, uglifier, and sass-rails.
gem | n |
---|---|
rails | 33227 |
jquery-rails | 19709 |
uglifier | 18369 |
sass-rails | 17857 |
coffee-rails | 16145 |
rspec | 15829 |
rake | 15683 |
therubyracer | 14081 |
pg | 14029 |
unicorn | 13696 |
The most popular gems aren’t particularly consistent across sources. This is likely because some of the gems (rake and json) are default gems installed with every Ruby install. This will cause them to both have higher downloads than any other gem and also to not appear in Gemfiles since they are assumed to be downloaded already.
This discrepancy is why using two or more data sources is a good idea if you are trying to generalize about what is most popular. During elections seasons people making predictions combine information from several polls and use other data to draw their conclusions. We can do the same thing when determining the most popular Ruby gems by combining data from Rubygems.org and GitHub.
If you want to play with the GitHub data you can access it here.