Ruby Meets BigQuery: Part 2

This is part of the Playing with Data series.

In my previous post I showed how I used BigQuery to figure out which gems are the most popular based on downloads. I also showed how to figure out which versions of gems were the most popular. But using download data had some drawbacks. One big company with a lot of servers that they frequently update (assuming they don’t vendor gems or use a machine image) could easily skew the numbers. Luckily another source of data is available.

GitHub data in BigQuery

Recently, Google Cloud Platform and GitHub made data from nearly three million open source repositories available on BigQuery. This data provides another way to measure gem popularity. I was excited about this because this gave me another way to measure the popularity of a gem. Instead of just looking at the raw number of times it had been downloaded I could also see how many projects included it in their Gemfile (assuming they had one).

The GitHub Dataset is large (more than 3TB). BigQuery can query the whole dataset quickly. Most of my queries took less than 30 seconds. But with every query I’m examining millions of rows that don’t contain Ruby files unnecessarily. Also, querying the whole dataset can get expensive.

To make my queries slightly faster and significantly cheaper I need to restrict my queries just look at the Gemfiles, Rakefiles, and .rb files. An easy way to do that is to extract just those rows into their own dataset.

Here’s the query that extracts all files named Gemfile, Gemfile.lock, or Rakefile. path I put these in one table in my new dataset.

1
2
3
SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE path IN ('Gemfile', 'Gemfile.lock', 'Rakefile')

And this query extracts all the .rb files. I’m using the RIGHT command to compare the last three characters of the file path to the string ‘.rb’.

1
2
3
SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 3) = '.rb'

These went into another table. I separated the Gemfiles and Rakefiles from the .rb files because their formats are different, so thoy require different queries to analyze.

Querying the dataset

Let’s see how many .rb files were extracted.

1
2
SELECT count(*) num
FROM [rb_files]
Count
19861839

So there were 19,861,839 files in the GitHub data set that ended in .rb. Gemfiles are occasionally used by non-ruby projects (for example with Cocoapods) so I expect a sizable number of Gemfiles and Rakefiles as well.

1
2
SELECT count(*) num
FROM [gem_rake_contents]
Count
251420

The counts aren’t particularly interesting. I wanted to know what gems were most common. One way to do that is to parse the Gemfiles and extract the gem names. To do this first I need to split the contents of the Gemfile into lines, which can be done with the SPLIT function. After I’ve split the lines I can use REGEXP_EXTRACT to extract the gem names. Finally, because I want to include dependencies as well as specified gems I’m going to run this query on only the Gemfile.lock.

1
2
3
4
5
6
SELECT REGEXP_EXTRACT(line, r"\s*gem\s['\"](.*?)['\"]") as gem
FROM (
  SELECT SPLIT(content, '\n') as line
  FROM github_ruby.gem_rake_contents
)
HAVING gem IS NOT NULL
gem
sinatra
mongoid
thin
rake
bcrypt-ruby
rspec
rack-test
mongoid-rspec
simplecov
fivemat
factory_girl
timecop

This gives us the gem names but it doesn’t give us any sort of ranking. It just lists all the gem names as it came to them. To figure out which gems are the most popular I need to do some grouping.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
SELECT
  gem,
  COUNT(*) AS n
FROM (
  SELECT
    REGEXP_EXTRACT(line, r"\s*gem\s['\"](.*?)['\"]") AS gem
  FROM (
    SELECT
      SPLIT(content, '\n') AS line
    FROM
      github_ruby.gem_rake_contents )
  HAVING
    gem IS NOT NULL )
GROUP BY
  gem
ORDER BY
  n DESC
LIMIT
  10
gem n
rails 33227
jquery-rails 19709
uglifier 18369
sass-rails 17857
coffee-rails 16145
rspec 15829
rake 15683
therubyracer 14081
pg 14029
unicorn 13696

Conclusion

Using the Rubygems.org download data the most popular gems were rake, rack, multi_json, json, and bundler.

name count
rake 214,152,212
rack 201,911,759
multi_json 200,342,260
json 191,430,173
bundler 186,172,479

Using the GitHub data the most popular gems are rails, jquery-rails, uglifier, and sass-rails.

gem n
rails 33227
jquery-rails 19709
uglifier 18369
sass-rails 17857
coffee-rails 16145
rspec 15829
rake 15683
therubyracer 14081
pg 14029
unicorn 13696

The most popular gems aren’t particularly consistent across sources. This is likely because some of the gems (rake and json) are default gems installed with every Ruby install. This will cause them to both have higher downloads than any other gem and also to not appear in Gemfiles since they are assumed to be downloaded already.

This discrepancy is why using two or more data sources is a good idea if you are trying to generalize about what is most popular. During elections seasons people making predictions combine information from several polls and use other data to draw their conclusions. We can do the same thing when determining the most popular Ruby gems by combining data from Rubygems.org and GitHub.

If you want to play with the GitHub data you can access it here.