Ruby Meets BigQuery: Part 1

This is part of the Playing with Data series.

As we’ve been improving support for Ruby on Google Cloud Platform one thing that comes up in our discussions is which gems are used frequently. We want to ensure that common gems install cleanly on the Ruby containers we provide. This means including libraries in the container image for many gems that have native extensions. While we were discussing this, the tester in me wondered, what are the most common 500 gems? Luckily there are multiple sources of data to figure this out.

Rubygems.org data

Rubygems.org provides sanitized weekly dumps of their PostgreSQL database. The dump includes download count by gem and version. The downloads table is called gem_downloads and here’s the schema:

Column Name Type
id integer
rubygem_id integer
version_id integer
count bigint

rubygem_id and version_id are foreign keys to other tables that contain the gem name, version string, and other pertinent information.

This isn’t a particularly large dataset but I wanted to put the imported data in a place where others on the team could easily run ad hoc queries. For me, the best place was BigQuery. BigQuery is a managed data warehouse that is part of Google Cloud Platform. It is astonishingly fast at querying large datasets and I can query it using standard SQL.

Getting the data into BigQuery

I used the example load script provided by rubygems.org to load the data into a local PostegreSQL database. Loading the data into BigQuery from there was pretty straightforward. For some of the tables I used the streaming method. Here is the code for loading the gems table.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
require 'pg'
require 'gcloud'

pg_conn = PG.connect dbname: 'rubygems'

gcloud   = Gcloud.new
bigquery = gcloud.bigquery
dataset  = bigquery.dataset "rubygems"

bq_table ||= bq_database.create_table("gems") do |s|
 s.integer "id"
 s.string "name"
 s.timestamp "created_at"
 s.timestamp "updated_at"
end

rubygems_cols = %w[id name created_at updated_at]

pg_conn.exec("SELECT * FROM rubygems") do |result|
  result.each do |row|
    data = Hash[rubygems_cols.zip(row.values)]
    bq_table.insert(data)
  end
end

Line 21 is a bit confusing. PostgreSQL returns each row as an array of values. BigQuery expects key value pairs. The zip and Hash[] transform the array of values into a Hash that will be sent to BigQuery in the correct format.

The rest of this example is pretty simple. First I require the pg and gcloud gems. Then I initialize a connection to BigQuery and PostgreSQL. After that, I create the BigQuery table if it doesn’t exist. Finally, I connect to PostgreSQL, extract the data, and insert into BigQuery.

For the rest of the tables in the rubygems.org data I used batch processing. I exported the tables to CSV and then loaded them directly from CSV into BigQuery using the UI. You can also load a CSV, json, or avro file into BigQuery using the gcloud gem.

Analyzing the rubygems.org data

Once the data is loaded into BigQuery analyzing it is as simple as writing SQL queries. My primary question was “Which gems have the most downloads?” This query gets the five most downloaded gems:

1
2
3
4
5
SELECT name, count
FROM [rubygems.downloads]
JOIN rubygems.gems ON rubygems.gems.id = rubygems.downloads.rubygem_id
ORDER BY count DESC 
LIMIT 5

Here are the results:

name count
rake 107,076,261
rack 100,955,906
multi_json 100,171,080
json 95,715,131
bundler 93,085,862

I had expected Rails to be the most downloaded gem but it was only number 14.

The gem_downloads table has both a gem_id and a version_id column. Downloads are counted for a specific version of a specific gem so the above query doesn’t give an accurate count. This query sums the counts for all versions of each gem.

1
2
3
4
5
6
SELECT name, sum(count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems ON rubygems.gems.id = rubygems.downloads.rubygem_id
GROUP BY name
ORDER BY total DESC
LIMIT 5

This yielded the same top five gems but with higher download counts.

name count
rake 214,152,212
rack 201,911,759
multi_json 200,342,260
json 191,430,173
bundler 186,172,479

I was also curious which testing library was most popular. It often feels like test library discussions are the editor wars of the ruby community. I’m quite fond of minitest but many folks still use rspec.

1
2
3
4
5
SELECT name, sum(count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems ON rubygems.gems.id = rubygems.downloads.rubygem_id
GROUP BY name
HAVING name IN ('minitest', 'rspec')
name total
minitest 101,151,246
rspec 77,293,803

One last thing the Google Cloud Ruby team was interested in was which versions of Rails are most popular. We all knew someone who was still using an old version of Rails 3. To answer that question I needed to get download counts for each version of Rails.

What we call version numbers, like 3.1.0, are actually strings. In order to get counts for each version I had to extract the major version part of the version string and ignore the minor and patch portions. I used a regular expression to grab everything before the first dot in the version string.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT 
 name,
 REGEXP_EXTRACT(number,r'(\d)\.') AS major,
 sum(rubygems.downloads.count) AS total
FROM [rubygems.versions]
JOIN rubygems.gems ON
 rubygems.gems.id =
 rubygems.versions.rubygem_id
JOIN rubygems.downloads ON
 rubygems.versions.rubygem_id =
 rubygems.downloads.rubygem_id
WHERE rubygems.gems.name = 'rails'
GROUP BY name, major
ORDER BY major
version downloads
0 2,890,350,351
1 2,064,535,965
2 3,991,436,199
3 16,378,651,989
4 12,662,487,252
5 963,450,117

This shows that Rails 3 is the most downloaded and Rails 4 has quite a few downloads as well.

Problems with measuring downloads

This data gave us a reasonable metric for which gems were the most popular. However downloads is not necessarily the best way to measure usage. One company with a lot of servers that they update frequently could skew the data pretty significantly if they download their gems from the web on each install. I wanted to find another way to measure gem usage.

While I was working on this blog post, and the talk that goes with it. I found out that Google released information on nearly 3 million open source repositories from GitHub as a BigQuery public dataset. The next post in this series uses that data to analyze gem popularity and compares how the results differ between the two sources.