As we’ve been improving support for Ruby on Google Cloud Platform, one thing that keeps coming up in our discussions is which gems are used most frequently. We want to ensure that common gems install cleanly on the Ruby containers we provide, which means including libraries in the container image for the many gems that have native extensions. While we were discussing this, the tester in me wondered: what are the 500 most common gems? Luckily, there are multiple sources of data to figure this out.
Rubygems.org publishes a dump of its production PostgreSQL database. In the gem_downloads table, rubygem_id and version_id are foreign keys to other tables that contain the gem name, version string, and other pertinent information.
This isn’t a particularly large dataset but I wanted to put the imported data in a place where others on the team could easily run ad hoc queries. For me, the best place was BigQuery. BigQuery is a managed data warehouse that is part of Google Cloud Platform. It is astonishingly fast at querying large datasets and I can query it using standard SQL.
Getting the data into BigQuery
I used the example load script provided by rubygems.org to load the data into a local PostgreSQL database. Loading the data into BigQuery from there was pretty straightforward. For some of the tables I used the streaming method. Here is the code for loading the gems table:
```ruby
require 'pg'
require 'gcloud'

# Connect to the local PostgreSQL database loaded from the rubygems.org dump.
pg_conn = PG.connect dbname: 'rubygems'

# Connect to BigQuery and look up the target dataset.
gcloud = Gcloud.new
bigquery = gcloud.bigquery
dataset = bigquery.dataset "rubygems"

# Create the gems table with its schema.
bq_table = dataset.create_table("gems") do |s|
  s.integer "id"
  s.string "name"
  s.timestamp "created_at"
  s.timestamp "updated_at"
end

rubygems_cols = %w[id name created_at updated_at]

# Stream each PostgreSQL row into BigQuery.
pg_conn.exec("SELECT * FROM rubygems") do |result|
  result.each do |row|
    data = Hash[rubygems_cols.zip(row.values)]
    bq_table.insert(data)
  end
end
```
The line data = Hash[rubygems_cols.zip(row.values)] is a bit confusing. PostgreSQL returns each row as an array of values, while BigQuery expects key/value pairs. zip pairs each column name with its value, and Hash transforms the resulting array of pairs into a Hash that is sent to BigQuery in the correct format.
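To see that transformation in isolation, here is a minimal sketch with a hypothetical row (the column names match the gems table; the values are made up):

```ruby
rubygems_cols = %w[id name created_at updated_at]

# PostgreSQL hands back each row as an ordered array of values
# (hypothetical data for illustration):
row_values = ['1', 'rake', '2009-01-01 00:00:00', '2009-01-01 00:00:00']

# zip pairs each column name with its value, and Hash[] turns those
# pairs into the key/value structure that the insert call expects.
data = Hash[rubygems_cols.zip(row_values)]
data['name'] # => "rake"
```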
The rest of this example is pretty simple. First I require the pg and gcloud gems and open connections to PostgreSQL and BigQuery. Then I create the BigQuery table with its schema. Finally, I select the data from PostgreSQL and insert it, row by row, into BigQuery.
For the rest of the tables in the rubygems.org data I used batch loading. I exported the tables to CSV and then loaded them directly from CSV into BigQuery using the UI. You can also load a CSV, JSON, or Avro file into BigQuery using the bq command-line tool or the API.
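The export step can be sketched with Ruby's standard csv library; the rows below are hypothetical stand-ins for a PostgreSQL result set:

```ruby
require 'csv'

# Hypothetical rows, standing in for the output of a SELECT on PostgreSQL.
rows = [
  [1, 'rake', '2009-01-01 00:00:00'],
  [2, 'rails', '2009-01-02 00:00:00']
]

# Build a CSV string with a header row matching the BigQuery schema.
csv_data = CSV.generate do |csv|
  csv << %w[id name created_at]
  rows.each { |row| csv << row }
end

File.write('gems.csv', csv_data) # ready for a batch load into BigQuery
```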
Analyzing the rubygems.org data
Once the data is loaded into BigQuery, analyzing it is as simple as writing SQL queries. My primary question was “Which gems have the most downloads?” This query gets the five most downloaded gems:
```sql
SELECT name, count
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
ORDER BY count DESC
LIMIT 5
```
Here are the results:
I had expected Rails to be the most downloaded gem, but it was only number 14.
The gem_downloads table has both a rubygem_id and a version_id column. Downloads are counted for a specific version of a specific gem, so the query above doesn’t give an accurate total per gem. This query sums the counts across all versions of each gem:
```sql
SELECT name, SUM(count) AS total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
GROUP BY name
ORDER BY total DESC
LIMIT 5
```
This yielded the same top five gems but with higher download counts.
I was also curious which testing library is most popular. It often feels like test library discussions are the editor wars of the Ruby community. I’m quite fond of minitest, but many folks still use rspec.
```sql
SELECT name, SUM(count) AS total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
GROUP BY name
HAVING name IN ('minitest', 'rspec')
```
One last thing the Google Cloud Ruby team was interested in was which versions of Rails are most popular. We all know someone who is still running an old Rails 3 app. To answer that question I needed download counts for each version of Rails.
What we call version numbers, like 3.1.0, are actually strings. To get counts per major version, I had to extract the major version part of the version string and ignore the minor and patch portions. I used a regular expression to grab the digits before the first dot in the version string.
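The same extraction is easy to check in plain Ruby; the version strings below are illustrative:

```ruby
# Capture the digits before the first dot, i.e. the major version.
MAJOR = /\A(\d+)\./

versions = ['3.1.0', '4.2.7.1', '5.0.0.beta1'] # illustrative version strings
majors = versions.map { |v| v[MAJOR, 1] }
# majors => ["3", "4", "5"]
```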
```sql
SELECT name,
       REGEXP_EXTRACT(number, r'(\d+)\.') AS major,
       SUM(rubygems.downloads.count) AS total
FROM [rubygems.versions]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.versions.rubygem_id
JOIN rubygems.downloads
  ON rubygems.versions.id = rubygems.downloads.version_id
WHERE rubygems.gems.name = 'rails'
GROUP BY name, major
ORDER BY major
```
This shows that Rails 3 is the most downloaded and Rails 4 has quite a few downloads as well.
Problems with measuring downloads
This data gave us a reasonable metric for which gems are the most popular. However, downloads are not necessarily the best measure of usage. One company with a lot of servers that it updates frequently could skew the data significantly if it downloads its gems from the web on each install. I wanted to find another way to measure gem usage.
While I was working on this blog post and the talk that goes with it, I found out that Google released information on nearly 3 million open source repositories from GitHub as a BigQuery public dataset. The next post in this series uses that data to analyze gem popularity and compares how the results differ between the two sources.