A large part of a normal build is spent installing dependencies, be it
Ubuntu packages or simply running bundle install or npm install. On an average
Rails project, this alone can unfortunately take a few minutes.
This is dreadful to watch, and it takes precious time that would be better spent
running the build. We’ve seen some great initiatives from customers to cache
bundles, like bundle_cache
(inspired by our friends at Kisko Labs) and
WAD from the fine folks at
Fingertips.
But in the end, it became clear that we needed to offer something that’s built-in,
automatically caching our customers’ bundles without requiring major changes to
their build configuration.
Today, we’re officially announcing built-in caching for dependencies.
While it integrates nicely with Bundler already, you can also use it to cache
Node packages, Composer directories, or Python PIP installations. Heck, you can
even use it to preserve the asset pipeline’s temporary test cache, speeding up
builds even more.
Plus, caching an entire bundle of dependencies has the benefit of reducing the
impact of outages of dependency mirrors.
To give you an idea of the impact of this simple yet efficient way of speeding
up builds, we’ve had customers report saving a total of half an hour or more
on their builds with bigger build matrices.
On our billing app, it shaved off four minutes, cutting the build time down to
one minute.
How can you enable it for your private projects?
If you’d like us to cache your Bundler directory, simply add the following to
your .travis.yml:
cache: bundler
If you want to add the asset pipeline’s compilation cache, you can specify the
directories to cache as well:
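(A sketch, assuming the compiled assets live in the Rails default location of
tmp/cache/assets and that the bundler shortcut can be combined with a list of
extra directories.)
cache:
  bundler: true          # keep the built-in Bundler caching enabled
  directories:
  - tmp/cache/assets     # assumed Rails asset pipeline cache path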
For a Node.js project, simply specify the node_modules directory:
cache:
  directories:
  - node_modules
The specified paths are relative to the build directory.
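The same goes for the other dependency directories mentioned above. Here is a
sketch for a project that also wants to keep Composer packages and a local pip
cache around; the vendor and .pip-cache paths are only illustrative, not
prescribed defaults:
cache:
  directories:
  - vendor               # e.g. Composer packages installed into ./vendor
  - .pip-cache           # e.g. a pip cache kept inside the build directory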
Cache Rules Everything Around Me
We’re working on getting more language-specific caching methods included, and
on getting common dependency mirrors closer to our infrastructure,
as we already did with a network-local APT
cache.
Bundler 1.5 has some great features coming up, including Gem Source
Mirrors, which
we’re looking into utilizing to make installing RubyGems more reliable.
There are many ways to stay up-to-date with your projects’ build
results; you can visit the web site, or
write your own client by using our API.
We have recently added another.
You can now subscribe to the Atom feed with your favorite
news reader!
How to subscribe
To subscribe to the feed, point your
favorite Atom Reader to the Atom feed URL:
Replace travis-ci and travis-core with the owner and name of the repository
you would like to subscribe to.
Alternatively, you can send the HTTP Accept: application/atom+xml header to
the above URL, with or without the .atom extension. (Without the extension and
without this header, you will get the JSON-formatted data instead.)
Just in time, in the weeks leading up to these announcements, we finished some
long overdue upgrades to our own infrastructure, bringing Travis CI up to date
with the latest version of PostgreSQL.
For the longest time, Travis CI ran off a single PostgreSQL instance, which made
it a challenge for us to scale both up and out.
Due to unfortunate timing, we’d been running on a 32-bit server with 1.7 GB of
memory.
This limited our upgrade options, as we couldn’t just bring up a new follower in
64-bit mode based on the archived data of a 32-bit machine. We had to do a full
export and import as part of the migration. This was one of the reasons why we
held off on this upgrade for so long. Initial estimates pointed to a downtime of
almost an entire day.
Amazingly, this single box held up quite nicely, but the load on it, mostly due
to several hundred writes per second, bogged down the user experience
significantly, making for slow API calls and sluggish responses in the user
interface. We’d like to apologize for this bad experience; these upgrades were
long overdue.
First, the good news: the upgrades brought a significant speed boost. We moved
to a bigger server, a 7.5 GB instance, and upgraded to the latest PostgreSQL
version, 9.3.1.
Here’s a little preview (and an unexpected cameo appearance) of the results:
This graph shows the accumulated response time of all our important API
endpoints at the 95th percentile.
Just in time for our planned upgrades, Heroku shipped their new Heroku Postgres
2.0, with
some new features relevant to our interests.
We had one major problem to solve before we could approach the upgrade. Most of
the data we carry around are build logs. For open source projects, as much as
136 GB of the 160 GB total database size was attributed to build logs.
Due to their nature, build logs are only interesting for a limited amount of
time. We implemented features a while ago to continuously archive build logs
and upload them to S3.
But before the upgrade, we had to make doubly sure that everything was uploaded
and purged properly, as we’d abandon the database afterwards, starting with a
clean slate for build logs.
Once this was done, we migrated the logs database first. It only consists of
two tables, and with all logs purged, only a little data remained to be
exported and imported.
All migrations took the better part of four hours each, an unfortunate but
urgently needed service disruption. We kept the maintenance windows to the
weekends as much as we could to reduce the overall impact.
The Results
We were pretty surprised by the results, to say the least.
Let’s look at a graph showing API response times during the week when we
migrated two databases for travis-ci.org.
The first step happened on Wednesday, November 13. We moved build logs out of
the main database and into a set of Tengu instances with high availability.
You can see the downtime and the improvements in the graph above. Overall time
spent waiting for the database in our API went down significantly after we
upgraded to a bigger instance, most notably after the second set of red lines, which
marks the migration of our main data to a bigger setup. A great result.
Here’s a graph of the most popular queries in our systems:
What you’ll also notice is that after the first migration, the averages are
smoother than before, less spiky. Overall times initially didn’t change much,
but this first step removed a big load from our main database.
After the second migration, the times for the most popular queries went down
dramatically, almost dropping to the floor.
Here’s the breakdown of what was previously the most expensive query in our
system:
Average call time went pretty much flat after the upgrade, down from a very
spiky call time with lots of variation, which suggested much more expensive
queries in the higher percentiles.
We can attribute that to much better cache utilization. The cache hit rate for
indexes went up from 0.85 to 0.999, and the same goes for table hits, which are
now at 0.999 as well.
Thanks to much more detailed statistics and
Datascope, we now get much better insight
into what our database is up to, so we can tune more queries for more speedups.
Unexpected Benefits
PostgreSQL 9.3 brought a significant feature: the ability to fetch data
directly from indexes, also known as index-only scans.
We saw the impact of this immediately. The scheduling code that searches for
and schedules runnable jobs has been a problem child as of late, mostly in
terms of speed.
During peak hours, running a full schedule took dozens of seconds, sometimes
even minutes. We analyzed the queries involved, and a lot of time was spent
fetching data with what PostgreSQL calls a heap scan.
This turned out to be very expensive, and the lack of caching memory added the
potential for lots of random disk seeks.
With PostgreSQL 9.3, a single scheduling run, which used to take up to 600
seconds, now takes less than one, even during busy hours.
This great combination of useful new features and lots more available cache
memory gave us some unexpected legroom for build scheduling. We still need to
tackle it, but with less urgency now.
Unexpected Problems
After the migration of travis-ci.com, we noticed
something curious: build logs for new jobs wouldn’t load at all.
We quickly noticed a missing preparation step in the migration: checking for
any joins that could be affected by having two separate databases.
The query behaviour on travis-ci.com is slightly
different compared to the open source version. We explicitly check access
permissions to protect private repositories from unauthorized access.
This check broke for logs, as the permissions check is currently joined into
the query. As we’re working on doing more explicit permissions checks rather
than joining in the permissions, it worked out well to fix this minor issue on
the spot and remind ourselves that we need to tackle the overarching issue soon.
Travis CI is now running on a total of eight database servers: four to store
build logs, two for each platform, and two Premium
Ika
instances serving the main data for the API, job scheduling, and all the other
components.
While eight may sound like a lot, four of those are failover followers.
We hope you enjoy a faster Travis CI experience. We still have lots more tuning
to do, but even for bigger projects like Mozilla’s Gaia, with currently more
than 21,000 builds, the build history loads almost instantly.
If you haven’t already, check out Heroku’s new
Postgres. We’ve been happy users, and we’re very
lucky to have such great people on their team to help us out.