Finding Your Soul Metric

Mathias Meyer's Gravatar Mathias Meyer,

Travis has a history of failure. While we’re not ashamed to admit that and talk about them in public, we’re also working hard to ensure that we notice early on if something breaks. We want to find out about it before our users do.

Over the last couple of months we’ve added a ton of metrics to improve insight into all parts of Travis. But during these months, we couldn’t find the single metric that would tell us if the system is doing what it’s supposed to do, which is of course to run your builds. It took us quite a while to finally figure out how to measure something that doesn’t happen. That caused me quite a headache, even though, in hindsight, it was so obvious.

Just last week, we had three unavailabilities due to networking issues on EC2 and two DDoS attacks on GitHub. Earlier this week, EC2 crapped out once again, taking part of our infrastructure with it.

There was a difference with all these outages. Even though there wasn’t much we could do about them, this time we were alerted about them well in advance. Fun fact: we found and added the metric that defines whether or not Travis is running properly. As embarrassing as they were for us, at least we knew about them when they were happening.

Enough blabber, what’s the metric already? After getting repeatedly notified by users that builds are not being run, it finally hit me one day. The simplest metric that we can track that shows us how the system is doing is the time since the last build started.

Yes, it turns out to be that simple. Turns out that this it the sole defining metric that helps us keep a watchful eye on Travis. Just to be safe, we added the time since the last build finished. As a general metric we’re also tracking the number of message currently ready to be consumed in RabbitMQ. These metrics give us a rough clue that something is failing in our infrastructure. Here’s a recent snapshot of what they look like when correlated:

Graph of Messages vs Build Started/Finished

Here’s a more recent graph that shows a database latency spike not long ago and queues backing up as a consequence. We’re still investigating why this happened, and we’ll move to a bigger database setup soon as well. We’re also spreading out the work load even more to keep up with it better.

Graph of Log Messages vs Log Updates

How can I find my soul metric?

Why did this take so long to figure out? The most common question I get from folks is what kinds of metrics they’re supposed to track for their application.

Unfortunately, there is no obvious answer here. If you have a simple web application, your most common nominator is the number of web requests, and the response time.

If you have an app that sends a lot of email, your metrics could be the amount of email being sent, time of an email spent in the delivery queue, time of network requests to remote email servers, file system utilization. Your core metric would probably still be the numbers of email sent. If email is being sent, the system is in a nominal state. If no email is sent, something is likely to be wrong. Add the number of emails waiting in the queue and you have data to correlate with.

Sometimes your core metric just isn’t obvious in the first place. For us, the metric only became obvious after we realized that what users would mostly report to us is that there haven’t been any builds running for several hours.

In Travis, every component has its own core metrics. For instance, we have a process that processes the build logs. From above, the most important metric is the number of log messages waiting to be processed. To be able to track this, we recently started collecting more detailed metrics for every single queue in RabbitMQ.

On Tuesday morning, when EC2 was still having issues, we noticed that log processing was way behind. Thankfully, we had a second metric in place to track the averages and 95th percentiles of ActiveRecord updates. Lo and behold, the 95th percentile for these went through the roof, up to 6 seconds for a single write to the database. Another fun fact: the metrics for single queues was just added the day before. In fact, I put them in production when the troubles started and our alerts immediately went off. Here’s something visual that shows the latency spiking and the number of messages queueing up in our logging queues as a result:

Graph of latency spike

Notice the latency going up (green graph) and messages piling up. Suddenly the latency went back to almost being normal, and the queues were slowly drained.

Finding your soul metric can take time. But in the end, it boils down to what matters most to your users. What’s most valuable to them can easily turn into exactly what you need to track to figure out if, from the user’s point of view, everything is still working.

Tell me about your monitoring and alerting?

Our infrastructure is still mostly based on what I wrote about earlier this year, with a few additions.

To track custom, external metrics like the ones from RabbitMQ or determined by database updates, we built a small monitoring app based on Celluloid. It polls data sources every 20 seconds and pushes them directly to Librato Metrics.

As these metrics are critical to running the platform, a resolution of 20 seconds is pretty good, though we might take them further down to 10 seconds in the future.

That little app also does threshold-based alerting to our Campfire room. I’m planning on adding support for Pagerduty and Prowl soon too.

travisbot messages

While there are existing tools for monitoring, metrics collection and alerting, there’s no shame in experimenting with your own, if you’re aware of the downsides. Monitoring mostly still sucks, to get better, more people need to play with more ideas.

If you want to chat monitoring and metrics, Travis or coffee, I’ll be in NYC next week for devopsdays.


Introducing Travis CI's next generation web client

Sven Fuchs's Gravatar Sven Fuchs,

If you think about Travis CI, the web client is the face of the project. You look at it a lot.

The “live” nature of our UI, updating the repository timeline as new builds come in, streaming the logs to the browser while a test suite is being run, updating job queue etc. has always been a great asset for Travis CI.

People just love it:

… and the “Mac OS X style desktop application” metaphor has proven to work quite well over time.

At the same time we are all aware of the issues that our client had over the last few months, especially since traffic on Travis CI has exploded.

We are now running about 10,000 builds a day for nearly 25,000 repositories and our old client could just not hold up with that. Especially in situations where our queues had run full and other, seemingly “unrelated” issues occured and the whole client just stalled on Chrome.

On top of the scalability issues our codebase made it pretty hard to fix things and add new features. Instead, one had to fight with really old library versions that could not be upgraded easily, a ton of work-arounds and even things like a slow asset compilation pipeline.

Even though the client as such was still great, people became increasingly unhappy with the issues it started to show. And rightfully so. We felt just the same!

So today we’re really happy to announce Travis CI’s next generation web client.

The next generation web client

The new Travis CI web client is a complete reimplementation of the proven feature set of the last version. Development started four months ago and a huge amount of work has been put into it over the last few months.

It is currently running on https://next.travis-ci.org and we will keep the old version running on http://travis-ci.org for a short while until the new one has proven under higher load and remaining issues are hashed out.

New features

With this new version our main goal was to reimplement the same, well known and battle tested functionality that the old client had, but make it based on the latest versions of ember and ember-data, improve its performance, overall stability and usability.

We are very happy that this goal has been achieved. The new client feels much more snappy and performs quite well even in situations where our queues see lots of traffic and start backing up.

Still, we also were able to put in some goodies and new features:

No more hashbang URLs, finally

We have never been very religious about this but there always were good arguments that made us want to move away from the hashbang URLs that Travis CI has started with (and frankly, worked great despite using them).

With the move to the new, shiny Ember router and the new API endpoint there was no reason to stick with them and we changed our URL scheme to plain, simple (and some would say “non-broken”) URLs:

URL scheme in location bar

Requeue builds button

This is probably one of the most requested features recently: a way to re-trigger a build through our UI. This will come in handy if you have a build that stalled or misbehaved for some reason and you want to re-run it:

Flash messages

If you use that feature you will also notice that we also finally have a generic way of displaying flash messages:

This yet has to be used for other interactive features such as turning service hooks on/off and synchronizing data from GitHub.

Design details

We have also put in a significant amount of attention to further iterate on the look and feel.

Especially when running on smaller screens the old client was not really great to use. We have now tweaked font sizes, widths and margins and made good use of css media queries to respond to different screen sizes gracefully.

Even though this can still be improved (especially for very small screens, we will happily accept pull requests) we believe it’s a huge improvement:

Other design touches

The profile page (now renamed to “Accounts”) got a complete overhaul:

If you’re into CSS then you might notice that the service hooks on/off switch is done entirely in CSS without using any images.

The somewhat weird behaving and flickering feature that displayed repository descriptions when mousing over a repository entry in the left sidebar has been replaced by an “Info” button which reveals all repository descriptions:

We have also ressurected the long buried selection indicator for the repository list on the left as well as the accounts list on your profile page. And one can now expand and collapse all workers in the right sidebar with one click:

And finally there’s a more generic way to display popups:

Coming to Travis CI Pro soon

We already prepared integrating this client for Travis CI Pro so you can expect it to be ready soonish.

Travis CI Pro is still in private beta, but if you are interested in trying it out then drop us an email to support@travis-ci.com.

Thank you, Ember.js!

The new client is built on top of the latest version of Ember.js, instead of the pre-alpha Ember version that actually still was called Sproutcore 2.0 at that time. It uses the much faster and lighter Ember Data for storage, instead of the old and very, very crufty Sproutcore Datastore. And it uses the new, shiny Ember Router, not our own port of the old Sproutcore router.

All the improvements that have gone into the underlying Ember libraries combined with the removal of hackish workarounds on our side make up for tremendous improvements not only in terms of speed and stability, but also ease of development and way more consistent code everywhere.

We would not have been able to ship this new client without the terrific help of the Ember.js core team, most importantly Yehuda Katz Tom Dale Peter Wagenet Paul Chavard, as well as Adam Hawkins who helped us come up with a solid and fast implementation of our asset compilation pipeline. Thank you so much, guys :)


Engine Yard Sponsors Piotr Sarnacki to Work on Travis

Mathias Meyer's Gravatar Mathias Meyer,

Engine Yard has always been a great supporter of open source, the Ruby ecosystem, and Travis too. So today we’re thrilled to announce that Piotr Sarnacki is being sponsored by Engine Yard to work on Travis CI.

Piotr is a Rails contributor, he’s made Rails engines awesome and, depending on the angle, he looks a bit like Nicolas Cage.

Engine Yard have been avid supporters of Travis already, having joined the ranks of sponsors during our Love campaign. Now they’re enabling us to push Travis further and bring more goodness to the open source community!

With the sponsorship, Piotr joins the Engine Yard OSS Grant Program.

Already, he’s built support for secure environment variables, allowing you to store important credentials in your .travis.yml securely.

Recently, he’s been putting a lot of great work into our new user interface, now fully based on Ember.js. Expect a blog post on it very soon!

Welcome to the team, Piotr. And thank you so much for the sponsorship, Engine Yard. We’re looking forward to seeing more goodness being shipped soon!