Post-Mortem: Pull Request Unavailability

Mathias Meyer

As I’m sure you’ve noticed, it hasn’t been a great week for our availability.

First up, we’re sorry for having yet another outage to report. But we’d like to keep you in the loop about what happened, why it happened, and what we’re going to do to make sure it doesn’t affect us like this again.

What happened?

Last Wednesday, in the European afternoon, we started seeing an increased number of errors from the GitHub API. We kept getting 403 errors from API endpoints that had worked perfectly well before. The errors went away after a while. We were left scratching our heads about this sudden anomaly, but things were apparently running smoothly again after that.

On Thursday, things suddenly took a turn for the worse. Requests to the GitHub API kept stalling incoming build requests entirely, and we’d see an abundance of 403 errors in our logs again. So we continued investigating.

At 8 pm CEST, things came to a grinding halt, and no build requests were processed at all. We started investigating the cause and hit a lot of dead ends along the way.

The main component that schedules and distributes builds kept stalling on us, no matter how we looked at it. Running on Heroku, we unfortunately didn’t have the means to get a JVM stack trace of that component. The component runs on JRuby, and Charles Nutter was very helpful and pointed us in several directions.

First up, we saw timeouts in SSL connections, specifically only when fetching metadata about a pull request. So we kept investigating in this direction and tried to find out why our internal timeouts didn’t fire.

We ended up putting this component on external cloud capacity that we have available thanks to Bluebox (why we have this capacity is a story to be told in a separate post). We deployed the component on JRuby 1.7.0-preview2 on a server we have physical access to.

So we were able to debug the issue further by inspecting thread dumps and profiling the JVM using VisualVM. We still kept running into the same issue, timeouts not firing, but at least we were eventually able to reproduce the timeouts more reliably.

At around 2:30 am CEST, things were looking up, and we left the component running on its own on the new server. Queue processing seemed to go smoothly until the next morning, when we started looking back into things to clean up what was left over from the night before.

We still noticed an unusual number of 403 errors in our logs. Thanks to Wynn from GitHub, we finally added code that allowed us to track which API endpoints were being hit and how our rate limit developed over time. That data is now always attached to errors when we access the API, so we’ll have a record when things go wrong.

We were still seeing stalling requests for pull requests, but we reduced the timeouts so that they didn’t affect the other build requests for too long. Having these details in place, we realized that we kept hitting the GitHub API unauthenticated and therefore ran into the default, IP-based rate limit of 6000 requests very quickly.

We added longer intervals between requests so we didn’t hit the API as often, and started using our OAuth client credentials, so that all requests count against our application’s rate limit instead of the IP-based one.
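At the time, GitHub’s API let OAuth applications pass their client credentials along with each request so it counted against the application’s rate limit. A minimal sketch of the idea; the helper name and credentials here are illustrative, not from our actual code:

```ruby
require 'uri'

# Hypothetical helper: attach our OAuth application's client credentials
# to an API call so it counts against the app's rate limit instead of
# the IP-based one.
def github_api_uri(path, client_id:, client_secret:)
  uri = URI("https://api.github.com#{path}")
  uri.query = URI.encode_www_form(client_id: client_id, client_secret: client_secret)
  uri
end

uri = github_api_uri("/repos/travis-ci/travis-ci/pulls/1",
                     client_id: "abc123", client_secret: "s3cret")
```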

Things went back to normal again for a while, until Friday night. The timeouts kept getting worse and worse until eventually, all requests that hit the pull requests API were timing out on our end.

We finally pulled the plug at 2:00 am CEST on Saturday morning and disabled pull request testing entirely, until we’d found out why we suddenly kept seeing these timeouts. Things were then running smoothly until Monday morning, when that component ran out of memory. We doubled the available amount and things have been running smoothly since.

After investigating more, working together with the GitHub team, we found the root cause of why things were stalling.

Why did it happen?

When we get a pull request, we hit the GitHub API, an undocumented part no less, to determine the merge state of a pull request. We only test pull requests that are mergeable, so when a pull request is green to merge on GitHub, we will test it as well. The reason is simple: we can’t automatically test a pull request when we already know that merging it would fail and require manual interaction.

To determine the status, we continuously poll the GitHub API. Until last week’s events, we hit the API every 0.1 seconds. Usually, an updated status wouldn’t take longer than a second.

Last week, that part suddenly took longer than 10 minutes, which triggered our global timeout, but also meant that we could basically just accept six build requests per hour. Processing requests is currently a serial process, working through one request at a time.

You can start doing the math on this now. For 10 minutes, we’d hit the GitHub API at 0.1-second intervals. If requests take 100 ms, we’re at 3000 requests in just ten minutes. Two pull requests and our rate limit is gone. That meant 403 errors everywhere for the rest of the hour. After that, things started from scratch, but we ran into the wall again pretty quickly.
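The arithmetic above can be checked directly, using the numbers from this post:

```ruby
# Numbers from the post: poll every 0.1 s, each request takes ~100 ms,
# global timeout of 10 minutes, IP-based limit of 6000 requests per hour.
poll_interval = 0.1      # seconds between polls
request_time  = 0.1      # ~100 ms per API request
timeout       = 10 * 60  # 10-minute global timeout, in seconds
rate_limit    = 6000     # unauthenticated, IP-based hourly limit

# Each poll cycle costs interval + request time, so a single stalled
# pull request burns through 3000 requests before the timeout fires.
requests_per_stall = (timeout / (poll_interval + request_time)).round

# Just two stalled pull requests exhaust the entire hourly limit.
stalls_until_limit = rate_limit / requests_per_stall
```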

As it happens, we had two timeouts in place: one of 10 minutes, implemented in gh, our own GitHub API client library, and another set to 60 seconds in the aforementioned component. We were quite baffled why the smaller timeout that wrapped the bigger one didn’t trigger. Turns out, this is an oddity in JRuby’s implementation of the Timeout class.

On JRuby, if a nested timeout has a longer timeout value than the outer one wrapping it, the nested timeout takes precedence. This is not the case on MRI, but it happens reproducibly on JRuby, both 1.6.7 and 1.7.0-preview2. This is why pull requests kept stalling things for so long. The GitHub status of a pull request never changed into a state we could handle, and we had a very long timeout wrapped by a shorter one, with the longer timeout taking precedence and causing things to stall for 10 minutes at a time.
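A stripped-down reproduction of the oddity, scaled down to fractions of a second: on MRI the shorter outer timeout fires first; on the JRuby versions mentioned, the longer inner one wins and the call stalls for its full duration.

```ruby
require 'timeout'

# Outer timeout is shorter than the inner one, mirroring our 60 s
# component timeout wrapping gh's 10-minute timeout.
def nested_call(outer: 0.2, inner: 2.0)
  started = Time.now
  begin
    Timeout.timeout(outer) do
      Timeout.timeout(inner) do
        sleep 10 # stands in for a stalled API request
      end
    end
  rescue Timeout::Error
    Time.now - started # how long we actually waited
  end
end

elapsed = nested_call
# On MRI: elapsed is roughly 0.2 (the outer timeout wins).
# On JRuby 1.6.7 / 1.7.0-preview2: roughly 2.0 (the inner timeout wins).
```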

Meanwhile, we hit the API so often, we ran into our rate limit immediately.

The initial steps we took were to lower the 10-minute timeout to just 180 seconds and to increase the intervals between polls of the GitHub API. We also use our OAuth credentials with every request to avoid running into the IP-based rate limit.

We also added a feature flip along the way that allows us to enable and disable pull request builds quickly, should this situation come up again in the future.
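A feature flip can be as simple as a flag checked before the pull request code path runs. Here is a minimal in-memory sketch; the real implementation is presumably backed by a shared store so every process sees the same state, which is our assumption, not something stated above:

```ruby
# Minimal in-memory feature flip. A production version would persist
# flags in a shared store (e.g. Redis) so all processes agree.
class Features
  def initialize
    @flags = Hash.new(false)
  end

  def enable(name)
    @flags[name] = true
  end

  def disable(name)
    @flags[name] = false
  end

  def enabled?(name)
    @flags[name]
  end
end

features = Features.new
features.enable(:pull_requests)

if features.enabled?(:pull_requests)
  # schedule the pull request build
end
```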

Coming to the root cause of all of this: it was an unlikely set of circumstances that hit at the most unexpected time, as they always do. An increased processing delay in an internal part of the GitHub API brought up timeout issues that in turn brought things to a halt, because we only process one build request at a time. A cascading failure throughout the Travis system, if you will.

No matter how you look at it, this was bound to happen eventually, and we’re very sorry that it did, and that we didn’t notice the dangers in this part of Travis sooner.

You can’t simply rely on an external API to always work and to respond the way you expect, in a timely manner too. APIs break every now and then, become temporarily unavailable, and can change in unexpected ways. We’re at fault for not considering the implications of this sooner.

We want to make sure this doesn’t happen again, or at least make sure that increased API delays don’t affect the overall system.

What are we going to do about it?

While GitHub has fixed the issue on their end (thank you, guys!), we’re still left with things to clean up to make sure a problem like this doesn’t cascade through the entire Travis system again.

Apart from the decreased timeouts, increased intervals between polls, and OAuth credentials to reduce the likelihood of running into rate limits, we’re adding exponential back-off to the gh code base, increasing the interval with every request we have to make for a single pull request.
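Exponential back-off here means multiplying the polling interval on every retry for a given pull request, so a stalled one is checked less and less often instead of hammering the API. A sketch of what such a schedule could look like; the numbers and method name are illustrative, not gh’s actual code:

```ruby
# Illustrative back-off schedule: start at 1 s, double on every poll,
# cap at 60 s so a slow pull request is still checked about once a minute.
def backoff_intervals(initial: 1.0, factor: 2, max: 60.0, attempts: 8)
  interval = initial
  attempts.times.map do
    current  = interval
    interval = [interval * factor, max].min
    current
  end
end

backoff_intervals
# => [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```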

Work has also started on pulling build request processing out of the hub component, which currently carries a lot of responsibility. It’s our priority to remove as much of that responsibility from its runtime as possible and break it out into smaller apps.

A while ago we wrote about our issues with log updates and the app we broke out of the hub to deal with that. We’re going to continue that trend, and build request processing is the next step.

That component will allow parallel processing of build requests and will most likely separate handling of pull requests from normal pushes, as both use very different API endpoints.

To make sure that things don’t just continue running when we see an increase in API errors, we’re going to track the errors in total and trip a circuit when they reach a threshold.
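A simple error-counting circuit breaker captures the idea: count failures, and once they reach a threshold, skip calls instead of piling up timeouts. This is a sketch of the pattern, not Travis’s actual code:

```ruby
# Trips after a threshold of failures; while open, calls are rejected
# immediately instead of stacking up more timeouts. A production breaker
# would also reset automatically after a cool-down period.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 5)
    @threshold = threshold
    @failures  = 0
  end

  def open?
    @failures >= @threshold
  end

  def call
    raise OpenError, "circuit open" if open?
    begin
      result = yield
      @failures = 0 # any success resets the count
      result
    rescue
      @failures += 1
      raise
    end
  end
end
```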

We’ll also make sure that failures hitting the API are retried later so that we don’t just give up on a single build. Your builds are important to us, and we want to make sure we run every single one of them!

The new component uses Celluloid under the hood, so we get benefits like thread supervision, timeouts, exit handling of linked threads, and very simple message passing to hook actors together.

We have these things lined up and will actually start rolling them out rather soon, but unfortunately that doesn’t make up for the lost builds. We’re sorry about them; we’d have liked to make sure they at least stuck around until the issues were resolved, but in this case the issue was too big. Catching up on everything in our queues after a weekend would’ve taken a very long time. That’s why we’re looking into alternatives to VirtualBox that allow us to be a lot more flexible when it comes to running builds.

We’ll be enabling pull request testing again first thing in the morning, so we can keep a close eye on stability over the day.

Love, the Travis team!


An Update on the Site's Availability

Mathias Meyer

As I’m sure you’ve noticed, Travis CI wasn’t doing its best recently. The site kept being unavailable at irregular intervals. We’re sorry for these outages, and we want to keep you in the loop about what happened and what we’ve done about it. Or rather, what we’re still doing about it.

The only part really affected by the outages was the web site. The rest of the app hummed along rather nicely, so we could already isolate the problem to that part of the app.

What happened?

While we don’t have enough monitoring data available yet for our PostgreSQL instance, the fine folks at Heroku gave us enough clues to figure out that the database is part of the problem.

The Travis CI database now exceeds 40 GB of data. That doesn’t sound like a lot, but when running on a comparatively small PostgreSQL/EC2 instance, things start to break.

We now run around 10,000 builds per day for more than 20,000 active open source projects. That’s a pretty staggering number, and we’re seeing the effects of the size Travis has grown to left and right. Which is to say: things break much more easily right now, and we’re working hard on improving these things, breaking out smaller apps, loosening up responsibilities, and simplifying.

Long story short, on Tuesday we took down the database and upgraded to the next biggest instance, giving us more memory and a few more cores. The downtime took longer than expected, because PostgreSQL took longer to restore from the archives than we thought.

After we started things back up slowly, the site was immediately hit with unavailability again, leaving requests hanging for very long.

This is the most embarrassing but at the same time rather crazy part: Travis has been running off three Unicorn processes until now. I know, right?

We made the rookie mistake of not properly looking at the request queue. Granted, we didn’t see anything in our New Relic metrics that would’ve pointed to the problem. But that’s no excuse. We increased the number of dynos serving the site, et voilà. Speed!

The site is now pushing a larger amount of traffic while the database response times remain pretty much unchanged. We’ve seen speedups in some parts, but overall performance remains the same; we can just serve a larger number of requests more reliably.

It might not have been the last database upgrade though. The Heroku Data team has suggested we upgrade to at least a 64-bit instance, and at some point that’s what we’ll be doing.

The parts of Travis that didn’t touch the database hummed along nicely too, by the way. We still accepted commits and pull requests from GitHub, queued them, and worked through the backlog once things normalized.

We’ll work on getting more metrics out of PostgreSQL and into our dashboards, in particular cache hits and slow queries, so we can keep a much better eye on this side of the app.

What next?

We’re still working on several issues, now that this is out of the way. A big part of the traffic Travis is serving is the build status images.

In the future, these images are more likely to be served by a combination of cached build statuses and a CDN behind it.
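Serving badges through a CDN mostly comes down to attaching cache headers the CDN can honor; a sketch of a Rack-style response under that assumption (the method name, header values, and body are our illustration, not the actual implementation):

```ruby
# Hypothetical badge response: a short max-age lets a CDN absorb most
# of the badge traffic while statuses still refresh within a minute.
def badge_response(status)
  headers = {
    "Content-Type"  => "image/png",
    "Cache-Control" => "public, max-age=60" # CDN may cache for a minute
  }
  [200, headers, ["status-#{status}.png"]]
end
```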

The current frontend is a bit wonky. Folks using Chrome have been noticing that in particular, and we’re sorry for that. Sven and Piotr have been working on a new client that puts these issues to rest and makes for a much more resilient and speedy user interface overall. You can try it out yourself. It’s still a work in progress, and you can contribute or open issues on the repository.

One part of the problems with the user interface could be seen when there were lots of builds waiting in the queue. That dragged the performance down quite significantly, so we had to remove some parts that made things really slow. We’re looking into ways of bringing them back, but we had to make a call to improve the overall performance, at least until we find ways to speed things up.

Meanwhile, Konstantin has been working on breaking out the API for Travis. The new client will talk exclusively to this new API.

We have other parts where we’re actively working on getting things more stable. VirtualBox is another one of them, but that deserves an entire post all on its own.

We’ll keep you posted when more bad or good things happen!


Auto-Tailing Logs

Sven Fuchs

From the very beginning one of Travis CI’s most outstanding features was streaming logs from test runs that happen on workers to the client application which runs in your browser. It can be just mesmerizing to watch your tests run live on Travis CI:

Many repositories use a short one-dot-per-test style of output, so watching the logs while the tests run always worked great. But many other repositories use much more verbose log output, so watching their logs live requires continuously scrolling to the bottom of the page as new output comes in.

Thus, one of the most requested features has been a switch that would make the page stick to the bottom of the log file.

Today we have shipped this feature. It still needs some polishing but it should do its job so we thought we shouldn’t hold it back. Watch the demo video from our feature pull request:

… or simply head over to Travis CI and try it out.

Of course this feature is also available on Travis CI Pro starting today.

Cheers!