Improved GitHub Sync

Josh Kalderimis

Travis and GitHub love each other. From login to service hooks to marking pull requests as passing or failing, seamless GitHub integration is what makes the Travis experience so awesome.

But sometimes we are left playing catch-up when it comes to permission syncing and repository listings. Having to hit the ‘sync now’ button on your profile page just so Travis knows about your new repositories is not uncommon, and it’s easy to forget.

Today everything got a bit simpler for both travis-ci.org and travis-ci.com users with the introduction of DAILY syncs with GitHub!

Using Sidekiq, an awesome background processing library from Mike Perham, we schedule syncs for all users over a 24-hour period. Across Travis for open source and Travis for private projects we sync around 1,400 users per hour (around 34,000 users in total), with a typical user sync (user info, orgs, repositories, and memberships/permissions) taking anywhere from 30 seconds to several minutes, depending on how many orgs and repositories the user has access to.

We are not done with sync yet. We still have a lot to improve, and we are working on making sure that what you see on GitHub is also what you see on Travis.

Lots of jumping High-5’s

The Travis Team


Introducing Addons

Henrik Hodne

If you’ve ever tried to run browser tests against multiple browsers, you know how much of a pain it can be. Recently we pushed two changes that should make browser testing on Travis a bit easier.

If you’re testing with Firefox, you may require a specific Firefox version. We preinstall a fairly recent version on our VMs, but if you require an older one, we’ve added a config setting that will download and install the version you ask for. If you want, say, Firefox 17.0, all you need to do is add this to your .travis.yml file:

addons:
  firefox: "17.0"

This will download the Firefox 17.0 binaries from Mozilla and link the binary to /usr/local/bin/firefox. We download this straight from Mozilla’s servers, so every version back to 0.8 is available.

If you need to test against more than one version or multiple browsers, then the awesome people over at Sauce Labs may have just what you need. If you’re doing Selenium testing, they have infrastructure set up to help you spin up one of their many, many browsers. They even have mobile browsers to use. The best part? If your project is open source, their Open Sauce plan is completely free.

Part of their setup is Sauce Connect, which establishes a tunnel between their servers and yours so they can access your web server. We want testing to be as easy as possible, so we have added first-class support for Sauce Connect right in your .travis.yml. To enable it, all you need to add is this:

addons:
  sauce_connect:
    username: your_username
    access_key: your_sauce_access_key

Both the username and access_key fields support encrypted values, so you don’t have to commit your credentials to your repository in plaintext. Read more about this in the addons docs.
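To give a rough idea, an encrypted access key ends up in your .travis.yml as a nested secure entry, along these lines (the encrypted string is the output of the travis command line client’s travis encrypt command for your own key; see the addons docs for the exact steps):

addons:
  sauce_connect:
    username: your_username
    access_key:
      secure: "...encrypted string from travis encrypt..."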

This is available on both Pro and Org VMs today. Go try it out!


Post Mortem: Recent Build Infrastructure Issues

Mathias Meyer

Last week we had quite a bit of trouble with our build infrastructure, with our API, and with parsing .travis.yml files, causing significant downtime and spurious errors on both our open source and private project platforms. We’re sorry our users and customers had to wait several hours for their builds to run. While this didn’t affect customers still running on our previous build setup, it still affected a lot of you.

I wanted to take some time to explain what happened and what we did to fix it.

What happened?

Last Monday, April 15, around 20:15 UTC, we received customer reports that builds for their private projects weren’t running properly. We looked into it right away and found that creating new VMs to run the tests on was failing for the majority of requests.

We found that the VMs were having significant problems acquiring IP addresses on their respective networks: the underlying IPv4 addresses weren’t being reclaimed by the system fast enough to keep up with the rate at which we create VMs. A set of blocks runs on a common subnet, which has a maximum of only 254 IP addresses available, and commonly fewer than that, as it’s shared infrastructure.

We create around 30,000 VMs per day, the majority of them on travis-ci.org, where they’re much more short-lived. So when the infrastructure can’t reclaim addresses fast enough and there isn’t enough capacity available, creating new VMs fails.

We tried a few mitigations, including reducing the number of concurrent VMs we create, but nothing seemed to help. Unfortunately, we failed to realize the significance of these problems and kept trying smaller fixes to get builds running again.

It took us a while to recognize that we needed to take all build capacity offline to allow our infrastructure to recover: a large backlog of VM creation jobs remained queued, waiting to be run, delaying any possible recovery.

We fully stopped all processes that were continuing to request new VMs and let the queues drain.

After about five hours of trying fixes without any success, we finally made the call to move a majority of the infrastructure to different subnets. At around 2:00 UTC, the move was done, and we brought most build capacity back online.

VM deployments started running again without any issues, and after half an hour of tailing the logs and seeing no further errors, we called it a night and went to sleep.

Unfortunately, in the meantime, issues started popping up on travis-ci.org. Earlier that night we had deployed a change that would allow us to create VMs on an IPv6-only network, a change meant to help reduce the problems we were seeing.

Not all of our infrastructure was fully prepared to run on IPv6 networks yet, however, so these VMs started failing as well, while the backlog of build requests kept growing.

Some hosts didn’t have IPv6 addresses yet; assigning them one fixed this particular issue, and we could let builds run again.

There was a bug in our code that caused requeued builds (we requeue a build when we fail to allocate a VM for it) to flood our system, which hindered any new builds from being scheduled. To fix this we increased the number of parallel processes handling these particular messages, which in turn led to a lot of builds never visibly finishing, even though the underlying jobs had finished: a race condition caused the build status not to be propagated properly. We’re still investigating a proper fix for this particular issue.

Once these issues had their temporary band-aids in place, builds were running again on travis-ci.org.

Meanwhile, on Tuesday afternoon, a lot of API requests hitting api.travis-ci.org started timing out. A few database queries were causing significant delays, most notably the query fetching the currently running and queued jobs for the right-hand sidebar. We had to remove the sidebar element temporarily to make sure all other requests were handled in time. We identified a few more slow-running queries and added indexes to speed them up. We have yet to add the sidebar back, but it’s not forgotten.

On Wednesday night at around 22:30 UTC, we noticed VMs failing again on travis-ci.org and decided right away to reduce the available build capacity to make sure that at least some builds could run without failing. This unfortunately meant that a lot of builds queued up during our busiest hours of the day, but it was a call we had to make at the time.

We made the call to put all our effort into moving our entire build setup to IPv6 only, as the address space is significantly larger and the speed of reclaiming IP addresses matters a lot less.

We updated our code to request VMs with IPv6 addresses only, upgraded the underlying VM images and tested the code successfully on our staging systems within a few hours. In the afternoon, we were ready to deploy the change when we got reports of builds running on the wrong language platforms.

We suspected issues fetching or parsing the .travis.yml files and started digging through the logs. We initially had trouble locating the problem, as our logs didn’t show any errors. After investigating further, we found that requests fetching the .travis.yml were failing with an unusual HTTP status that our code wasn’t yet prepared to handle.

We prepared a fix in the library we use to talk to the GitHub API and deployed it; within two hours the fix was out and we could start builds again. Unfortunately, builds that had been handled in the meantime couldn’t simply be restarted, as we currently don’t fetch the config again when a build is restarted. These builds shouldn’t have made it into the system in the first place: had we treated this particular status code as an actual error, the build request would have been requeued and retried later.

After we confirmed that builds were running again and the correct configuration settings were being used, we deployed the change that switched travis-ci.org to IPv6 only later that night. Meanwhile, we moved another set of hosts on travis-ci.com to new subnets with more IPv4 addresses available, until we could confidently deploy the IPv6 change there as well.

Both platforms run on entirely separate build setups, which is also true for the underlying build infrastructure. We wanted to make the switch on travis-ci.com only once we were confident that the IPv6-only setup worked well on the open source platform.

The IPv6 change was successful, and at around 22:00 UTC last Thursday, all of travis-ci.org was talking to its VMs via IPv6 only. The VMs themselves can still use IPv4 connections, as the network allows IPv6 addresses to be NATed for requests hitting IPv4 resources, which is most notably required for package and dependency mirrors.

On Friday, we found several builds that needed to be requeued or updated to reflect their overall status properly, but we could confirm that all VM deployments were succeeding and we were running at full capacity again.

On Monday the 22nd, in the afternoon, we added more capacity to our travis-ci.com cluster in preparation for sunsetting our old build infrastructure entirely. Unfortunately, the IPv4 address issues immediately started popping up again. We took down a portion of our build processors to reduce the impact, but VM deployments were still failing.

While we had planned to move travis-ci.com to IPv6 the next day, we decided to make an emergency switch to avoid any more significant downtime.

That switch was successfully deployed at 20:00 UTC, so we could turn on builds again.

Since then, our VM deployments haven’t seen any significant issues. We’re still investigating some problems getting particular services running on the IPv6 setup.

The mitigation

The major change required to reduce the address allocation issues is now successfully deployed on both platforms. It will allow us to add more capacity more reliably, without having to worry about address conflicts in the future.

We still have homework to do when it comes to throttling our own code when VM deployments fail. While this particular problem is unlikely to pop up again, VM creation can still fail for other reasons, and we’re working on making the code that handles these failures more resilient. It shouldn’t keep hitting the underlying API and requesting VMs for more builds when a majority of those requests are failing.

On top of that, our alerting didn’t notify us soon enough of the increased number of failures. We’re working on improving this, so we can catch the issues sooner.

Regarding the issue with fetching the .travis.yml file, we’ve already improved the code to handle this particular status code better, but we need to look at the different return codes from the GitHub API in more detail to decide which ones should be treated as failures and retried later.

As for PostgreSQL, we’re continuing to investigate slow queries so we can bring back the missing element in the sidebar with confidence.

It’s been a rough week, and we’re very sorry about these issues. We still have a lot of work to do to make sure both platforms can continue to grow and can handle wide-impact failure scenarios better than they currently do. There are a lot of moving parts in Travis CI, and the only right thing to do is to assume failure and chaos, all the time.

Much love,

Mathias and the Travis team.