Unplanned Tracker Downtime Earlier Today

Dan Podsedly • Thursday, October 24, 2013 • News

Pivotal Tracker went down without notice for all users today at 10:50 US Pacific time, for approximately half an hour. This happened during a planned release to upgrade our production version of Solr (our search server) and to move our Solr-based search service to new hosts, for improved performance and search functionality. This release, like most of the releases that we do, should not have been noticeable or caused you any disruption.

The release itself failed quickly due to a misconfiguration related to monit, the “watchdog” process that restarts services when they die. Normally, this would not be an issue, due to our automated rollback mechanism, but in this case, the rollback itself failed for a different reason, leaving the system in a bad state.

It took just under half an hour for our team to determine exactly what caused the automated rollback to fail and to manually roll back the release, and for all application servers in the cluster to restart.

We are in the process of reviewing this outage in more detail, and are putting together a list of action items. Our staging environment, where we test each release, will be made more similar to our production cluster, and we are going to make rollbacks a regular part of our release testing process. We also discovered some potential areas for improvement in our web client application, to make it handle server outages more gracefully.

Please note that we do always have the option to switch over to an alternate environment, in a different data center, without any data loss. That option, however, is generally a slower last resort, used only if we can’t recover from an issue with our primary production environment.

Please accept our apology for this outage right in the middle of the work day. We know you (like us) rely on Tracker heavily, and we do everything we can to make sure Tracker is up and running, happily, 24/7, 365 days a year.

P.S. Thanks so much to @bluebox for immediately jumping on this with us, proactively, and helping us get things back to normal.
