Hybrid customers are unable to create or update deployments
Incident Report for Astro
Postmortem

The Problem

On Wednesday, January 17, 2024, some customers were unable to create or apply changes to Airflow Deployments running on Astro Hybrid clusters, including deploying image or DAG updates to those Deployments. The outage lasted from 22:20 to 23:25 UTC, a total of 1 hour and 5 minutes, and affected 720 individual HTTP requests to Astro API, the main API powering all user interactions with Astro. Throughout this period, Airflow tasks continued to run and were unaffected.

Ultimately, the outage was caused by an incompatibility between two services that power the Astro control plane: Astro API and an internal service named Harmony. Astro API is our public-facing API that all user interactions flow through, which includes creating, updating and deleting Airflow deployments. Astro API requests that create, update or delete an Airflow deployment then flow to Harmony, which generates and synchronizes Kubernetes manifest files that describe the desired state of a given Astro data plane.

Prior to the release that caused this outage, we were running two distinct versions of Harmony. One version managed the Hybrid data planes, and one version managed the Hosted data planes. When the release that triggered this outage was deployed to production, Astro Hybrid deployment requests from Astro API to Harmony began to fail, due to an incompatibility with the version of the Harmony service that was running. Astro Hosted deployments were unaffected.

Root Cause

This outage was caused by an incompatibility between our external Astro API service and our internal Harmony service. While Astro API is rolled out via an automated deployment process, Harmony is upgraded differently, using a canary rollout mechanism that gradually rolls changes out to the data planes under management. At the moment, progressing through the canary rollout requires a human in the loop.

There were two main contributing factors that led to this outage:

First, the changes we deployed to Astro API were not backward compatible with the running version of our Harmony service. Regardless of our canary rollout process, those changes should have worked with both the older and the newer versions of Harmony.
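One common way to achieve that kind of compatibility is the tolerant-reader pattern: any new field is optional with a safe default, so requests built for the newer Harmony still work against the older one. A minimal sketch in Python (field names such as `clusterType` are illustrative, not Astronomer's actual API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeployRequest:
    deployment_id: str
    # New field: optional so payloads from before the change remain valid.
    cluster_type: Optional[str] = None


def build_request(payload: dict) -> DeployRequest:
    # Tolerant reader: accept payloads with or without the new field,
    # so both the old and the consolidated Harmony can be served.
    return DeployRequest(
        deployment_id=payload["deploymentId"],
        cluster_type=payload.get("clusterType"),  # absent => old-style request
    )
```

With this shape, rolling the producer and consumer forward independently is safe: an older Harmony simply never sees (or ignores) the new field.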

Second, the Harmony rollout process is fully automated in our Development and Staging environments, which creates a difference between those lower environments and Production, where operating the canary rollout currently requires human intervention. While investigating this incident, we found that the issue did occur in our lower environments, but it was automatically resolved within minutes by our automated data plane upgrade process. As a result, the issue went undiscovered until the release reached Production.

What We’re Doing to Prevent this from Recurring

At the lowest level, this issue surfaced due to an API incompatibility. The change in question was actually part of an effort to simplify our system: as mentioned above, prior to this release we were running two versions of our Harmony service, one for Hybrid and one for Hosted. Unfortunately, as we rolled out the change consolidating these into a single service, we failed to make the Astro API change backward compatible with the previous version of Harmony.

Once this outage was mitigated, our system was actually simpler than it was before the outage. We now send both Hosted and Hybrid requests to a single Harmony service, and no longer need to manage the complexity of two different versions. Now that we’ve reduced the surface area of the Harmony service, we are working to harden it behind an OpenAPI specification, allowing our consuming services to share code and take advantage of an auto-generated client. This will help prevent certain classes of bugs in this communication path in the future.
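As a rough illustration of what that hardening buys: once the contract lives in an OpenAPI document, a typed client can be generated for consumers such as Astro API, so a request-shape mismatch surfaces at code-generation or compile time rather than as a production 500. A minimal, hypothetical fragment (operation and path names are illustrative only, not Astronomer's actual internal API):

```yaml
# Hypothetical excerpt -- paths and operations are illustrative only.
openapi: "3.0.3"
info:
  title: Harmony (internal)
  version: "1.0.0"
paths:
  /deployments/{deploymentId}/sync:
    post:
      operationId: syncDeployment
      parameters:
        - name: deploymentId
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: Desired-state manifests accepted for synchronization
```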

Beyond preventing code-level bugs, we are also working to bring our Staging and Production environments into closer alignment with respect to the Harmony service’s canary rollout system. As mentioned, if our Staging environment’s data planes had been subject to the same longer canary rollout as Production, we would have caught this issue much earlier in the process.

At a higher level, we are working to build a much more robust system for rolling out upgrades to our data planes. Rather than potentially exposing the entire Hybrid data plane fleet to an issue like this, such a system would let us verify changes on a small subset of clusters before rolling them out any further.
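To make that idea concrete, a batched rollout can gate each step on health checks, so a bad release halts after touching only a small slice of the fleet. A minimal sketch, where the `upgrade` and `healthy` callbacks are hypothetical stand-ins for real tooling, not Astronomer's actual rollout code:

```python
def canary_rollout(clusters, upgrade, healthy, batch_fraction=0.1):
    """Upgrade data-plane clusters in small batches, halting if any batch
    reports unhealthy, so the whole fleet is never exposed at once.
    Illustrative sketch only."""
    remaining = list(clusters)
    batch_size = max(1, int(len(remaining) * batch_fraction))
    upgraded = []
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for cluster in batch:
            upgrade(cluster)
        if not all(healthy(c) for c in batch):
            # Halt the rollout; clusters after this batch are untouched.
            return upgraded, False
        upgraded.extend(batch)
    return upgraded, True
```

A failed health check in the first batch leaves the rest of the fleet on the known-good version, which is exactly the blast-radius reduction described above.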

Posted Jan 30, 2024 - 19:55 UTC

Resolved
Creating and updating deployments resulted in failures, which could manifest as internal server errors (status code 500).
Posted Jan 17, 2024 - 23:00 UTC