Airflow deployments unhealthy due to scheduler issues

Incident Report for Astro

Postmortem

On October 1, 2025, many Astro deployments were suddenly reporting as unhealthy. We received nearly simultaneous customer reports as well as internal monitoring indicators around 12:30 EDT that something had gone wrong and promptly began our incident management process. Given the widespread nature of the issue, across deployments on multiple cloud providers and Apache Airflow versions, we looked for anything that had changed across all of Astro. There was only one change made today, approximately one hour before issues were reported. This change was to updates.astronomer.io/astronomer-runtime, which various Astronomer tools use to determine when new versions of Runtime are available and for how long they are supported.

We made a long-expected change to the format of this table to better align with our policies and practices (which are outlined at https://www.astronomer.io/docs/runtime/runtime-version-lifecycle-policy) by removing the deprecated "LTS" and "endOfSupport" fields. We tested to ensure that no load-bearing items in Astro depend upon the presence of these fields. However, when actually making the change, we also updated the schema version number for the updates table. We had not realized that there is a portion of Astro code that validates that the schema version is within a certain narrow range. With the new schema version number, this validation check failed, and the fallback logic was not suitable for many deployments. Specifically, the fallback logic relied on a frozen list of Astro Runtime versions from March of this year, so any versions released after that time were not properly handled. Notably, this means that no Airflow 3 based deployments were accounted for in the fallback logic, and Airflow 3 requires significantly different components and settings.
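The failure mode described above can be illustrated with a minimal Python sketch. It is not Astro's actual code: the field names `schemaVersion` and `runtimeVersions`, the accepted range, and the placeholder version labels are all assumptions made for illustration.

```python
# Illustrative only: every name and value here is hypothetical.
SUPPORTED_SCHEMA_VERSIONS = range(1, 3)  # assumed "narrow range": accepts 1 and 2

# Assumed stand-in for the frozen snapshot from March.
# Nothing released after the snapshot (such as Airflow 3 based Runtimes) is in it.
FROZEN_RUNTIME_VERSIONS = ["old-runtime-a", "old-runtime-b"]


def known_runtime_versions(updates_doc: dict) -> list[str]:
    """Return the Runtime versions the platform believes exist."""
    if updates_doc.get("schemaVersion") in SUPPORTED_SCHEMA_VERSIONS:
        return list(updates_doc["runtimeVersions"])
    # Validation failed: silently fall back to the stale snapshot.
    return list(FROZEN_RUNTIME_VERSIONS)


# Same content, but the schema version was bumped outside the accepted range.
before = {"schemaVersion": 2, "runtimeVersions": ["old-runtime-a", "airflow3-runtime"]}
after = {"schemaVersion": 3, "runtimeVersions": ["old-runtime-a", "airflow3-runtime"]}

print("airflow3-runtime" in known_runtime_versions(before))  # True
print("airflow3-runtime" in known_runtime_versions(after))   # False
```

In this sketch the version list itself is unchanged; only the schema version number moves out of range, and that alone makes every newer version disappear from view.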

Because this updates document is read from the same location across all Astro deployments, including our test deployments, the change affected all of Astro simultaneously. Thus, our usual methods of testing in a lower environment did not adequately protect against a problem with this file.

We rolled this change back immediately upon becoming aware of the issue. However, this site is behind a CDN, and the changes took up to an hour to fully propagate. After the changes fully propagated, all scheduling and actual operations were functioning correctly. However, some deployments running Airflow 3 were still reporting as unhealthy. Correcting this incorrect status report required a manual change to each affected deployment. We developed a script to make this change across all Airflow 3 deployments in Astro, which then ran over the next several hours.

The actions we are taking in response to this incident are:

Changing Astro to take no action on a Deployment if it is unable to identify the version. Our previous fallback logic attempted to retreat to a "known safe" version, but the safest thing to do is simply to make no change to an existing deployment.
Changing our test environments so that we are able to test the impact of changes to updates.astronomer.io without affecting customer deployments.
Looking to create a mechanism to clear the cached content of updates.astronomer.io from Astro data planes on demand, because this caching extended the length of the issue.
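
The first action above, making no change when a version cannot be identified, could look roughly like the following Python sketch. The `Deployment` type, the `reconcile` function, and the shape of the version lookup are hypothetical and are not taken from Astro's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical, minimal view of a deployment."""
    name: str
    runtime_version: str


def reconcile(deployment: Deployment, known_versions: dict[str, dict]) -> Optional[dict]:
    """Return the settings to apply, or None to leave the deployment untouched."""
    settings = known_versions.get(deployment.runtime_version)
    if settings is None:
        # Version not recognized: do not guess a "known safe" version.
        # Returning None signals the caller to make no change at all.
        return None
    return settings


# Hypothetical lookup table that is missing the newer version.
known = {"old-runtime-a": {"components": ["scheduler", "webserver"]}}

print(reconcile(Deployment("d1", "old-runtime-a"), known))
print(reconcile(Deployment("d2", "airflow3-runtime"), known))  # None: left as-is
```

The design choice here is that an unrecognized version is treated as missing information rather than as a reason to reconfigure, so a bad or stale lookup table cannot by itself alter a running deployment.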

Posted Oct 01, 2025 - 23:17 UTC

Resolved

This incident has been resolved.

Posted Oct 01, 2025 - 22:17 UTC

Monitoring

The number of affected deployments has dropped significantly. Many deployments continue to inaccurately report as unhealthy, while their schedulers and other underlying components are working. We believe that all schedulers are no longer experiencing this issue, but we are verifying this.

Posted Oct 01, 2025 - 18:30 UTC

Investigating

We are currently investigating this issue.

Posted Oct 01, 2025 - 17:16 UTC

This incident affected: Astro Hosted (Scheduling and Running DAGs and Tasks, Deployment Access) and Astro Hybrid (Scheduling and Running DAGs and Tasks, Deployment Access).