On October 1, 2025, many Astro deployments suddenly began reporting as unhealthy. Around 12:30 EDT we received nearly simultaneous customer reports and internal monitoring alerts that something had gone wrong, and we promptly began our incident management process. Because the issue was widespread, affecting deployments on multiple cloud providers and Apache Airflow versions, we looked for anything that had changed across all of Astro. Only one such change had been made that day, approximately one hour before issues were reported. This change was to updates.astronomer.io/astronomer-runtime, which various Astronomer tools use to determine when new versions of Runtime are available and for how long they are supported.
We made a long-expected change to the format of this table to better align with our policies and practices (which are outlined at https://www.astronomer.io/docs/runtime/runtime-version-lifecycle-policy) by removing the deprecated "LTS" and "endOfSupport" fields. We tested to ensure that no load-bearing items in Astro depend on the presence of these fields. However, when actually making the change, we also updated the schema version number for the updates table. We had not realized that a portion of Astro code validates that the schema version is within a certain narrow range. The new schema version failed this validation check, and the fallback logic was not suitable for many deployments. Specifically, the fallback logic relied on a frozen list of Astro Runtime versions from March of this year, so any versions released after that time were not handled properly. Notably, this means that no Airflow 3-based deployments were accounted for in the fallback logic, and Airflow 3 requires significantly different components and settings.
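To make the failure mode concrete, the following is a minimal sketch of the pattern described above, not the actual Astro code. The function name `resolve_runtime`, the field names `schemaVersion` and `runtimeVersions`, the accepted version range, and the example version labels are all assumptions chosen for illustration.

```python
from typing import Optional

# Hypothetical: the real accepted range is not stated in this post.
SUPPORTED_SCHEMA_VERSIONS = range(1, 3)  # accepts schema versions 1 and 2 only

# Hypothetical stand-in for the frozen list of Runtime versions from March.
# It contains no Airflow 3-based release because none existed in the snapshot.
FROZEN_SNAPSHOT = {
    "af2-example": {"airflow_major": 2},
}


def resolve_runtime(doc: dict, runtime_version: str) -> Optional[dict]:
    """Look up metadata for a Runtime version.

    Uses the live updates document when its schema version is accepted,
    otherwise falls back to the frozen snapshot. Returns None when the
    chosen source does not know the requested version.
    """
    if doc.get("schemaVersion") in SUPPORTED_SCHEMA_VERSIONS:
        source = doc.get("runtimeVersions", {})
    else:
        # Silent fallback: anything released after the snapshot is invisible.
        source = FROZEN_SNAPSHOT
    return source.get(runtime_version)


# Live document before the change: both releases resolve.
live_doc = {
    "schemaVersion": 2,
    "runtimeVersions": {
        "af2-example": {"airflow_major": 2},
        "af3-example": {"airflow_major": 3},
    },
}
before_af3 = resolve_runtime(live_doc, "af3-example")

# Same data with a bumped schema version: validation fails, the snapshot
# is used, and the Airflow 3-based release can no longer be resolved.
bumped_doc = dict(live_doc, schemaVersion=3)
after_af2 = resolve_runtime(bumped_doc, "af2-example")
after_af3 = resolve_runtime(bumped_doc, "af3-example")
```

In this sketch the data for the newer release is still present in the document; only the version bump makes it unreachable. That is why testing the removal of the fields alone would not surface the problem.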
Because this updates document is read from the same location across all Astro deployments, including our test deployments, the change affected all of Astro simultaneously. Thus, our usual methods of testing in a lower environment did not adequately protect against a problem with this file.
We rolled this change back immediately upon becoming aware of the issue. However, this site is behind a CDN, and the rollback took up to an hour to fully propagate. After it had fully propagated, all scheduling and actual operations were functioning correctly. However, deployments running Airflow 3 were still reporting as unhealthy. Correcting this incorrect status report required a manual change to each affected deployment. We developed a script to make this change across all Airflow 3 deployments in Astro, which then ran over the next several hours.
The actions we are taking in response to this incident are: