Between March 18 and March 26, 2025, Astro experienced a series of related incidents. Because the events overlap, this write-up analyzes the full set of incidents together.
On Tuesday, March 18, an Azure outage prevented the control plane components that handle authentication requests from reaching our authentication vendor, Auth0. The outage lasted 45 minutes and resolved without intervention from Astronomer. Because the control plane for Astro is hosted in Azure, all customers were affected, including those that host their data planes in AWS or GCP. During this period, customers were unable to reach the Astro or Airflow UI and were unable to get a response from the Airflow API.
On Friday, March 21, our monitoring systems alerted us to a critical issue affecting the control plane authentication components, again preventing customers from accessing the Astro UI and API. Our investigation revealed the root cause was unusually high volumes of automated authentication requests that overwhelmed the scaling capabilities of our authentication components. While we cannot share specific details about this traffic for security reasons, we can confirm that our security team verified no customer data was compromised or exposed. Our engineering team implemented protective measures against similar authentication scaling issues, including reconfiguring several network settings to improve resilience.
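The write-up does not disclose the specific protective measures, but a common defense against bursts of automated requests overwhelming an authentication tier is rate limiting at the edge. The sketch below is a generic token-bucket limiter in Python, purely illustrative and not Astronomer's implementation; the class name, rates, and capacities are assumptions:

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter: permits short bursts up to
    `capacity`, then throttles to a sustained `rate` requests/second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens replenished per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical usage at an authentication endpoint:
limiter = TokenBucket(rate=10, capacity=100)
if not limiter.allow():
    pass  # reject the request, e.g. with HTTP 429 Too Many Requests
```

A limiter like this lets legitimate login traffic through while capping the sustained request rate that autoscaling components must absorb.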
Early Sunday, March 23, we discovered that our network reconfiguration had unintentionally triggered a previously unknown bug: the reconfiguration caused the Harmony client to receive 404 errors, and a defect in the client's handling of those errors affected node pools in a number of customer clusters.
The team manually restored node pools to all affected clusters by Sunday afternoon. We developed a fix for the Harmony client bug and implemented enhanced monitoring to detect any similar issues. Given the complexity of these interconnected incidents, we decided to implement a temporary change freeze while thoroughly reviewing the fix and our deployment procedures to avoid introducing new issues.
On Wednesday, March 26, our monitoring systems detected a recurrence of the Harmony client issue. Our incident response team immediately engaged and identified the cause: during troubleshooting, a change intended for our staging environment had been applied to production through a gap in our change management process. This change caused the Harmony client to once again receive 404 errors, re-triggering the issue. Given the large number of clusters, manual fixes were taking too long. No longer confident that waiting was the safer option, we initiated a hotfix release of the Harmony client patch, fixing the root cause across the fleet.
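The write-up does not describe the Harmony client's internals, but the class of bug it suggests, a sync agent misreading transient 404 responses from the control plane as a signal to act on resources, can be sketched generically. Everything below (the function, the threshold, the action names) is a hypothetical illustration of the defensive pattern, not the actual fix:

```python
from enum import Enum

class Action(Enum):
    KEEP = "keep"
    DELETE = "delete"
    RETRY = "retry"

def reconcile(status_code: int, consecutive_404s: int,
              threshold: int = 5) -> Action:
    """Decide what a hypothetical sync agent does with a resource based
    on the control plane's response. A naive agent that acts on any
    single 404 will tear down healthy resources during a transient
    control-plane misconfiguration; requiring a sustained run of 404s
    before acting is safer."""
    if status_code == 200:
        return Action.KEEP
    if status_code == 404:
        # Treat isolated 404s as possibly transient; only act once the
        # signal has persisted past the threshold.
        return Action.RETRY if consecutive_404s < threshold else Action.DELETE
    # 5xx and other errors: back off and retry, never delete.
    return Action.RETRY
```

The design choice being illustrated is that destructive actions should require a durable, corroborated signal, so a brief window of bad responses cannot cascade into fleet-wide resource loss.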
We have corrected the underlying bug in the Harmony client and taken initial steps to handle bursts of unusual traffic. In the coming weeks, we plan to put further mitigations in place around automated traffic to further protect the system. We are evaluating additional process measures to further separate staging and production access. We have also improved our monitoring to detect similar issues and are discussing further monitoring improvements. Additionally, we will deploy a script that automates recovery of a set of clusters to improve remediation time.
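The main advantage of scripted recovery over the manual, cluster-by-cluster fixes applied during the incident is that the per-cluster step can run in parallel across the fleet. The sketch below assumes a hypothetical `restore_node_pools` remediation step (the real one would call a cloud provider's API) and simply shows the concurrency pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def restore_node_pools(cluster: str) -> str:
    # Placeholder for the real per-cluster remediation step, e.g.
    # re-creating node pools via the cloud provider's API.
    return f"{cluster}: restored"

def remediate(clusters: list[str], parallelism: int = 8) -> list[str]:
    """Run the per-cluster fix concurrently rather than sequentially,
    so total remediation time is bounded by the slowest batch instead
    of the sum over all clusters."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(restore_node_pools, clusters))
```

With hundreds of clusters, this is the difference between hours of serial toil and minutes of bounded, repeatable recovery.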
If you have any additional questions or concerns, please contact us at support@astronomer.io.