Certain Astro clusters on AWS experiencing downtime

Incident Report for Astro

Resolved

We have identified the event that caused this downtime and are confident that it will not recur. We will post a public RCA in the coming week.

Posted Mar 23, 2025 - 18:27 UTC

Monitoring

We have applied a remediation for all of the affected clusters. No clusters are currently experiencing downtime. We are continuing to examine the root cause and will update again when we are confident that the issue will not recur.

Posted Mar 23, 2025 - 15:44 UTC

Identified

We have identified a problem with scaling behavior that is causing a limited number of clusters to experience downtime. The message 'Internal Server Error' is displayed in the UI, preventing users from viewing DAGs and accessing the Airflow UI. In some cases, task execution is also affected. We are currently working on a fix.

Posted Mar 23, 2025 - 14:51 UTC

This incident affected: Astro Hosted (Scheduling and Running DAGs and Tasks, Deployment Access, Cloud UI, Cloud API).