Worker Scaling and UI Access Outage
Incident Report for Astro
Postmortem

What Happened:
Our internal monitoring tool raised an unusually high number of alerts, which generated heavy traffic to an internal Astronomer API. An inefficiency in that API caused it to scan the entire database on each request instead of retrieving only the records it needed, and under the increased load this exhausted the API's memory. The result was an outage for every component that relies on this API. One of those components is a step in the startup process for Airflow workers.
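To illustrate the class of inefficiency described above, the following is a minimal sketch and not the affected API's actual code. The `deployments` table, its columns, and the function names are hypothetical, and an in-memory SQLite database stands in for the real datastore. It contrasts loading every row and filtering in application code with asking the database for only the matching rows.

```python
import sqlite3


def build_db(n_rows=10_000):
    """Create an in-memory database with a hypothetical `deployments` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE deployments ("
        "id INTEGER PRIMARY KEY, workspace TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO deployments (workspace, status) VALUES (?, ?)",
        ((f"ws-{i % 100}", "healthy") for i in range(n_rows)),
    )
    # An index lets the database locate matching rows without a full scan.
    conn.execute(
        "CREATE INDEX idx_deployments_workspace ON deployments (workspace)"
    )
    return conn


def lookup_full_scan(conn, workspace):
    """Anti-pattern: load every row into memory, then filter in Python."""
    rows = conn.execute(
        "SELECT id, workspace, status FROM deployments"
    ).fetchall()
    matches = [row for row in rows if row[1] == workspace]
    return matches, len(rows)  # second value: rows held in memory


def lookup_targeted(conn, workspace):
    """Preferred: let the database return only the rows that match."""
    rows = conn.execute(
        "SELECT id, workspace, status FROM deployments WHERE workspace = ?",
        (workspace,),
    ).fetchall()
    return rows, len(rows)
```

Both functions return the same matches, but the first holds the whole table in memory on every request, so its memory use grows with table size and with the number of concurrent requests.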

Immediate Actions:
To address the issue promptly, we disabled processing of non-critical alerts in the monitoring tool.

Preventive Measures:

  1. The monitoring tool has been updated to make API calls only to the Astro API, which is designed to be far more robust and scalable than the internal API whose failure led to this outage.
  2. Additionally, a lightweight system now monitors the specific API that encountered the problem to prevent future issues.

We apologize for any inconvenience this may have caused and are committed to ensuring a more reliable experience. If you have questions or need further information, please reach out to our support team at support@astronomer.io.

Posted Nov 03, 2023 - 20:16 UTC

Resolved
Earlier today, an issue with Astro prevented new workers from starting up. For customers who deployed during this time, the lack of new workers meant that Airflow was not running any tasks. The same underlying cause also prevented some customers from viewing the Deployments page in the Astro UI. This issue is now resolved, and we will be conducting a full root cause analysis.
Posted Oct 29, 2023 - 13:00 UTC