On August 14 at 2:18PM UTC, Astronomer’s internal monitoring system began receiving sporadic, seemingly disparate alerts indicating that the scheduler component in some Astro Deployments was unhealthy.
At 3:40PM UTC, once it became clear that these alerts were related and impacting Astro users, Astronomer initiated its incident response process.
At 3:44PM UTC, the root cause of the incident was identified, and by 4:02PM UTC, the incident was resolved.
The incident was caused by a code change to the airflow_local_settings.py file, which Astronomer manages in order to implement cluster policies that usually make Airflow more reliable. For example, one such cluster policy prevents KubernetesPodOperator pods and KubernetesExecutor pods from attempting to consume more resources than are available in the user’s Deployment.
This code change introduced a bug that affected users running an Astro Runtime with a Python version earlier than 3.10. The bug was caused by a call to Python’s platform.freedesktop_os_release(), a function that was introduced in Python 3.10 and is unavailable in earlier versions.
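One way to guard against this class of failure is to check the interpreter version before calling the function. The following is a minimal sketch, not the fix Astronomer shipped; the helper name os_release_info and the empty-dict fallback are assumptions made for illustration.

```python
import platform
import sys


def os_release_info():
    """Return os-release data as a dict, or an empty dict when unavailable."""
    # platform.freedesktop_os_release() exists only on Python 3.10 and later;
    # on older interpreters the attribute lookup raises AttributeError.
    if sys.version_info < (3, 10):
        return {}
    try:
        return platform.freedesktop_os_release()
    except OSError:
        # Raised when no os-release file can be read (e.g. on non-Linux hosts).
        return {}
```

Calling os_release_info() is then safe on every interpreter version, and callers treat a missing key as "unknown" instead of failing at import time.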
To prevent this from happening again, Astronomer is now testing against all Python versions that are supported by Astro Runtime.
To respond more quickly to potentially systemic issues affecting scheduler availability, Astronomer will raise the priority of alerts indicating that schedulers are unhealthy, so that the on-call support engineer is paged.