Astro deployment schedulers crash looping due to incorrect airflow local settings config
Incident Report for Astro
Postmortem

On August 14 at 2:18PM UTC, Astronomer’s internal monitoring system began receiving sporadic, seemingly disparate alerts indicating that the scheduler component in some Astro Deployments was unhealthy.

At 3:40PM UTC, once it became clear that these alerts were related and impacting Astro users, Astronomer initiated its incident response process.

At 3:44PM UTC, the root cause of the incident was identified, and by 4:02PM UTC, the incident was resolved.

The incident was caused by a code change to the airflow_local_settings.py file that is managed by Astronomer to implement cluster policies that usually make Airflow more reliable. For example, one such cluster policy prevents users from creating KubernetesPodOperator pods or KubernetesExecutor pods from attempting to consume more resources than are available in the user’s Deployment.

This code change introduced a bug that impacted users who are using an Astro Runtime with the Python version set to less than Python 3.10. This bug was caused by implementing a feature, Python’s platform.freedesktop_os_release(), which was introduced in Python 3.10 and unavailable in Python versions less than 3.10.

To prevent this from happening again, Astronomer is now testing against all Python versions that are supported by Astro Runtime.

In order to respond quicker to potentially systematic issues impacting scheduler availability, Astronomer will raise the priority of alerts indicating schedulers are unhealthy such that the on-call support engineer will be paged.

Posted Aug 27, 2024 - 20:15 UTC

Resolved
This incident has been resolved.
Posted Aug 14, 2024 - 16:32 UTC
Monitoring
Correction, the affected Runtime images are those using versions of Python *less than* 3.10.

We have implemented a fix and are monitoring the results now.
Posted Aug 14, 2024 - 16:01 UTC
Identified
We have identified the issue, which only effects deployments using Astro Runtime with Python version <3.10. We are working on releasing a hotfix now.
Posted Aug 14, 2024 - 15:48 UTC
Investigating
Astronomer is investigating an issue with Astro deployment scheduler pods entering a CrashLoop state due to an error in the airflow local settings file. We're investigating the root cause and will update this page as we have more information.
Posted Aug 14, 2024 - 15:45 UTC
This incident affected: Astro Hosted (Scheduling and Running DAGs and Tasks) and Astro Hybrid (Scheduling and Running DAGs and Tasks).