Tasks from deployments with KubernetesExecutor are unable to execute

Incident Report for Astro

Postmortem

From Nov 8, 16:09Z until the issue was resolved on Nov 8, 20:21Z, Astro Hybrid Deployments using Kubernetes Executor with DAG-Only Deploy that were updated by customers were unable to start new Airflow workers. In total, there were 4 such Deployments across Astro.

As part of releasing Deploy Rollbacks, the location of the URL from which the Deployment checks for new DAGs was changed to be stored locally within the Astro Data Plane instead of requiring an API call to the Control Plane. This design is more robust to intermittent network issues, and we continue to believe this change was conceptually correct. However, Kubernetes worker pods were not updated with correct access to the new location of the URL. This prevented the worker from starting up, as retrieving this URL and then using it to download the DAGs is a critical step in initializing new workers. Once the issue was understood, an update was rolled out to all Hybrid clusters to properly give access to the new location.

This issue clearly should have been caught in testing. We reviewed our testing procedure and found a gap. On Celery Executor, a deployment procedure's impact on starting workers can be tested without running any DAGs, because Celery Executor starts a worker even when no tasks are running (unless scale to zero is on, but we disable that setting for these particular tests). We were using the same test procedure for Kubernetes Executor, but because no DAGs were running, no workers were started, and thus no errors were raised. We have now adjusted the Kubernetes Executor test suite to include actually running a DAG after testing a new deployment procedure, thus ensuring a much more realistic test.

Our existing alerting did detect these issues as they started happening to customers. We are nevertheless going to invest in more targeted alerting around the DAG Download process.

Posted Nov 28, 2023 - 03:45 UTC

Resolved

This incident has been resolved.

Posted Nov 08, 2023 - 22:39 UTC

Update

Further correction, the only affected deployments are those under Astro Hybrid, have updated in the last day, and have dag deploy enabled.

Posted Nov 08, 2023 - 19:43 UTC

Update

Update - We have correctly identified the only affected deployments are those under Astro Hybrid and have updated in the last day. We are continuing to monitor the results of the fix.

Posted Nov 08, 2023 - 19:21 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 08, 2023 - 19:20 UTC

Update

We have correctly identified the only affected deployments are those under Astro Hybrid and have updated in the last day.

Posted Nov 08, 2023 - 19:10 UTC

Identified

Due to an issue with the Kubernetes Airflow worker pod being unable to download DAGs, these pods are unable to initialize leading to the task instance to be stuck in queued.

This only affects Airflow deployments using KubernetesExecutor.

Posted Nov 08, 2023 - 16:55 UTC

This incident affected: Astro Hosted (Scheduling and Running DAGs and Tasks) and Astro Hybrid (Scheduling and Running DAGs and Tasks).