From Nov 8, 16:09Z until the issue was resolved on Nov 8, 20:21Z, Astro Hybrid Deployments using the Kubernetes Executor with DAG-Only Deploy that were updated by customers were unable to start new Airflow workers. In total, 4 such Deployments across Astro were affected.
As part of releasing Deploy Rollbacks, the URL from which a Deployment checks for new DAGs was changed to be stored locally within the Astro Data Plane rather than fetched via an API call to the Control Plane. This design is more robust to intermittent network issues, and we continue to believe the change was conceptually correct. However, Kubernetes worker pods were not granted access to the URL's new location. Because retrieving this URL and then using it to download the DAGs is a critical step in initializing new workers, the workers could not start. Once the issue was understood, an update was rolled out to all Hybrid clusters granting worker pods access to the new location.
This issue clearly should have been caught in testing. We reviewed our testing procedure and found a gap. With the Celery Executor, a deployment procedure's impact on starting workers can be tested without running any DAGs, because the Celery Executor starts a worker even when no tasks are running (unless scale-to-zero is enabled, and we disable that setting for these tests). We were using the same procedure for the Kubernetes Executor, but because no DAGs were running, no worker pods were started, and thus no errors surfaced. We have now adjusted the Kubernetes Executor test suite to actually run a DAG after exercising a new deployment procedure, ensuring a much more realistic test.
Our existing alerting did detect these issues as they started happening to customers. We are nevertheless going to invest in more targeted alerting around the DAG Download process.