Astro UI and Worker Scaling Impacted by Authentication Provider Outage

Incident Report for Astro

Postmortem

Astro was impacted by an outage of our authentication provider. As expected, the UI across Astro became unreachable, because our system could not determine who was a properly credentialed user. We reviewed whether there was any way we could have maintained UI access during the authentication outage, but concluded that any such change would be an unacceptable compromise to our security.

In addition to making the UI inaccessible, the authentication provider outage also degraded the scaling of workers that perform task runs. Certain Astro components involved in auto-scaling have verification steps that currently make calls to our authentication service, and these steps failed when the calls could not be completed. We have since determined alternate, secure methods of performing these verification steps that do not require new calls to be made to Astro’s authentication provider during a scale-up. We will be making changes to Astro to implement these improved methods, and we expect them to be available by the end of 2023.

At Astronomer, our top priority is the reliable and secure execution of your DAGs. We believe that with these changes, customer Deployments will keep the same high level of security they have today while gaining increased resilience to any future authentication outages.
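
The report does not say which alternate verification methods were chosen. As a purely illustrative sketch of the general idea (verifying a credential locally instead of calling the authentication provider at scale-up time), the following checks a signed, expiring token against a key the component already holds. The names `sign_token` and `verify_token`, the token layout, and the shared-key scheme are all hypothetical and are not Astro's implementation.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def sign_token(claims: dict, key: bytes) -> str:
    """Hypothetical issuer side: encode the claims and append an HMAC signature."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token: str, key: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Verify a token using only local data; no network call is made.

    Returns the claims if the signature matches and the token has not
    expired, otherwise None.
    """
    try:
        payload_b64, signature = token.rsplit(".", 1)
    except ValueError:
        return None  # token does not have the "payload.signature" shape
    expected = hmac.new(key, payload_b64.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking signature prefixes.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return None  # expired or missing expiry
    return claims
```

In a scheme like this, an outage of the provider would only block issuing new tokens; a component holding a still-valid token and the key could keep verifying during a scale-up.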

Posted Nov 03, 2023 - 20:31 UTC

Resolved

Normal performance has returned to all Astro components.

Posted Oct 30, 2023 - 22:31 UTC

Update

We have seen some slowness in the UI as Astro recovers.

We have identified two causes of the task failures. On Deployments with KubernetesExecutor and dag-only-deployments enabled, worker pods could not start up successfully, and their tasks were marked as failed. Similarly, CeleryExecutor Deployments on Runtime 8+ mark tasks as failed after they have been queued for 10 minutes, so the slow execution speed from degraded auto-scaling caused some tasks to be marked as failed.
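
The queued-timeout behavior described above can be sketched as follows. This is a simplified illustration, not the scheduler's actual code: the function name `find_timed_out` and the in-memory dictionary are hypothetical, and only the 10-minute threshold comes from this update. (Open-source Airflow 2.6+ exposes a scheduler setting named `task_queued_timeout` for this purpose; whether Astro relies on exactly that setting is an assumption.)

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the update above: 10 minutes in the queued state.
QUEUED_TIMEOUT = timedelta(minutes=10)


def find_timed_out(queued_at_by_task, now, timeout=QUEUED_TIMEOUT):
    """Return the ids of tasks that have been queued for longer than `timeout`.

    `queued_at_by_task` maps a task id to the time it entered the queue.
    """
    return sorted(
        task_id
        for task_id, queued_at in queued_at_by_task.items()
        if now - queued_at > timeout
    )


if __name__ == "__main__":
    now = datetime(2023, 10, 30, 20, 30, tzinfo=timezone.utc)
    queued = {
        "task_a": now - timedelta(minutes=3),   # still within the limit
        "task_b": now - timedelta(minutes=12),  # waited too long for a worker
    }
    for task_id in find_timed_out(queued, now):
        print(f"{task_id} would be marked failed")
```

When auto-scaling is degraded, fewer workers pick up queued tasks, so more tasks cross the threshold even though nothing is wrong with the tasks themselves.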

Posted Oct 30, 2023 - 20:48 UTC

Monitoring

The underlying provider outage appears to have ended. Access to Astro is now restored, and Astro appears to be functioning normally again. We did see that more tasks than usual failed during this period, and we are continuing to investigate why these failures occurred. Please review your recent task runs, as you may need to clear and restart tasks.

Posted Oct 30, 2023 - 20:32 UTC

Investigating

The authentication provider for Astro appears to be having a major outage, and this is preventing access to Astro. This will not directly stop the successful execution of tasks, but it will impact auto-scaling of workers, which could lead to much slower rates of task execution.

Posted Oct 30, 2023 - 20:17 UTC

This incident affected: Astro Hosted (Scheduling and Running DAGs and Tasks, Deployment Access, Deployment Management, Cloud UI, Cloud API) and Astro Hybrid (Scheduling and Running DAGs and Tasks, Deployment Access, Deployment Management, Cloud UI, Cloud API, Cluster Management).