Astro Cloud UI "No Healthy Upstream"
Incident Report for Astro
Postmortem

Problem

Early on the morning of January 22nd, customers faced issues accessing the Cloud UI and performing certain operations with the Astro CLI. The intermittent outage spanned from 3:25 AM to 4:07 AM PST. It made the Cloud UI unavailable for some customers, prevented Astro CLI deploy commands from running successfully, and prevented successful node scale-ups in Astro data planes. We don’t believe that this outage caused any tasks to fail, but it might have slowed task scheduling for some customers.

  • For Deployments using the Celery executor on Astro Hybrid, workers could not properly scale up during the outage periods.
  • For Deployments using the Kubernetes executor, or Deployments using the Celery executor on Astro Hosted, new worker pods could start, but new nodes could not be added, so autoscaling was limited to the nodes available at the start of the outage.

The component that caused the intermittent outage is the Astro API, a critical service in the control plane that mediates most actions in Astro. Because Airflow itself doesn’t use this component, losing the Astro API prevents most Astro functions from working but doesn’t directly impact the operation of running Airflow Deployments.

The problem was triggered when an end user toggled the “Linked to all Deployments” option in the Airflow connection management menu for their Astro Workspace. This action resulted in a nil pointer dereference panic that bypassed the middleware designed to trap and recover from such panics, killing the Astro API container and taking down one of the running replicas. Because a typical reaction to a failure is to retry the operation, the end user likely kept retrying while the back-end system was bringing up new Astro API containers to replace the ones that had just crashed, degrading the service and causing errors for other users trying to access it at the same time.
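For context, panic-recovery middleware in a Go HTTP service generally follows the pattern sketched below. This is a generic illustration, not the Astro API's actual middleware. One relevant property of Go's recover() is that it only catches panics raised in the goroutine where it was deferred, which is one plausible way a panic can slip past middleware like this.

```go
// Generic sketch of panic-recovery middleware in Go (illustrative only; not the
// Astro API's actual implementation).
package middleware

import (
	"log"
	"net/http"
)

// Recover converts a panic inside the HTTP handler into a 500 response instead
// of letting it take down the process.
func Recover(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				log.Printf("recovered from panic: %v", err)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		// Note: recover() above only catches panics in this goroutine. A panic in
		// a goroutine spawned by the handler still crashes the whole process.
		next.ServeHTTP(w, r)
	})
}
```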

One of the customers who encountered an error reported the problem to Astronomer Support at almost the same time as our internal alert detected the Astro API containers crashing. The Support team immediately escalated the issue to Engineering via our Incident Management process, and a mitigation was put in place within 40 minutes of the escalation.

Root Cause

We introduced code in the release rolled out on January 17 that has now been determined to be thread-unsafe. Ironically, this code was written to reduce duplication in our codebase, make it more modular and testable, and reduce the risk of introducing bugs.

We deployed logic in a specific code path that read from the database using transactions. However, in Go (and the ORM library we use), the concurrent transaction reads were actually using different connections and were therefore not thread-safe. As a result, the database calls retrieved nothing but also returned no errors, which led to a nil pointer dereference panic on the Astro API pods, causing them to restart.
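To make the failure mode concrete, here is a minimal sketch of the general shape of the bug, written against Go's standard database/sql and the errgroup package rather than our actual code, ORM, or schema (the table, column, and function names are placeholders): several goroutines read through the same transaction, a read silently comes back empty without an error, and a later unchecked dereference panics.

```go
// Illustrative sketch of the bug's shape; not the Astro API's actual code,
// schema, or ORM. Placeholder names throughout.
package example

import (
	"context"
	"database/sql"

	"golang.org/x/sync/errgroup"
)

type connectionInfo struct {
	DeploymentID string
	LinkedAll    bool
}

func readLinkedConnections(ctx context.Context, tx *sql.Tx, deploymentIDs []string) (map[string]bool, error) {
	results := make([]*connectionInfo, len(deploymentIDs))
	g, gctx := errgroup.WithContext(ctx)
	for i, id := range deploymentIDs {
		i, id := i, id
		g.Go(func() error {
			// UNSAFE: the transaction is shared across goroutines. In this incident
			// the concurrent reads ended up on different connections and retrieved
			// nothing, without returning an error.
			rows, err := tx.QueryContext(gctx,
				`SELECT deployment_id, linked_all FROM connections WHERE deployment_id = $1`, id)
			if err != nil {
				return err
			}
			defer rows.Close()
			for rows.Next() {
				var c connectionInfo
				if err := rows.Scan(&c.DeploymentID, &c.LinkedAll); err != nil {
					return err
				}
				results[i] = &c
			}
			return rows.Err() // an empty result set is not an error, so results[i] can stay nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}

	linked := make(map[string]bool, len(deploymentIDs))
	for i, id := range deploymentIDs {
		// Dereferencing a nil entry here is the nil pointer dereference that
		// panicked the Astro API pods.
		linked[id] = results[i].LinkedAll
	}
	return linked, nil
}
```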

We also now know the user scenario that executes this thread-unsafe code:

  1. A user goes to a Workspace that contains more than one Deployment.
  2. The user opens the Environment tab for that Workspace.
  3. The user toggles the “Linked to all Deployments” option (from true to false or from false to true), or adds or updates a connection.

We believe that toggling the “Linked to all Deployments” option in the UI is quick and easy to repeat, so a single user retrying the operation a few times could have caused each of the backend pods handling those requests to panic and crash, resulting in a degradation of the service.

The initial mitigation was rolled out within 40 minutes: instead of panicking in those situations, the Astro API returns HTTP 500 errors. The individual problem requests still failed but no longer caused the pods to restart.
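Continuing the sketch above, the mitigation has roughly this shape (illustrative, not the exact patch that was shipped): guard the unexpectedly empty result and return an error, which the HTTP layer turns into a 500 response.

```go
// Illustrative shape of the initial mitigation, continuing the earlier sketch
// (requires "fmt" in that sketch's imports; not the exact patch that shipped).
func linkedByDeployment(deploymentIDs []string, results []*connectionInfo) (map[string]bool, error) {
	linked := make(map[string]bool, len(deploymentIDs))
	for i, id := range deploymentIDs {
		if results[i] == nil {
			// Previously this nil was dereferenced and the pod panicked. With the
			// guard, only this request fails, and the caller returns an HTTP 500.
			return nil, fmt.Errorf("no connection data returned for deployment %s", id)
		}
		linked[id] = results[i].LinkedAll
	}
	return linked, nil
}
```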

We also released a medium-term fix on January 24 that replaces the concurrent database reads with serial (less optimal) reads that loop through all Deployments in the Workspace. This change has been validated to fix the problem: users changing the auto-link option no longer see panics or 500 errors.
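For comparison, the serial version looks roughly like the following, again using the same illustrative schema and placeholder names: one read at a time on the same transaction, trading some latency for correctness.

```go
// Illustrative sketch of the medium-term fix, continuing the earlier example:
// serial reads that loop through the Deployments in the Workspace instead of
// querying concurrently.
func readLinkedConnectionsSerial(ctx context.Context, tx *sql.Tx, deploymentIDs []string) (map[string]bool, error) {
	linked := make(map[string]bool, len(deploymentIDs))
	for _, id := range deploymentIDs {
		rows, err := tx.QueryContext(ctx,
			`SELECT linked_all FROM connections WHERE deployment_id = $1`, id)
		if err != nil {
			return nil, err
		}
		for rows.Next() {
			var linkedAll bool
			if err := rows.Scan(&linkedAll); err != nil {
				rows.Close()
				return nil, err
			}
			linked[id] = linkedAll
		}
		if err := rows.Err(); err != nil {
			rows.Close()
			return nil, err
		}
		rows.Close()
	}
	return linked, nil
}
```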

What We’re Doing to Prevent this from Recurring

We are focusing on improving the robustness of the Astro API, given its key role in the Astro control plane, as well as on responding faster and monitoring better when issues do arise.

The first alert raised for this incident was a pod restart alert for the Astro API pods. This alert, however, was not tuned correctly in two ways. First, it auto-resolved because of an incorrect setting. Second, and more critically, receiving this alert does not always mean that there is customer impact. Because of this and the auto-resolution, the engineer who was paged mistakenly believed that the issue was momentary and would not have customer impact. To prevent this, we are adding two new alerts based on the dashboards we used to determine the extent of the impact during the outage. These alerts will measure both real customer traffic to the Astro API and our own synthetic traffic (in case real traffic is blocked by an ingress issue), and will raise high-priority alarms to both our Support and Engineering teams if Astro API availability drops below very high thresholds. Because these alerts will always indicate an important customer-visible problem and go to multiple teams within Astronomer, we are confident they will be acted on with the appropriate urgency.
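As a rough illustration of the synthetic-traffic side (the endpoint, interval, and reporting below are placeholders, not our actual monitoring configuration), a probe of this kind periodically issues a request and records success or failure, and the resulting success rate drives the alert thresholds described above.

```go
// Minimal sketch of a synthetic probe (placeholder endpoint and interval). In
// practice the success/failure signal would be exported to the monitoring
// system that drives the new alerts, not just logged.
package probe

import (
	"context"
	"log"
	"net/http"
	"time"
)

func Run(ctx context.Context, url string, interval time.Duration) {
	client := &http.Client{Timeout: 10 * time.Second}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := client.Get(url)
			if err != nil {
				log.Printf("synthetic check failed: %v", err)
				continue
			}
			resp.Body.Close()
			if resp.StatusCode >= 500 {
				log.Printf("synthetic check failed: status=%d", resp.StatusCode)
				continue
			}
			log.Printf("synthetic check ok: status=%d", resp.StatusCode)
		}
	}
}
```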

Our postmortem review also revealed that a more robust Astro API pod setup would greatly reduce customer impact, even for issues of this magnitude. The problem was that the minimum size of the Astro API autoscaling group was too small, and the group was at that minimum size during the outage. We currently run a small number of large pods, and we are now looking to run a significantly larger number of smaller pods instead, which would make it less likely for panics to crash all of the pods at once. There are some nuances to work out about how to size and manage database connections across a larger number of pods, so this change will not be rolled out until we are sure it will not have other unintended negative consequences.
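One of those database-connection nuances can be illustrated with Go's standard connection-pool settings (the function and numbers below are placeholders, not our actual configuration): with more pods, each pod's pool must shrink so that the total number of connections across all pods stays under the database's limit.

```go
// Illustrative connection-pool sizing (placeholder numbers): per-pod limits must
// shrink as pod count grows so that podCount * maxOpenConnsPerPod stays below
// the database's connection limit.
package example

import (
	"database/sql"
	"time"
)

func configurePool(db *sql.DB, podCount, dbConnLimit int) {
	perPod := dbConnLimit / podCount // e.g. 500 total connections / 50 pods = 10 per pod
	db.SetMaxOpenConns(perPod)
	db.SetMaxIdleConns(perPod / 2)
	db.SetConnMaxLifetime(30 * time.Minute)
}
```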

We have analyzed the specific bug that triggered this outage, and we don’t believe we could have reasonably implemented a regression test that would have detected it. Without knowing in advance that the database reads would later be made concurrent and accidentally thread-unsafe, we could not have predicted which tests would be required.

We also evaluated whether a quick rollback could have resolved the outage without first determining the full root cause and fix. Although our deployment model supports rollbacks, it would not have been feasible here. The change that included the bug was committed over a week before the outage and deployed five days before it. We have a weekly release cadence in the control plane, but even with smaller and more frequent releases, this change did not cause a problem immediately; it had to be triggered by a specific user action. Because other updates shipped since this change involved database schema updates and other infrastructure changes, it was not obviously safe to roll back to a release from before the bug was deployed.

Posted Jan 22, 2024 - 15:45 UTC

Resolved
The incident has been resolved and all systems are operational.
Posted Jan 22, 2024 - 14:01 UTC
Monitoring
The issue was identified, and a fix has been applied. We are currently monitoring the deployments.
Posted Jan 22, 2024 - 12:43 UTC
Update
The impact was reassessed as major to expedite mitigation.
Posted Jan 22, 2024 - 12:03 UTC
Investigating
We are currently investigating this issue.
Posted Jan 22, 2024 - 12:01 UTC
This incident affected: Astro Hybrid (Deployment Access, Deployment Management, Cloud UI, Cloud API) and Astro Hosted (Deployment Access, Deployment Management, Cloud UI, Cloud API).