Image based deploys are unavailable
Incident Report for Astro
Postmortem

On November 7th, around 3:00 AM PST, customers became unable to push new deployment images when using API tokens or API keys. The issue lasted for 19 hours. The incident was triggered when a new feature, deploy rollbacks, was merged and activated in the production environment: the release contained code that did not account for API token and API key authentication. Customers who had not upgraded their CLI could use the deploy command without error, but those on upgraded versions encountered failures. We mitigated the errors by unpinning the failing CLI version and disabling a feature flag before implementing a full fix.

The root cause was traced to a specific piece of code that handles authorization requests for our Docker registry. We had recently removed a proxy that sat in front of the registry and authenticated requests, and instead configured the registry to perform auth checks itself.

However, the newly implemented deploy rollback code did not handle authorization requests made with API tokens or API keys, so those requests failed. This was primarily an oversight: the implementation details required for this use case were not addressed.
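As an illustration of the kind of check that can catch this class of oversight, here is a minimal sketch in Python. It is not Astro's actual code: the credential kinds, the `HANDLERS` table, and the function names are all hypothetical. The idea is that every supported credential kind must have a registered handler, and a test can assert that none is missing.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Set


class CredentialKind(Enum):
    # Hypothetical credential kinds, named after those mentioned in this report.
    USER_SESSION = "user_session"
    API_TOKEN = "api_token"
    API_KEY = "api_key"


@dataclass(frozen=True)
class Credential:
    kind: CredentialKind
    subject: str  # hypothetical identifier of the user, token, or key


def _subject_is_granted(cred: Credential, granted: Set[str]) -> bool:
    # Placeholder check: the subject must appear in the set of granted subjects.
    return cred.subject in granted


# One handler per credential kind. Forgetting a kind here is the mistake
# that missing_handlers() is meant to surface in a unit test.
HANDLERS: Dict[CredentialKind, Callable[[Credential, Set[str]], bool]] = {
    CredentialKind.USER_SESSION: _subject_is_granted,
    CredentialKind.API_TOKEN: _subject_is_granted,
    CredentialKind.API_KEY: _subject_is_granted,
}


def missing_handlers() -> List[CredentialKind]:
    # Returns every credential kind that has no registered handler.
    return [kind for kind in CredentialKind if kind not in HANDLERS]


def authorize_push(cred: Credential, granted: Set[str]) -> bool:
    handler = HANDLERS.get(cred.kind)
    if handler is None:
        # Fail with an explicit authorization error instead of an internal error.
        raise PermissionError(f"unsupported credential kind: {cred.kind.value}")
    return handler(cred, granted)
```

A unit test asserting that `missing_handlers()` returns an empty list would fail as soon as a new credential kind is added without a matching handler.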

Two major features that relied heavily on each other were merged in the same release cycle: the Docker registry authentication redesign, part of a change to the k8s infrastructure that exposes our Docker registry service, and the redeploy feature in Astro core. Merging large, coupled feature sets together is inherently risky. In this case, the releases could have been spread out to mitigate the risk.

Multiple components (CLI and Airflow deployment management) were released together, and there was no integration testing across the different combinations of versions of these components, so the failing scenario was not reproduced in testing. This gap has been identified, and we are working towards testing matrices that cover all in-use versions of Astro components.
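A testing matrix of this kind can be sketched as the Cartesian product of the in-use versions of each component. The Python below is only an illustration: the component names and version strings are made up, and `build_test_matrix` is a hypothetical helper, not part of any Astro tooling.

```python
from itertools import product
from typing import Dict, List


def build_test_matrix(in_use_versions: Dict[str, List[str]]) -> List[Dict[str, str]]:
    # Sort component names so the matrix is produced in a stable order.
    components = sorted(in_use_versions)
    version_lists = [in_use_versions[name] for name in components]
    # Each combination pairs one version of every component.
    return [dict(zip(components, combo)) for combo in product(*version_lists)]


# Hypothetical version numbers, for illustration only.
matrix = build_test_matrix({
    "cli": ["1.0", "1.1", "1.2"],
    "deployment_management": ["2.0", "2.1"],
})
```

Each entry in `matrix` describes one combination to run the integration suite against. With three CLI versions and two deployment management versions, that is six combinations.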

Posted Nov 17, 2023 - 17:30 UTC

Resolved
This incident has been resolved.
Posted Nov 08, 2023 - 19:15 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 08, 2023 - 15:35 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 08, 2023 - 14:12 UTC
Investigating
Deploying images from astro-cli seems to be resulting in:
`Error: Invalid request: Could not get docker registry token for request: Internal server error`
Posted Nov 08, 2023 - 11:29 UTC
This incident affected: Astro Hybrid (Deployment Management) and Astro Hosted (Deployment Management).