Astro Status - Incident History

Astro does not respect .airflowignore when deploying to Astro with Astro CLI 1.28.0

2024-07-25T15:24:03Z

Jul 25, 15:24 UTC
Resolved - This incident has been resolved. Please upgrade to Astro CLI version 1.28.1.

Jul 25, 14:42 UTC
Identified - The issue has been identified and a fix is being implemented.

All KubernetesExecutor tasks failing for some customers using DAG-only deploy

2024-07-03T19:19:29Z

Jul 3, 19:19 UTC
Resolved - This incident has been resolved.

Jul 3, 19:04 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Jul 3, 18:12 UTC
Identified - The issue has been identified and a fix is being implemented.

Jul 3, 18:12 UTC
Update - We are continuing to investigate this issue.

Jul 3, 17:48 UTC
Investigating - We are currently investigating this issue.

Update Deployment feature impacted on Hybrid GCP Clusters

2024-06-17T15:17:16Z

Jun 17, 15:17 UTC
Resolved - This incident has been resolved.

Jun 17, 11:51 UTC
Identified - The issue has been identified and a fix is being implemented.

Jun 17, 09:19 UTC
Investigating - We are currently experiencing an issue where customers hosted on Hybrid GCP Clusters are unable to update their deployments. Our engineering team is actively investigating the cause of this problem and working to resolve it as quickly as possible.

Please note the existing dags are running fine however updates to deployment resources(variables, worker queues, etc) are blocked.

Astronomer is currently investigating an issue preventing customers from deploying to Astro

2024-05-24T15:50:53Z

May 24, 15:50 UTC
Resolved - This incident has been resolved.

May 24, 15:25 UTC
Monitoring - The issue has been mitigated, and we are continuing to monitor.

May 24, 15:09 UTC
Investigating - We are currently investigating this issue.

Airflow API 500 Errors

2024-05-22T18:53:09Z

May 22, 18:53 UTC
Resolved - The incident has been resolved. During this time, DAGs that were making API calls to Airflow deployments may have failed - those tasks can now be reran.

May 22, 18:29 UTC
Investigating - When accessing the Airflow API programmatically, requests may fail and return an HTTP 500 status code. We are investigating the cause of this issue.

Access through the UI is unaffected and all DAGs will continue to run as normal.

GCP outage causing scaling issues for GCP clusters

2024-05-17T04:10:43Z

May 17, 04:10 UTC
Resolved - The service disruption caused by the Google Cloud Platform (GCP) outage has been resolved. All affected services have been restored to normal operation. Please refer to the GCP status page for more details on the incident.

https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre#RP1d9aZLNFZEJmTBk8e1

May 17, 00:27 UTC
Investigating - Fairly widespread GCP outage is preventing new Astro nodes in GCP from pulling images. This should not affect existing nodes and shouldn't affect running DAGs unless they need to scale

https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre#RP1d9aZLNFZEJmTBk8e1

Authentication Errors

2024-04-17T18:59:57Z

Apr 17, 18:59 UTC
Resolved - This incident has been resolved.

Apr 17, 15:46 UTC
Monitoring - We have received notification from our authentication provider that this is now resolved.

If you are still unable to login to Astro, please refresh your browser and try again.

Apr 17, 15:33 UTC
Identified - We are experiencing a failure with our upstream authentication provider that is causing users to receive a 404 error when trying to login to the Astro platform.

We are currently working with the provider to resolve the issue.

This issue does not affect DAGs from running.

Routine Control Plane Maintenance

2024-04-16T04:00:03Z

Apr 16, 04:00 UTC
Completed - The scheduled maintenance has been completed.

Apr 16, 02:00 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.

Apr 12, 19:16 UTC
Scheduled - We will be performing routine maintenance on the Astro Control Plane during this period. We do not expect any noticeable impact or downtime.

Worker Nodes Not Spinning Up in GCP Dataplane Clusters

2024-04-04T04:52:36Z

Apr 4, 04:52 UTC
Resolved - This incident has been resolved.

Apr 4, 04:51 UTC
Update - We are continuing to monitor for any further issues.

Apr 4, 03:56 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Apr 4, 03:32 UTC
Identified - The issue has been identified and the fix is being implemented.

Apr 4, 03:27 UTC
Investigating - Incident Description: Some worker nodes within several GCP dataplane clusters are failing to spin up as expected. This issue is causing delays in task execution and may lead to DAGs/tasks getting stuck in the queued state or failing.

Current Status: We have pinpointed the issue and confirmed its existence. Our engineering team is actively collaborating to resolve the problem.

Impact: Delays in task execution within affected clusters. There is a risk of DAGs/tasks getting stuck in the queued state or failing due to the inability to spin up worker nodes.

Resolution: Our engineering team is working diligently to implement a fix for this issue.

Communication: Regular updates will be provided to keep you informed of any developments.

We apologize for any inconvenience this may cause and appreciate your patience as we work to resolve this issue promptly. Please stay tuned for further updates.

Deployment metrics sometimes failing to load

2024-03-27T02:57:16Z

Mar 27, 02:57 UTC
Resolved - This incident has been resolved.

Mar 27, 02:42 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Mar 26, 23:19 UTC
Identified - Listing or viewing some deployments will display "metrics failed to load" instead of showing Dag Run and Task Instance summaries. Actual DAG Runs and tasks are continuing to execute correctly, and the Airflow UI is still accessible. We have identified the problem and are working on deploying a fix.

Quay.io image registry is having an outage

2024-03-22T18:55:17Z

Mar 22, 18:55 UTC
Resolved - This incident has been resolved.

Mar 22, 18:42 UTC
Monitoring - Quay.io appears to be back up

Mar 22, 17:31 UTC
Investigating - This outage affects:
* New cluster creation
* CI/CD pipelines which pull public images (e.g. Astro runtime) from Quay
* Provisioning new worker pods & nodes (scale up) for some clusters

Astronomer Cloud UI and API Unavailable

2024-03-16T14:57:34Z

Mar 16, 14:57 UTC
Resolved - We have identified the issue and a mitigation was applied. Services have resumed healthy operation. This issue is now resolved.

Mar 16, 12:43 UTC
Investigating - We are currently investigating an issue with the Astronomer Cloud UI and API. Please standby for further updates.

Astro CLI versions <= 1.22 are unable to successfully execute some commands

2024-03-06T22:43:43Z

Mar 6, 22:43 UTC
Resolved - The fix has been made and the issue is now resolved.

Mar 6, 17:42 UTC
Identified - Upgrading the Astro CLI to 1.24.1 is known to fix this issue.

A change to backend systems broke some functionality of the Astro CLI, including the ability to deploy code to Astro. We've identified the issue and are working to implement a fix.

Unable to update deployment from Astro UI

2024-02-28T23:30:31Z

Feb 28, 23:30 UTC
Resolved - Issues has been resolved!

Feb 28, 22:43 UTC
Monitoring - The issue has been fixed and deployment updates are now working from the UI. We will continue to monitor.

Feb 28, 22:07 UTC
Identified - Updates to non-development deployments via the Astro UI may be declined with an invalid request error. A fix is being worked on.

Intermittent Network and Scheduling Issues in AWS us-west-2 Region for Airflow Deployments

2024-02-07T00:27:23Z

Feb 7, 00:27 UTC
Resolved - This incident has been resolved.

Feb 6, 23:59 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Feb 6, 23:08 UTC
Investigating - Airflow deployments in the AWS US-West-2 region may encounter occasional network and scheduling disruptions. The affected cluster has been cordoned off to mitigate the impact on new implementations. Our team is actively investigating the issues within this cluster.

Intermittent Network and Scheduling Issues in AWS us-west-2 Region for Airflow Deployments

2024-02-06T06:24:57Z

Feb 6, 06:24 UTC
Resolved - This incident has been resolved.

Feb 6, 03:59 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Feb 6, 02:56 UTC
Investigating - Airflow deployments in the AWS US-West-2 region may encounter occasional network and scheduling disruptions. The affected cluster has been cordoned off to mitigate the impact on new implementations. Our team is actively investigating the issues within this cluster.

Astro Analytics - Degraded Performance

2024-01-26T00:29:02Z

Jan 26, 00:29 UTC
Resolved - This incident has been resolved.

Astronomer builds metrics in part by using a logging tool. The performance of the logging tool’s indexer was adversely impacted by an increase in scheduled queries, which overwhelmed the logging tool, resulting in a backup of queries, which in turn impacted the monitors in the Astro UI.

After optimizing scheduled queries, performance returned to normal.

Jan 26, 19:14 UTC
Investigating - Our team is currently investigating the degraded performance of Astro analytics service.

Astro Cloud UI "No Healthy Upstream"

2024-01-22T14:01:17Z

Jan 22, 14:01 UTC
Resolved - Incident has been resolved, all systems operational.

Jan 22, 12:43 UTC
Monitoring - The issue was identified, and a fix has been applied. We are currently monitoring the deployments.

Jan 22, 12:03 UTC
Update - The impact reassessed as major for expedited mitigation.

Jan 22, 12:01 UTC
Investigating - We are currently investigating this issue.

Hybrid customers are unable to create or update deployments

2024-01-17T23:00:00Z

Jan 17, 23:00 UTC
Resolved - Creating and updating deployments will result in failure which could manifest as Internal server errors (status code 500).

New worker pods in Azure AKS clusters unable to start

2023-12-21T21:55:43Z

Dec 21, 21:55 UTC
Resolved - This incident has been resolved.

Dec 21, 20:49 UTC
Monitoring - The issue has been identified and we are beginning to update the affected clusters. Worker pods that were stuck in Pending state are spinning up now.

Dec 21, 20:30 UTC
Investigating - We are aware of an issue with Azure and are currently investigating it.
Pods older than 1:30 PM CST (0630 UTC) are not affected.

Monitoring service in Astro Standard Clusters experiencing issues

2023-12-15T02:37:47Z

Dec 15, 02:37 UTC
Resolved - This incident has been resolved.

Dec 14, 23:14 UTC
Monitoring - Hotfix has been deployed to prod, affected clusters are being bootstrapped with the hotfix. We are monitoring the results of the fix.

Dec 14, 22:21 UTC
Update - Hotfix has been released to stage, we are validating the results and will proceed with the release to prod following that validation.

Dec 14, 21:12 UTC
Identified - The issue has been identified and a hotfix is being created and rolled out.

Dec 14, 20:31 UTC
Investigating - We are currently investigating an issue in the monitoring service Astronomer uses to monitor Astro Standard Clusters.

Bug in AstroAPI endpoint call deleting connections from Astro Environment

2023-12-09T01:13:01Z

Dec 9, 01:13 UTC
Resolved - This incident has been resolved.

Dec 8, 21:52 UTC
Identified - A bug has been identified in the Managed Connections of Astro Hosted Environments that deletes existing connections. A fix has been made and is being deployed.

Hybrid customers Unable to view Teams

2023-12-08T05:59:30Z

Dec 8, 05:59 UTC
Resolved - This incident has been resolved.

Dec 8, 05:52 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Dec 8, 05:51 UTC
Update - We are continuing to investigate this issue.

Dec 8, 04:32 UTC
Investigating - We are currently investigating this issue.

Modifying environment variables from the Astro UI may delete the values for other environment variables marked as "Secret"

2023-12-07T01:52:38Z

Dec 7, 01:52 UTC
Resolved - This incident has been resolved.

Dec 6, 19:30 UTC
Investigating - We are currently investigating this issue.

If you rely on setting environment variables in the Astro UI, please refrain from updating environment variables at this time.

Quay.io outage causing new pods to be stuck in Pending waiting to download container images

2023-11-15T19:09:45Z

Nov 15, 19:09 UTC
Resolved - This incident has been resolved.

Nov 15, 01:41 UTC
Monitoring - Quay.io has indicated that they have completed the fix and is operating correctly for pushes and pulls.

New Airflow services that come up are operational. We are continuing to monitor the situation as Quay.io has not marked their incident as resolved.

Nov 14, 22:19 UTC
Update - This issue is ongoing. We have observed some instances where images are able to be pulled, but we're continuing to observe widespread image pull issues.

We will update as more information becomes available.

Nov 14, 21:01 UTC
Update - Quay has indicated that they are continuing to experience instability and are moving their image repo to read-only mode, which will affect image push operations.

Nov 14, 20:53 UTC
Identified - Quay.io, the container image repository used by Astronomer is experiencing issues with image pull failures.

Quay.io incident: https://status.quay.io/incidents/z7sbjqmb34p1

We will continue monitoring the situation and update this incident as more information becomes available.

Existing pods should be unaffected and will continue executing tasks.