tag:status.astronomer.io,2005:/historyAstro Status - Incident History2024-03-28T15:47:47ZAstrotag:status.astronomer.io,2005:Incident/203674142024-03-27T02:57:16Z2024-03-27T02:57:16ZDeployment metrics sometimes failing to load<p><small>Mar <var data-var='date'>27</var>, <var data-var='time'>02:57</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>27</var>, <var data-var='time'>02:42</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Mar <var data-var='date'>26</var>, <var data-var='time'>23:19</var> UTC</small><br><strong>Identified</strong> - Listing or viewing some deployments will display "metrics failed to load" instead of showing DAG Run and Task Instance summaries. Actual DAG Runs and tasks are continuing to execute correctly, and the Airflow UI is still accessible. We have identified the problem and are working on deploying a fix.</p>tag:status.astronomer.io,2005:Incident/203304932024-03-22T18:55:17Z2024-03-22T18:55:17ZQuay.io image registry is having an outage<p><small>Mar <var data-var='date'>22</var>, <var data-var='time'>18:55</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>22</var>, <var data-var='time'>18:42</var> UTC</small><br><strong>Monitoring</strong> - Quay.io appears to be back up.</p><p><small>Mar <var data-var='date'>22</var>, <var data-var='time'>17:31</var> UTC</small><br><strong>Investigating</strong> - This outage affects:<br />* New cluster creation<br />* CI/CD pipelines which pull public images (e.g. 
Astro runtime) from Quay<br />* Provisioning new worker pods & nodes (scale up) for some clusters</p>tag:status.astronomer.io,2005:Incident/202774042024-03-16T14:57:34Z2024-03-16T14:57:34ZAstronomer Cloud UI and API Unavailable<p><small>Mar <var data-var='date'>16</var>, <var data-var='time'>14:57</var> UTC</small><br><strong>Resolved</strong> - We have identified the issue and a mitigation was applied. Services have resumed healthy operation. This issue is now resolved.</p><p><small>Mar <var data-var='date'>16</var>, <var data-var='time'>12:43</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating an issue with the Astronomer Cloud UI and API. Please stand by for further updates.</p>tag:status.astronomer.io,2005:Incident/201780152024-03-06T22:43:43Z2024-03-06T22:43:43ZAstro CLI versions <= 1.22 are unable to successfully execute some commands<p><small>Mar <var data-var='date'> 6</var>, <var data-var='time'>22:43</var> UTC</small><br><strong>Resolved</strong> - The fix has been made and the issue is now resolved.</p><p><small>Mar <var data-var='date'> 6</var>, <var data-var='time'>17:42</var> UTC</small><br><strong>Identified</strong> - Upgrading the Astro CLI to 1.24.1 is known to fix this issue.<br /><br />A change to backend systems broke some functionality of the Astro CLI, including the ability to deploy code to Astro. We've identified the issue and are working to implement a fix.</p>tag:status.astronomer.io,2005:Incident/201012282024-02-28T23:30:31Z2024-02-28T23:30:31ZUnable to update deployment from Astro UI<p><small>Feb <var data-var='date'>28</var>, <var data-var='time'>23:30</var> UTC</small><br><strong>Resolved</strong> - The issue has been resolved.</p><p><small>Feb <var data-var='date'>28</var>, <var data-var='time'>22:43</var> UTC</small><br><strong>Monitoring</strong> - The issue has been fixed and deployment updates are now working from the UI. 
We will continue to monitor.</p><p><small>Feb <var data-var='date'>28</var>, <var data-var='time'>22:07</var> UTC</small><br><strong>Identified</strong> - Updates to non-development deployments via the Astro UI may be rejected with an invalid request error. A fix is being worked on.</p>tag:status.astronomer.io,2005:Incident/199259282024-02-07T00:27:23Z2024-02-07T00:27:23ZIntermittent Network and Scheduling Issues in AWS us-west-2 Region for Airflow Deployments<p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>00:27</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>23:59</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>23:08</var> UTC</small><br><strong>Investigating</strong> - Airflow deployments in the AWS US-West-2 region may encounter occasional network and scheduling disruptions. The affected cluster has been cordoned off to mitigate the impact on new deployments. Our team is actively investigating the issues within this cluster.</p>tag:status.astronomer.io,2005:Incident/199180242024-02-06T06:24:57Z2024-02-06T06:24:57ZIntermittent Network and Scheduling Issues in AWS us-west-2 Region for Airflow Deployments<p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>06:24</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>03:59</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>02:56</var> UTC</small><br><strong>Investigating</strong> - Airflow deployments in the AWS US-West-2 region may encounter occasional network and scheduling disruptions. 
The affected cluster has been cordoned off to mitigate the impact on new deployments. Our team is actively investigating the issues within this cluster.</p>tag:status.astronomer.io,2005:Incident/198326632024-01-26T00:29:02Z2024-01-26T00:29:03ZAstro Analytics - Degraded Performance<p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>00:29</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.<br /><br />Astronomer builds metrics in part by using a logging tool. An increase in scheduled queries overwhelmed the logging tool’s indexer, creating a backlog of queries that in turn impacted the monitors in the Astro UI.<br /><br />After optimizing scheduled queries, performance returned to normal.</p><p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>19:14</var> UTC</small><br><strong>Investigating</strong> - Our team is currently investigating the degraded performance of the Astro analytics service.</p>tag:status.astronomer.io,2005:Incident/197933552024-01-22T14:01:17Z2024-01-26T21:42:38ZAstro Cloud UI "No Healthy Upstream"<p><small>Jan <var data-var='date'>22</var>, <var data-var='time'>14:01</var> UTC</small><br><strong>Resolved</strong> - Incident has been resolved, all systems operational.</p><p><small>Jan <var data-var='date'>22</var>, <var data-var='time'>12:43</var> UTC</small><br><strong>Monitoring</strong> - The issue was identified, and a fix has been applied. 
We are currently monitoring the deployments.</p><p><small>Jan <var data-var='date'>22</var>, <var data-var='time'>12:03</var> UTC</small><br><strong>Update</strong> - The impact was reassessed as major to expedite mitigation.</p><p><small>Jan <var data-var='date'>22</var>, <var data-var='time'>12:01</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:status.astronomer.io,2005:Incident/197450712024-01-17T23:00:00Z2024-01-30T19:55:13ZHybrid customers are unable to create or update deployments<p><small>Jan <var data-var='date'>17</var>, <var data-var='time'>23:00</var> UTC</small><br><strong>Resolved</strong> - Creating and updating deployments resulted in failures, which could manifest as internal server errors (status code 500). This incident has been resolved.</p>tag:status.astronomer.io,2005:Incident/195009912023-12-21T21:55:43Z2023-12-21T21:55:43ZNew worker pods in Azure AKS clusters unable to start<p><small>Dec <var data-var='date'>21</var>, <var data-var='time'>21:55</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Dec <var data-var='date'>21</var>, <var data-var='time'>20:49</var> UTC</small><br><strong>Monitoring</strong> - The issue has been identified and we are beginning to update the affected clusters. 
Worker pods that were stuck in Pending state are spinning up now.</p><p><small>Dec <var data-var='date'>21</var>, <var data-var='time'>20:30</var> UTC</small><br><strong>Investigating</strong> - We are aware of an issue with Azure and are currently investigating it.<br />Pods started before 1:30 PM CST (19:30 UTC) are not affected.</p>tag:status.astronomer.io,2005:Incident/194420912023-12-15T02:37:47Z2023-12-15T02:37:47ZMonitoring service in Astro Standard Clusters experiencing issues<p><small>Dec <var data-var='date'>15</var>, <var data-var='time'>02:37</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Dec <var data-var='date'>14</var>, <var data-var='time'>23:14</var> UTC</small><br><strong>Monitoring</strong> - Hotfix has been deployed to prod, affected clusters are being bootstrapped with the hotfix. We are monitoring the results of the fix.</p><p><small>Dec <var data-var='date'>14</var>, <var data-var='time'>22:21</var> UTC</small><br><strong>Update</strong> - Hotfix has been released to stage, we are validating the results and will proceed with the release to prod following that validation.</p><p><small>Dec <var data-var='date'>14</var>, <var data-var='time'>21:12</var> UTC</small><br><strong>Identified</strong> - The issue has been identified and a hotfix is being created and rolled out.</p><p><small>Dec <var data-var='date'>14</var>, <var data-var='time'>20:31</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating an issue in the monitoring service Astronomer uses to monitor Astro Standard Clusters.</p>tag:status.astronomer.io,2005:Incident/193808552023-12-09T01:13:01Z2023-12-09T01:13:01ZBug in AstroAPI endpoint call deleting connections from Astro Environment<p><small>Dec <var data-var='date'> 9</var>, <var data-var='time'>01:13</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Dec <var data-var='date'> 8</var>, <var 
data-var='time'>21:52</var> UTC</small><br><strong>Identified</strong> - A bug has been identified in the Managed Connections feature of Astro Hosted Environments that deletes existing connections. A fix has been made and is being deployed.</p>tag:status.astronomer.io,2005:Incident/193729972023-12-08T05:59:30Z2023-12-08T05:59:30ZHybrid customers unable to view Teams<p><small>Dec <var data-var='date'> 8</var>, <var data-var='time'>05:59</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Dec <var data-var='date'> 8</var>, <var data-var='time'>05:52</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Dec <var data-var='date'> 8</var>, <var data-var='time'>05:51</var> UTC</small><br><strong>Update</strong> - We are continuing to investigate this issue.</p><p><small>Dec <var data-var='date'> 8</var>, <var data-var='time'>04:32</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:status.astronomer.io,2005:Incident/193566622023-12-07T01:52:38Z2023-12-14T16:50:02ZModifying environment variables from the Astro UI may delete the values for other environment variables marked as "Secret"<p><small>Dec <var data-var='date'> 7</var>, <var data-var='time'>01:52</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Dec <var data-var='date'> 6</var>, <var data-var='time'>19:30</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating this issue.<br /><br />If you rely on setting environment variables in the Astro UI, please refrain from updating environment variables at this time.</p>tag:status.astronomer.io,2005:Incident/191215252023-11-15T19:09:45Z2023-11-30T07:20:52ZQuay.io outage causing new pods to be stuck in Pending waiting to download container images<p><small>Nov <var data-var='date'>15</var>, <var data-var='time'>19:09</var> 
UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Nov <var data-var='date'>15</var>, <var data-var='time'>01:41</var> UTC</small><br><strong>Monitoring</strong> - Quay.io has indicated that they have completed the fix and that the registry is operating correctly for pushes and pulls.<br /><br />New Airflow services that come up are operational. We are continuing to monitor the situation as Quay.io has not marked their incident as resolved.</p><p><small>Nov <var data-var='date'>14</var>, <var data-var='time'>22:19</var> UTC</small><br><strong>Update</strong> - This issue is ongoing. We have observed some instances where images can be pulled, but we're continuing to observe widespread image pull issues.<br /><br />We will update as more information becomes available.</p><p><small>Nov <var data-var='date'>14</var>, <var data-var='time'>21:01</var> UTC</small><br><strong>Update</strong> - Quay has indicated that they are continuing to experience instability and are moving their image repo to read-only mode, which will affect image push operations.</p><p><small>Nov <var data-var='date'>14</var>, <var data-var='time'>20:53</var> UTC</small><br><strong>Identified</strong> - Quay.io, the container image repository used by Astronomer, is experiencing issues with image pull failures.<br /><br />Quay.io incident: https://status.quay.io/incidents/z7sbjqmb34p1<br /><br />We will continue monitoring the situation and update this incident as more information becomes available.<br /><br />Existing pods should be unaffected and will continue executing tasks.</p>tag:status.astronomer.io,2005:Incident/190600342023-11-08T22:39:56Z2023-11-28T03:45:11ZTasks from deployments with KubernetesExecutor are unable to execute<p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>22:39</var> 
UTC</small><br><strong>Update</strong> - Further correction: the only affected deployments are those under Astro Hybrid that have updated in the last day and have DAG deploys enabled.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>19:21</var> UTC</small><br><strong>Update</strong> - We have identified that the only affected deployments are those under Astro Hybrid that have updated in the last day. We are continuing to monitor the results of the fix.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>19:20</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>19:10</var> UTC</small><br><strong>Update</strong> - We have identified that the only affected deployments are those under Astro Hybrid that have updated in the last day.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>16:55</var> UTC</small><br><strong>Identified</strong> - Due to an issue preventing Kubernetes Airflow worker pods from downloading DAGs, these pods are unable to initialize, leaving task instances stuck in the queued state. 
<br /><br />This only affects Airflow deployments using KubernetesExecutor.</p>tag:status.astronomer.io,2005:Incident/190616352023-11-08T22:39:08Z2023-11-08T22:39:08ZAstro CLI cannot access Airflow variables and connections<p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>22:39</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>20:11</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>19:19</var> UTC</small><br><strong>Identified</strong> - When using the Astro CLI to access deployment variables and connections, you may receive the following error: `failed to decode response from API`. If you receive this message when modifying variables or connections, the modifications have not taken effect.<br /><br />Examples of CLI commands that may fail are:<br />- astro deployment airflow-variable list<br />- astro deployment connection list<br />- astro deployment connection update<br /><br />Our team has identified the issue and is releasing a fix.</p>tag:status.astronomer.io,2005:Incident/190558312023-11-08T19:15:05Z2023-11-17T17:32:42ZImage-based deploys are unavailable<p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>19:15</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>15:35</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>14:12</var> UTC</small><br><strong>Identified</strong> - The issue has been identified and a fix is being implemented.</p><p><small>Nov <var data-var='date'> 8</var>, <var data-var='time'>11:29</var> UTC</small><br><strong>Investigating</strong> - Deploying 
images from the Astro CLI is failing with:<br />`Error: Invalid request: Could not get docker registry token for request: Internal server error`</p>tag:status.astronomer.io,2005:Incident/190491492023-11-07T20:00:34Z2023-11-07T20:00:34ZSome Astro Hybrid Worker Pods failing to start and image-based deploys are unavailable<p><small>Nov <var data-var='date'> 7</var>, <var data-var='time'>20:00</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Nov <var data-var='date'> 7</var>, <var data-var='time'>19:51</var> UTC</small><br><strong>Update</strong> - We are continuing to monitor for any further issues.</p><p><small>Nov <var data-var='date'> 7</var>, <var data-var='time'>19:44</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results. All systems are believed to be fully operational at this time.</p><p><small>Nov <var data-var='date'> 7</var>, <var data-var='time'>19:21</var> UTC</small><br><strong>Identified</strong> - We have identified the issue and are working to implement a fix.<br /><br />New tasks may get stuck in the queued state until the fix is implemented, and image deploys will not work.</p><p><small>Nov <var data-var='date'> 7</var>, <var data-var='time'>19:18</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:status.astronomer.io,2005:Incident/190036622023-11-03T05:01:24Z2023-11-03T05:01:24ZVault Maintenance (Delayed)<p><small>Nov <var data-var='date'> 3</var>, <var data-var='time'>05:01</var> UTC</small><br><strong>Completed</strong> - The scheduled maintenance has been completed.</p><p><small>Nov <var data-var='date'> 3</var>, <var data-var='time'>03:00</var> UTC</small><br><strong>In progress</strong> - Scheduled maintenance is currently in progress. 
We will provide updates as necessary.</p><p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>20:30</var> UTC</small><br><strong>Scheduled</strong> - Update: The maintenance previously scheduled for yesterday was delayed to today.<br /><br />We will be updating our secrets backend. During this process we are making certain actions unavailable in order to ensure a smooth transition. During the maintenance window, it will not be possible to create new or update existing Clusters, Deployments, API keys, or Cloud IDE projects. DAGs will continue to function and you will not lose access to the UI to view progress or logs of your tasks.</p>tag:status.astronomer.io,2005:Incident/188704772023-11-02T05:00:05Z2023-11-02T05:00:05ZVault Production Maintenance<p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>05:00</var> UTC</small><br><strong>Completed</strong> - The scheduled maintenance has been completed.</p><p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>03:00</var> UTC</small><br><strong>In progress</strong> - Scheduled maintenance is currently in progress. We will provide updates as necessary.</p><p><small>Oct <var data-var='date'>20</var>, <var data-var='time'>19:25</var> UTC</small><br><strong>Scheduled</strong> - We will be updating our secrets backend. During this process we are making certain actions unavailable in order to ensure a smooth transition. During the maintenance window, it will not be possible to create new or update existing Clusters, Deployments, API keys, or Cloud IDE projects. 
DAGs will continue to function and you will not lose access to the UI to view progress or logs of your tasks.</p>tag:status.astronomer.io,2005:Incident/189935252023-11-01T22:36:58Z2023-11-01T22:36:58ZSome customers seeing a client-side error when trying to view deployments<p><small>Nov <var data-var='date'> 1</var>, <var data-var='time'>22:36</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Nov <var data-var='date'> 1</var>, <var data-var='time'>21:16</var> UTC</small><br><strong>Monitoring</strong> - The dev team has fixed the issue and we are monitoring the system.</p><p><small>Nov <var data-var='date'> 1</var>, <var data-var='time'>21:07</var> UTC</small><br><strong>Identified</strong> - This is a client-side issue and affects only a subset of customers. All underlying data is safe. The team has identified the cause and is fixing the issue now.</p>tag:status.astronomer.io,2005:Incident/189717822023-10-30T22:31:16Z2023-11-03T20:31:10ZAstro UI and Worker Scaling Impacted by Authentication Provider Outage<p><small>Oct <var data-var='date'>30</var>, <var data-var='time'>22:31</var> UTC</small><br><strong>Resolved</strong> - Normal performance has returned to all Astro components.</p><p><small>Oct <var data-var='date'>30</var>, <var data-var='time'>20:48</var> UTC</small><br><strong>Update</strong> - We have seen some slowness in the UI as Astro recovers.<br /><br />We have identified that the reason for task failures was that on Deployments with KubernetesExecutor and dag-only-deployments enabled, worker pods could not start up successfully and marked their tasks as failed. 
Similarly, CeleryExecutor Deployments on Runtime 8+ mark tasks as failed after being queued for 10 minutes, and so the slower execution caused by degraded auto-scaling led some tasks to be marked as failed.</p><p><small>Oct <var data-var='date'>30</var>, <var data-var='time'>20:32</var> UTC</small><br><strong>Monitoring</strong> - The underlying provider outage appears to have ended. Access to Astro is now restored and appears to be functioning normally again. We did see that more tasks than usual failed during this period. We are continuing to investigate why these failures occurred. Please review your recent task runs as you may need to clear and restart tasks.</p><p><small>Oct <var data-var='date'>30</var>, <var data-var='time'>20:17</var> UTC</small><br><strong>Investigating</strong> - The authentication provider for Astro appears to be having a major outage, and this is preventing access to Astro. This will not directly stop the successful execution of tasks, but it will impact auto scaling of workers, which could lead to much slower rates of task execution.</p>tag:status.astronomer.io,2005:Incident/189607302023-10-29T13:00:00Z2023-11-03T20:16:26ZWorker Scaling and UI Access Outage<p><small>Oct <var data-var='date'>29</var>, <var data-var='time'>13:00</var> UTC</small><br><strong>Resolved</strong> - Earlier today, there was an issue with Astro that prevented new workers from starting up. For customers who deployed during this time, the lack of new workers meant that Airflow was not running any tasks. This same underlying cause also prevented some customers from viewing the Deployments page on the Astro UI. This issue is now resolved, and we will be conducting a full root cause analysis.</p>