"The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something."
Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features. This can take a long time, and it cannot happen without the full support of management, who neither fully understand the problem nor are incentivized to understand it.
Notice how "the talent left after the launch" is mentioned in the article? Same problem. You don't get rewarded for cleaning up the mess (despite lip service from management), nor for maintaining the product after the launch. Only big launches matter.
The other corporate problem is that it takes time before the cleanup produces measurable benefits, and you may well get reorged before that happens.
This is the root of the issue. For something like Azure, people are not fungible. You need to retain them for decades and carefully grow the team, training new members over a long period until they can take on serious responsibilities.
But employees are rewarded for showing quick wins and changing jobs rapidly, and employers are rewarded for getting rid of high earners (i.e. senior, long-term employees).
> For something like Azure, people are not fungible
What I've learned from a decade in the industry is that talent is never fungible in low-demand areas. It's surprisingly hard to find people that "get it" and produce something worthwhile together.
“Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.”
- Edsger Wybe Dijkstra
When things become too complicated, no one dares to make new systems. And if you don't make new systems, of course you have to learn system design the other way around: by fixing every bug in existing systems.
> low-demand areas

A geographic area where there isn't abundant opportunity for software developers; usually everywhere outside the major metro areas. It was primarily meant to discount experiences from SF or Seattle, where I'm sure finding talent is easy enough, assuming you are willing to pay.
This is a human problem. We praise the doctors who can keep patients with terminal illnesses alive for extended periods, but ignore those who teach us how to prevent those illnesses in the first place. We throw flowers and money at doctors who treat cancer, but do we do the same for the ones who show us how to avoid it? No.
The same goes for MSFT or any similar problem. Humans only care when the house is on fire; under modern capitalism, that means the stock dropping 50%, and only then is there the will to make changes.
That's also why reforms rarely succeed, and the ones that do usually follow a huge shitstorm in which people begged for change.
> You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch
I have never worked at a shop or on a codebase where "move fast & break things, then fix it later" ever got to the "fix it later" part. I've worked at large orgs with large old codebases where the % of effort needed for BAU / KTLO (business as usual / keeping the lights on) slowly climbs to 100%. Usually it's some combination of tech debt accumulation, staffing reduction, and scale/scope increases pushing the existing system to its limits.
This is related to a worry I have about AI. I hear a lot of expectations that we're just going to increase code velocity 5x, from people who have never maintained a product before.
So moving faster & breaking more things (accumulating more tech debt) will probably produce more rapid catastrophic outcomes for products in this new phase. Then we will have some sort of Butlerian Jihad, or Agile v2.
Problem is, if you want to be a serious cloud provider, you have to do exactly that. I am slowly moving my apps off of any Microsoft services, because they tend to be slow and buggy.
They also remove features from their products too often, and I have no desire to migrate working systems just because MS wants to move people to other products.
And these things have been getting worse recently. Power Automate is the prime example for me: theoretically a neat tool that is well integrated into the cloud landscape; practically, you cannot implement reliable workflows with it, for numerous reasons.
> If you’re running production workloads on Azure or relying on it for mission-critical systems, this story matters more than you think.
Well, it doesn't explode, but I really question how reliable some of these systems are. In my experience: not very. There was, or is, some genuinely good engineering beneath some of these systems, but all the buggy fluff built on top of it introduces real friction.
Perhaps an important question is: why is it not incentivized in corporate environments?
I think, however, that perhaps I'm asking in the wrong arena. Unless there are people reading this who work in a corporate environment at the level where those decisions are made, it would really amount to guessing and stereotypes. Generally, I like to think that just about anyone can grasp that a well-made product will sell better by its nature. There must be some kind of mutual disconnect where one side continues to see improvements as important, and the other fundamentally does not (or has no functional means to measure and verify them).
It's a cool talent filter, though. If you're hiring, the set of people who quit doomed projects, and how fast they quit, is a really great indicator of their skill at evaluating technology.
In a product where a customer has to apply updates (or at least be aware of them), it's easier to excite them about new features than about bug fixes.
Especially for winning over new customers.
If the changelog for a product's last 5 releases is nothing but bug fixes (or worse, "refactoring" that isn't externally visible), most will assume either that development is dead or that the product is horribly bug-ridden - a bad look either way.
> This isn't incentivized in corporate environment.
Of course it is. But only by the winners, who reward the employees who do the valuable work. Microsoft has all sorts of stupid reasons why they have lots of customers - all basically proxies for their customers' IT staff being used to administering Microsoft-based systems - but if they mess up the core reasons to use a cloud badly enough, they will fail.
No joke: I worked at a place where, in our copy of the system headers, we had to #define near and far to nothing. That was because (despite not having supported any system where this was applicable for more than a decade) there was a set of files, considered too risky to change, that still had DOS-style near and far pointers, which we had to compile for a more sane linear address space. https://www.geeksforgeeks.org/c/what-are-near-far-and-huge-p...
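(For anyone who never had the pleasure: near and far were segmented-memory pointer qualifiers from the 16-bit x86 days. The shim described above presumably amounted to something like this in the private header copy. A reconstruction, not the actual code:)

    // Private copy of a system header (hypothetical reconstruction):
    // erase the obsolete segmented-memory qualifiers so that ancient
    // declarations compile for a flat linear address space.
    #define near
    #define far

    // Untouchable legacy declarations like this one keep compiling,
    // because 'far' now expands to nothing:
    char far *legacy_buffer;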
Now, I'm just a simple country engineer, but a sane take on risk management probably doesn't prefer de facto editing files by hijacking keywords with preprocessor magic over, you know, just making the actual change, reviewing it, and checking it in.
I was once in such a position. I persuaded management to first cover the entire project with an extensive test suite before touching anything. It took us around 3 months to get "good" coverage, and then we started refactoring the parts that were 100% covered. 5 months in, the shareholders got impatient and demanded "results". We were not ready yet, and in their minds we were doing nothing. No amount of explanation helped; they thought we were just adding superficial work ("the project worked before and we were shipping new features! Maybe you are just not skilled enough?"). Eventually they decided to scrap the whole thing. The project was killed and the entire team sacked.
Beware this goal. I'm dealing with the consequences of TDD taken way too far right now. Someone apparently had this same idea.
> management who do not fully understand the problem nor are incentivized to understand it
They are definitely incentivized to understand the problem. However, the developers often take it upon themselves to deceive management; that happens to be their incentive. The longer they can hoodwink leadership, the longer they can pad their resumes and otherwise play around in corporate Narnia.
It's amazing how far you can bullshit leaders under the pretense of how proper and cultured things like TDD are. There are compelling metrics and it has a very number-go-up feel to it. It's really easy to pervert all other aspects of the design such that they serve at the altar of TDD.
Integration testing is the only testing that matters to the customer. No one cares if your user service works flawlessly with fake everything plugged into it. I've never seen it not come off like someone playing SimCity or Factorio with the codebase in the end.
> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs
The exact same approach is recommended in the book "Working Effectively with Legacy Code" by Michael Feathers, along with several techniques for how to do it. He defines legacy code simply as "code with no tests".
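(For the unfamiliar, Feathers' first move is the "characterization test": before changing anything, write tests that pin down what the code currently does, right or wrong. A minimal hypothetical sketch:)

    #include <cassert>

    // Hypothetical legacy function whose intended behavior nobody remembers.
    int legacy_round_to_ten(int cents) {
        return (cents / 10) * 10;   // turns out it truncates rather than rounds
    }

    int main() {
        // Characterization tests assert what the code DOES, not what a spec
        // says it should do. They turn "fear of breaking something" into a
        // concrete signal: any behavior change now fails a named test.
        assert(legacy_round_to_ten(19) == 10);  // documents the truncation
        assert(legacy_round_to_ten(20) == 20);
        assert(legacy_round_to_ten(-5) == 0);   // integer division toward zero
        return 0;
    }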
Those codebases are so hard to test, too. A lot of the time there's IO and implicit state threaded through the code. Even getting testing in place at all, let alone good testing, is often an incredibly difficult task. And no one will refactor the code to make testing easier, because they're too afraid of breaking it.
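(The usual escape hatch for exactly this IO problem, also from Feathers, is to introduce a "seam": move the IO behind an interface with the smallest possible edit, so the logic becomes testable without touching anything else. A hypothetical sketch, with made-up names:)

    #include <string>

    // Before (untestable): the network call is hard-wired into the logic.
    //   bool is_over_quota(const std::string& user) {
    //       long used = query_billing_service(user);   // real network IO
    //       return used > 1000000;
    //   }

    // After: the IO sits behind a seam that tests can substitute.
    struct UsageSource {
        virtual long bytes_used(const std::string& user) = 0;
        virtual ~UsageSource() = default;
    };

    bool is_over_quota(const std::string& user, UsageSource& usage) {
        return usage.bytes_used(user) > 1000000;
    }

    // Test double: deterministic, no network, no implicit state.
    struct FakeUsage : UsageSource {
        long value;
        explicit FakeUsage(long v) : value(v) {}
        long bytes_used(const std::string&) override { return value; }
    };

    int main() {
        FakeUsage over(2000000), under(10);
        return (is_over_quota("alice", over) && !is_over_quota("bob", under)) ? 0 : 1;
    }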
Language choices won't save you here. The problem is organizational paralysis. Someone sees that the platform is unstable. They demand something be done to improve stability. The next management layer above them demands they reduce the number of changes made to improve stability.
Usually this results in approvals to approve the approval to approve making the change. Everyone signed off on a tower of tax forms about the change, no way it can fail now! It failed? We need another layer of approvals before changes can be made!
Hence the rewrite-it-in-Rust initiative, presumably. Management was aware of this problem at some level but chose a questionable solution. I don't think rewriting everything in Rust is at all compatible with their feature timelines or their severe shortage of systems programming talent.
Once you reach this stage, honestly the only escape is real escape. Put your papers in and start looking for a job elsewhere, because when they go down, they will go down hard and drag you with them. It's not like you didn't try.
Though this doesn't make much sense on its face: a bug means something is already broken, and he tells of millions of crashes per month, so it was visibly broken. A 100% chance of being broken (the known bug) outweighs some chance of breakage from fixing it.
(Sure, the cost of the current bug and of a potential new one isn't weighed here, but neither is it in "afraid to break something, do nothing".)
I've experienced a nearly identical scenario where a large fleet of identical servers (Citrix session hosts) were crashing at a "rate" high enough that I had to "scale up" my crash dump collection scripts with automated analysis, distribution into about a hundred buckets, and then per-bucket statistical analysis of the variables. I had to compress, archive, and then simply throw away crash dumps because I had too many.
It was pure insanity, the crashes were variously caused by things like network drivers so old and vulnerable that "drive by" network scans by malware would BSOD the servers. Alternatively, successful virus infections would BSOD the servers because the viruses were written for desktop editions of Windows and couldn't handle the differences in the server edition, so they'd just crash the system. On and on. It was a shambling zombie horde, not a server farm.
I was made to jump through flaming hoops backwards to prove beyond a shadow of a doubt that every single individual critical Microsoft security patch a) definitely fixed one of the crash bugs and b) didn't break any apps.
I did so! I demonstrated a 3x improvement in overall performance -- which by itself is staggering -- and that BSODs dropped by a factor of hundreds. I had pages written up on each and every patch, specifically calling out how they precisely matched a bucket of BSODs exactly. I tested the apps. I showed that some of them that were broken before suddenly started working. I did extensive UAT, etc.
"No." was the firm answer from management.
"Too dangerous! Something could break! You don't know what these patches could do!" etc, etc. The arguments were pure insanity, totally illogical, counter to all available evidence, and motived only by animal fear. These people had been burned before, and they're never touching the stove again, or even going into the kitchen.
You cannot fix an organisation like this "from below" as an IC, or even a mid-level manager. CEOs would have a hard time turning a ship like this around. Heads would have to roll, all the way up to CIO, before anything could possibly be fixed.
> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features.
Isn't this where Oracle is with their DB? Wasn't HN complaining about that?
I don't know if any of this is true, but as a user of Azure every day this would explain so much.
The Azure UI feels like a janky mess, barely held together. The documentation is obviously entirely written by AI and is constantly out of date or wrong. They offer such a huge volume of services that it's nearly impossible to figure out which service you actually want/need without consultants, and when you finally get the services up, who knows if they actually work as advertised.
I'm honestly shocked anything manages to stay working at all.
A businessman at a prior employer, sympathetic to my younger, naive "Microsoft sucks" attitude, told me something I remember to this day:
Microsoft is not a software company; they have never been experts at software. They are experts at contracts. They lead because their business machine excels at understanding how to tick the boxes necessary to win contract bids. The people who make purchasing decisions at companies aren't technical, and possibly don't even know a world outside Microsoft, Office, and Windows, after all.
This is how the sausage is made in the business world, and it changed how I perceived the tech industry. Good software (sadly) doesn't matter. Sales does.
This is why most of Norway currently runs on Azure, even though it is garbage, and even though every engineer I know who uses it says it is garbage. Because the people in the know don't get to make the decision.
What are we reading here? These are extraordinary statements, made with apparent credibility. They sound reasonable. Is this a whistleblower, or an ex-employee with a grudge? The appearance is the first. Is it? They've put their name to some clear and worrying statements.
> On January 7, 2025… I sent a more concise executive summary to the CEO. … When those communications produced no acknowledgment, I took the customary step of writing to the Board through the corporate secretary.
Why is that customary? I have not come across it, and though I have seen situations of some concern in the past, I have little experience with US corporate norms. What is normal here for this level of concern?
More to the point, why is this a public post and not a court case for wrongful termination?
Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?
The post is so dramatized, and so clearly written by someone with a grudge, that it really detracts from whatever point is being made, if there is one.
From another former Azure eng, now elsewhere and still working on big systems: the post gets way, way more boring when you realize that "Principal Group Manager" is just an M2, and Principal in general is roughly a Google L6 (maybe even L5) equivalent. Similarly, a Sev2 is hardly notable for anyone actually working on the foundational infra. There are certainly problems in Azure, but it's huge, and rough edges are to be expected. It mostly marches on. IMO maturity is realizing this and working within the system to improve it, rather than laying out all the dirty laundry for an Internet audience that will undoubtedly lap it up and happily cry Microslop.
Last thing: the final part 6 comes off as really childish. Risks to national security and sending letters to the board, really? Azure is still chugging along despite everything mentioned. People come in all the time crying that everything is broken and needs to be scrapped and rewritten, but it's hardly ever true.
> Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025—most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.
This is what people should know when seeing massive layoffs due to AI.
A previous colleague of mine has to work with Azure day to day, and everything explained in this article makes a lot of sense given the massive rants about the platform I hear from them.
12 years ago I had to choose whether to specialize in AWS, GCP, or Azure, and from my very brief foray with Azure I could see it was an absolute mess of broken, slow, click-ops methodology. This article confirms my suspicions from that time, and my colleague's experience.
> The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
The personal account makes a lot of sense, although I can easily see why the OP was not successful. Even if you are an excellent engineer, getting people to do things, accept ideas, and in general hear you out requires a completely different skill: being a good communicator.
The second thing is that this series of blog posts (whether true or not, it is at least believable) provides a good introduction for vibe coders: people who have not written a single line of code themselves and have not worked on any system at scale, yet believe that coding is somehow magically "solved" thanks to LLMs.
Writing the actual code itself (fully or partially), maybe yes. But understanding the complexity of the system, and working within the organisational structures that support it, is a completely different ball game.
This reads as pretty bad, and I believe it was. I worked on (and was at least partly responsible for) systems that do the same thing he describes. It took constant force of will, fighting, escalation, etc. to hold the line and maintain some basic level of stability and engineering practice.
And I've worked other places that had problems similar to the core problems described, not quite as severe, and not at the same scale, but bad enough to doom them (IMO) to a death loop they won't recover from.
I had the misfortune of having to use Azure back in 2018 and was appalled at the lack of quality and the slowness. I was in GitHub forums helping other customers who were suffering from missing basic functionality and incredible prices for abysmal performance. This article honestly explains a lot.
Google’s Cloud feels like the best engineered one, though lack of proper human support is worrying there compared to AWS.
People who can "reduce" a big system to build on a few simple concepts are few and far between. Most people just add more stuff instead.
- Rich Hickey
Loyalty will often not be rewarded; most of us have seen companies purge decade-tenured senior staff a year before going public.
It is very easy to become cynical about the mythology of silicon valley. =3
You're in an unwinnable position. Don't take the brunt of management's mistakes. Don't try to fix what you have no agency over.
> I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something.
And that, my friends, is why you want a memory-safe language, with as many static guarantees as possible checked automatically by the compiler.
Some things are beyond your control and capabilities.
Is Microsoft committing accounting fraud?
https://x.com/DaveManouchehri/status/2037001748489949388
Nobody seems to care.
and: "I also see I have 2 instances of Outlook, and neither of those are working." -Artemis II astronaut
> The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
That is quite scary