"The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something."
Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features. This can take a long time, and it cannot happen without the full support of management, who neither fully understand the problem nor are incentivized to understand it.
Notice how "the talent left after the launch" is mentioned in the article? Same problem. You don't get rewarded for cleaning up the mess (despite lip service from management), nor for maintaining the product after the launch. Only big launches matter.
The other corporate problem is that it takes time before the cleanup produces measurable benefits, and you may well get reorged before that happens.
This is the root of the issue. For something like Azure, people are not fungible. You need to retain them for decades and carefully grow the team, training new members over a long period until they can take on serious responsibilities.
But employees are rewarded for showing quick wins and changing jobs rapidly, and employers are rewarded for getting rid of high earners (i.e. senior, long-term employees).
> For something like Azure, people are not fungible
What I've learned from a decade in the industry is that talent is never fungible in low-demand areas. It's surprisingly hard to find people that "get it" and produce something worthwhile together.
“Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.”
- Edsger Wybe Dijkstra
When things become too complicated, no one dares to make new systems. And if you don't make new systems, of course you have to learn system design the other way around: by fixing every bug in existing systems.
> low-demand areas

A geographic area where there isn't abundant opportunity for software developers; usually everywhere outside the major metro areas. It was primarily meant to discount experiences from SF or Seattle, where I'm sure finding talent is easy enough, assuming you are willing to pay.
This is a human problem. We praise the doctors who can keep patients with terminal illnesses alive for extended periods, but ignore those who teach us how to prevent those illnesses in the first place. We throw flowers and money at doctors who treat cancer, but do we do the same for the ones who show us how to avoid it? No.
The same goes for MSFT or any similar problem. Humans only care when the house is on fire; under modern capitalism, that means the stock dropping 50%, and only then is there the will to make changes.
That's also why reforms rarely succeed, and the ones that do usually follow a huge shitstorm in which people begged for change.
> You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch
I have never worked at a shop or on a codebase where "move fast & break things, then fix it later" ever got to the "fix it later" part. I've worked at large orgs with large old codebases where the % of effort needed for BAU / KTLO (business as usual / keeping the lights on) slowly climbs to 100%. Usually it's some combination of tech debt accumulation, staffing reduction, and scale/scope increases pushing the existing system to its limits.
This is related to a worry I have about AI. I hear a lot of expectations that we're just going to increase code velocity 5x, from people who have never maintained a product before.
So moving faster & breaking more things (accumulating more tech debt) will probably produce more rapid catastrophic outcomes for products in this new phase. Then we will have some sort of Butlerian Jihad, or Agile v2.
Problem is, if you want to be a serious cloud provider, you have to do exactly that. I am slowly moving my apps off of any Microsoft services, because they tend to be slow and buggy.
They also remove features from their products too often, and I have no desire to migrate working systems just because MS wants to move people to other products.
And these things have been getting worse recently. Power Automate is the prime example for me: theoretically a neat tool that is well integrated into the cloud landscape; practically, you cannot implement reliable workflows with it, for numerous reasons.
> If you’re running production workloads on Azure or relying on it for mission-critical systems, this story matters more than you think.
Well, it doesn't explode, but I really question how reliable some of these systems are. In my experience: not very. There was, or is, some genuinely good engineering beneath some of these systems, but all the buggy fluff built on top of it introduces real friction.
Perhaps an important question is: why is it not incentivized in corporate environments?
I think, however, that perhaps I'm asking in the wrong arena. Unless there are people reading this who work in a corporate environment at the level where those decisions are made, it would really amount to guessing and stereotypes. Generally, I like to think that just about anyone can grasp that a well-made product will sell better by its nature. There must be some kind of mutual disconnect where one side continues to see improvements as important, and the other fundamentally does not (or has no functional means to measure and verify them).
It's a cool talent filter, though. If you're hiring, the set of people who quit doomed projects, and how fast they quit, is a really great indicator of their skill at evaluating technology.
In a product where a customer has to apply updates (or at least be aware of them), it's easier to excite them about new features than about bug fixes.
Especially for winning over new customers.
If the changelog for a product's last 5 releases is nothing but bug fixes (or worse, "refactoring" that isn't externally visible), most will assume either that development is dead or that the product is horribly bug-ridden - a bad look either way.
> This isn't incentivized in corporate environment.
Of course it is. But only by the winners, who reward the employees who do the valuable work. Microsoft has all sorts of stupid reasons why they have lots of customers - all basically proxies for their customers' IT staff being used to administering Microsoft-based systems - but if they mess up the core reasons to use a cloud badly enough, they will fail.
No joke: I worked at a place where, in our copy of the system headers, we had to #define near and far to nothing. That was because (despite not having supported any system where this was applicable for more than a decade) there was a set of files, considered too risky to change, that still had DOS-style near and far pointers, which we had to compile for a more sane linear address space. https://www.geeksforgeeks.org/c/what-are-near-far-and-huge-p...
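(For anyone who never had the pleasure: near and far were segmented-memory pointer qualifiers from the 16-bit x86 days. The shim described above presumably amounted to something like this in the private header copy. A reconstruction, not the actual code:)

    // Private copy of a system header (hypothetical reconstruction):
    // erase the obsolete segmented-memory qualifiers so that ancient
    // declarations compile for a flat linear address space.
    #define near
    #define far

    // Untouchable legacy declarations like this one keep compiling,
    // because 'far' now expands to nothing:
    char far *legacy_buffer;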
Now, I'm just a simple country engineer, but a sane take on risk management probably doesn't prefer de facto editing files by hijacking keywords with preprocessor magic over, you know, just making the actual change, reviewing it, and checking it in.
I was once in such a position. I persuaded management to first cover the entire project with an extensive test suite before touching anything. It took us around 3 months to get "good" coverage, and then we started refactoring the parts that were 100% covered. 5 months in, the shareholders got impatient and demanded "results". We were not ready yet, and in their minds we were doing nothing. No amount of explanation helped; they thought we were just adding superficial work ("the project worked before and we were shipping new features! Maybe you are just not skilled enough?"). Eventually they decided to scrap the whole thing. The project was killed and the entire team sacked.
Beware this goal. I'm dealing with the consequences of TDD taken way too far right now. Someone apparently had this same idea.
> management who do not fully understand the problem nor are incentivized to understand it
They are definitely incentivized to understand the problem. However, the developers often take it upon themselves to deceive management; that happens to be their incentive. The longer they can hoodwink leadership, the longer they can pad their resumes and otherwise play around in corporate Narnia.
It's amazing how far you can bullshit leaders under the pretense of how proper and cultured things like TDD are. There are compelling metrics and it has a very number-go-up feel to it. It's really easy to pervert all other aspects of the design such that they serve at the altar of TDD.
Integration testing is the only testing that matters to the customer. No one cares if your user service works flawlessly with fake everything plugged into it. I've never seen it not come off like someone playing SimCity or Factorio with the codebase in the end.
> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs
The exact same approach is recommended in the book "Working Effectively with Legacy Code" by Michael Feathers, along with several techniques for how to do it. He defines legacy code simply as "code with no tests".
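(For the unfamiliar, Feathers' first move is the "characterization test": before changing anything, write tests that pin down what the code currently does, right or wrong. A minimal hypothetical sketch:)

    #include <cassert>

    // Hypothetical legacy function whose intended behavior nobody remembers.
    int legacy_round_to_ten(int cents) {
        return (cents / 10) * 10;   // turns out it truncates rather than rounds
    }

    int main() {
        // Characterization tests assert what the code DOES, not what a spec
        // says it should do. They turn "fear of breaking something" into a
        // concrete signal: any behavior change now fails a named test.
        assert(legacy_round_to_ten(19) == 10);  // documents the truncation
        assert(legacy_round_to_ten(20) == 20);
        assert(legacy_round_to_ten(-5) == 0);   // integer division toward zero
        return 0;
    }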
Those codebases are so hard to test, too. A lot of the time there's IO and implicit state threaded through the code. Even getting testing in place at all, let alone good testing, is often an incredibly difficult task. And no one will refactor the code to make testing easier, because they're too afraid of breaking it.
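(The usual escape hatch for exactly this IO problem, also from Feathers, is to introduce a "seam": move the IO behind an interface with the smallest possible edit, so the logic becomes testable without touching anything else. A hypothetical sketch, with made-up names:)

    #include <string>

    // Before (untestable): the network call is hard-wired into the logic.
    //   bool is_over_quota(const std::string& user) {
    //       long used = query_billing_service(user);   // real network IO
    //       return used > 1000000;
    //   }

    // After: the IO sits behind a seam that tests can substitute.
    struct UsageSource {
        virtual long bytes_used(const std::string& user) = 0;
        virtual ~UsageSource() = default;
    };

    bool is_over_quota(const std::string& user, UsageSource& usage) {
        return usage.bytes_used(user) > 1000000;
    }

    // Test double: deterministic, no network, no implicit state.
    struct FakeUsage : UsageSource {
        long value;
        explicit FakeUsage(long v) : value(v) {}
        long bytes_used(const std::string&) override { return value; }
    };

    int main() {
        FakeUsage over(2000000), under(10);
        return (is_over_quota("alice", over) && !is_over_quota("bob", under)) ? 0 : 1;
    }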
Language choices won't save you here. The problem is organizational paralysis. Someone sees that the platform is unstable. They demand something be done to improve stability. The next management layer above them demands they reduce the number of changes made to improve stability.
Usually this results in approvals to approve the approval to approve making the change. Everyone signed off on a tower of tax forms about the change, no way it can fail now! It failed? We need another layer of approvals before changes can be made!
Hence the rewrite-it-in-Rust initiative, presumably. Management was aware of this problem at some level but chose a questionable solution. I don't think rewriting everything in Rust is at all compatible with their feature timelines or their severe shortage of systems programming talent.
Once you reach this stage, honestly the only escape is real escape. Put your papers in and start looking for a job elsewhere, because when they go down, they will go down hard and drag you with them. It's not like you didn't try.
Though this doesn't make much sense on its face: a bug means something is already broken, and he tells of millions of crashes per month, so it was visibly broken. A 100% chance of being broken (the known bug) outweighs some chance of breakage from fixing it.
(Sure, the cost of the current bug and of a potential new one isn't weighed here, but neither is it in "afraid to break something, do nothing".)
I've experienced a nearly identical scenario where a large fleet of identical servers (Citrix session hosts) were crashing at a "rate" high enough that I had to "scale up" my crash dump collection scripts with automated analysis, distribution into about a hundred buckets, and then per-bucket statistical analysis of the variables. I had to compress, archive, and then simply throw away crash dumps because I had too many.
It was pure insanity, the crashes were variously caused by things like network drivers so old and vulnerable that "drive by" network scans by malware would BSOD the servers. Alternatively, successful virus infections would BSOD the servers because the viruses were written for desktop editions of Windows and couldn't handle the differences in the server edition, so they'd just crash the system. On and on. It was a shambling zombie horde, not a server farm.
I was made to jump through flaming hoops backwards to prove beyond a shadow of a doubt that every single individual critical Microsoft security patch a) definitely fixed one of the crash bugs and b) didn't break any apps.
I did so! I demonstrated a 3x improvement in overall performance -- which by itself is staggering -- and that BSODs dropped by a factor of hundreds. I had pages written up on each and every patch, specifically calling out how they precisely matched a bucket of BSODs exactly. I tested the apps. I showed that some of them that were broken before suddenly started working. I did extensive UAT, etc.
"No." was the firm answer from management.
"Too dangerous! Something could break! You don't know what these patches could do!" etc, etc. The arguments were pure insanity, totally illogical, counter to all available evidence, and motived only by animal fear. These people had been burned before, and they're never touching the stove again, or even going into the kitchen.
You cannot fix an organisation like this "from below" as an IC, or even a mid-level manager. CEOs would have a hard time turning a ship like this around. Heads would have to roll, all the way up to CIO, before anything could possibly be fixed.
> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features.
Isn't this where Oracle is with their DB? Wasn't HN complaining about that?
I don't know if any of this is true, but as a user of Azure every day this would explain so much.
The Azure UI feels like a janky mess, barely held together. The documentation is obviously entirely written by AI and is constantly out of date or wrong. They offer such a huge volume of services that it's nearly impossible to figure out which service you actually want/need without consultants, and when you finally get the services up, who knows if they actually work as advertised.
I'm honestly shocked anything manages to stay working at all.
A businessman at a prior employer, sympathetic to my younger, naive "Microsoft sucks" attitude, told me something I remember to this day:
Microsoft is not a software company; they have never been experts at software. They are experts at contracts. They lead because their business machine excels at understanding how to tick the boxes necessary to win contract bids. The people who make purchasing decisions at companies aren't technical, and possibly don't even know a world outside Microsoft, Office, and Windows, after all.
This is how the sausage is made in the business world, and it changed how I perceived the tech industry. Good software (sadly) doesn't matter. Sales does.
This is why most of Norway currently runs on Azure, even though it is garbage, and even though every engineer I know who uses it says it is garbage. Because the people in the know don't get to make the decision.
What are we reading here? These are extraordinary statements, made with apparent credibility. They sound reasonable. Is this a whistleblower, or an ex-employee with a grudge? The appearance is the first. Is it? They've put their name to some clear and worrying statements.
> On January 7, 2025… I sent a more concise executive summary to the CEO. … When those communications produced no acknowledgment, I took the customary step of writing to the Board through the corporate secretary.
Why is that customary? I have not come across it, and though I have seen situations of some concern in the past, I have little experience with US corporate norms. What is normal here for this level of concern?
More to the point, why is this a public post and not a court case for wrongful termination?
Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?
The post is so dramatized, and so clearly written by someone with a grudge, that it really detracts from whatever point is being made, if there is one.
From another former Azure eng, now elsewhere and still working on big systems: the post gets way, way more boring when you realize that "Principal Group Manager" is just an M2, and Principal in general is roughly a Google L6 (maybe even L5) equivalent. Similarly, a Sev2 is hardly notable for anyone actually working on the foundational infra. There are certainly problems in Azure, but it's huge, and rough edges are to be expected. It mostly marches on. IMO maturity is realizing this and working within the system to improve it, rather than laying out all the dirty laundry for an Internet audience that will undoubtedly lap it up and happily cry Microslop.
Last thing: the final part 6 comes off as really childish. Risks to national security and sending letters to the board, really? Azure is still chugging along despite everything mentioned. People come in all the time crying that everything is broken and needs to be scrapped and rewritten, but it's hardly ever true.
> Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025—most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.
This is what people should know when seeing massive layoffs due to AI.
A previous colleague of mine has to work with Azure day to day, and everything explained in this article makes a lot of sense given the massive rants about the platform I hear from them.
12 years ago I had to choose whether to specialize in AWS, GCP, or Azure, and from my very brief foray with Azure I could see it was an absolute mess of broken, slow, click-ops methodology. This article confirms my suspicions from that time, and my colleague's experience.
> The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
The personal account makes a lot of sense, although I can easily see why the OP was not successful. Even if you are an excellent engineer, getting people to do things, accept ideas, and in general hear you out requires a completely different skill: being a good communicator.
The second thing is that this series of blog posts (whether true or not, it is at least believable) provides a good introduction for vibe coders: people who have not written a single line of code themselves and have not worked on any system at scale, yet believe that coding is somehow magically "solved" thanks to LLMs.
Writing the actual code itself (fully or partially), maybe yes. But understanding the complexity of the system, and working within the organisational structures that support it, is a completely different ball game.
This reads as pretty bad, and I believe it was. I worked on (and was at least partly responsible for) systems that do the same thing he describes. It took constant force of will, fighting, escalation, etc. to hold the line and maintain some basic level of stability and engineering practice.
And I've worked other places that had problems similar to the core problems described, not quite as severe, and not at the same scale, but bad enough to doom them (IMO) to a death loop they won't recover from.
I had the misfortune of having to use Azure back in 2018 and was appalled at the lack of quality and the slowness. I was in GitHub forums helping other customers who were suffering from missing basic functionality and incredible prices for abysmal performance. This article honestly explains a lot.
Google’s Cloud feels like the best engineered one, though lack of proper human support is worrying there compared to AWS.
People who can "reduce" a big system to build on a few simple concepts are few and far between. Most people just add more stuff instead.
- Rich Hickey
Loyalty will often not be rewarded; most of us have seen companies purge decade-tenured senior staff a year before going public.
It is very easy to become cynical about the mythology of silicon valley. =3
You're in an unwinnable position. Don't take the brunt of management's mistakes. Don't try to fix what you have no agency over.
> I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something.
And that, my friends, is why you want a memory-safe language, with as many static guarantees as possible checked automatically by the compiler.
Some things are beyond your control and capabilities.
Is Microsoft committing accounting fraud?
https://x.com/DaveManouchehri/status/2037001748489949388
Nobody seems to care.
and: "I also see I have 2 instances of Outlook, and neither of those are working." -Artemis II astronaut
> The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
That is quite scary