How NASA built Artemis II’s fault-tolerant computer (cacm.acm.org)

by speckx 235 comments 645 points


[−] dmk 35d ago
The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now.
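To make the contrast concrete, here's a toy sketch of what "strict frame scheduling" means: every message owns a fixed slot in a repeating cycle, so worst-case latency is known at design time. This is purely my own illustration (real TTEthernet enforces the schedule in switch hardware; the message names are made up):

```python
# Toy time-triggered schedule: each message gets a fixed offset in a
# repeating communication cycle, so bus access is deterministic.
CYCLE_US = 1000  # one communication cycle, in microseconds

# Static schedule fixed at design time: (offset_us, message_name)
SCHEDULE = [
    (0,   "imu_data"),
    (250, "actuator_cmd"),
    (500, "health_status"),
    (750, "sync_frame"),
]

def slot_for(time_us):
    """Return which message owns the bus at a given point in the cycle."""
    phase = time_us % CYCLE_US
    owner = None
    for offset, name in SCHEDULE:
        if phase >= offset:
            owner = name
    return owner
```

The point is that nothing is decided at runtime: contention, retries, and queueing simply don't exist, which is what makes WCET analysis tractable.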
[−] carefree-bob 35d ago
During the time of the first Apollo missions, a dominant portion of computing research was funded by the defense department and related arms of government, making this type of deterministic and WCET (worst case execution time) a dominant computing paradigm. Now that we have a huge free market for things like online shopping and social media, this is a bit of a neglected field and suffers from poor investment and mindshare, but I think it's still a fascinating field with some really interesting algorithms -- check out the work of Frank Mueller or Johann Blieberger.
[−] ehnto 35d ago
It still lives on as a bit of a hard skill in automotive/robotics. As someone who crosses the divide between enterprise web software, and hacking about with embedded automotive bits, I don't really lament that we're not using WCET and Real Time OSes in web applications!
[−] nnevod 35d ago
I suppose the rough edges of RTOSes are mostly due to that mainstream neglect - they are specific tools for seasoned professionals, whose own edges have been worn into shapes well-compatible with existing RTOSes.
[−] iririririr 35d ago
if you ever worked in automotive you know it's BS.

since CAN, all that reliability and predictability went out the window. we now have redundancy everywhere, with everything just rebooting all the time.

install an aftermarket radio and your ECU will probably reboot every time you press play or something. and that's just "normal".

[−] neuralRiot 35d ago
I’ve been working in automotive since it was only wires and never saw that (or noticed it) happening, especially since body and powertrain usually run on separate buses tied through a gateway. The crazy stuff happens when people start treating the bus (especially the higher-speed ones) like a 12V line, or worse.
[−] ehnto 33d ago
I didn't experience that, but the commercial stuff I worked on was in a heavy industry on J1939, and our bus was isolated from the vehicle to some degree.

Then the stuff I mess with at home is 90s era CAN and it's basically all diagnostics, actually I think these particular cars don't do any control over the bus.

[−] budman1 35d ago
ever use wordstar on Z80 system with a 5 MB hard drive?

responsive. everything dealing with user interaction is fast. sure, reading a 1 MB document took time, but 'up 4 lines' was bam!.

linux ought to be this good, but the I/O subsystem slows down responsiveness. it should be possible to copy a file to a USB drive without hurting typing response, but it is not. real-time patches used to improve it.

windows has always been terrible.

what is my point? well, i think a web stack run under an RTOS (and sized appropriately) might be a much more pleasurable experience. get rid of all those lags, intermittent hangs, and calls for more GB of memory.

QNX is also a good example of an RTOS that can be used as a desktop. Although an example with a lot of political and business problems.

[−] coryrc 35d ago
Every single hardware subsystem adds lag. Double buffering adds a frame of lag; some systems do triple-buffering. USB adds ~8ms worst-case. LCD TVs add their own multi-frame lag-inducing processing, but even the ones that don't still have to load the entire frame before any of it shows, which can be a substantial fraction of the time between frames.

Those old systems were "racing the beam", generating every pixel as it was being displayed. Minimum lag was microseconds. With LCDs you can't get under milliseconds. Luckily human visual perception isn't /that/ great, so single-digit milliseconds could pass as instantaneous, if you run at 100 Hz without double-buffering (is that even possible anymore!?), use a low-latency keyboard (IIRC you can schedule more frequent USB frames at higher speeds), and only debounce on key release.
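Adding up the stages above gives a feel for why modern pipelines can't compete with racing the beam. The numbers here are my own ballpark assumptions, not measurements:

```python
# Rough end-to-end input-latency budget for a typical 60 Hz desktop.
# Each figure is an assumed ballpark, just to show how the stages add up.
MS = 1.0
budget = {
    "usb_polling":      8 * MS,         # worst case at the default 125 Hz poll rate
    "double_buffering": 1000 / 60,      # one extra frame of buffering at 60 Hz
    "lcd_scanout":      1000 / 60 / 2,  # ~half a frame until mid-screen pixels update
    "panel_response":   5 * MS,         # typical LCD pixel transition time
}

total = sum(budget.values())
print(f"total ≈ {total:.1f} ms")
```

Even before the OS or application does any work, the hardware pipeline alone eats a few dozen milliseconds - three orders of magnitude above the beam-racing era's microseconds.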

[−] wincy 35d ago
An 8kHz polling-rate mouse and keyboard with a 240Hz 4K monitor (preferably OLED to reduce smearing, or it becomes very noticeable), 360Hz 1440p, or 480Hz 1080p is the current state of the art. You need a decent processor and GPU to run all this (especially for the high-refresh-rate monitors, since you're pushing a huge amount of data to your display and only the newest GPUs support the newest DisplayPort standard), but my Windows desktop is a joy to use because of it. Everything is super snappy. Alternatively, buying an iPad Pro is another excellent way to get very low latencies out of the box.

I really love this blog post from Dan Luu about latency. https://danluu.com/input-lag/

[−] mrheosuper 35d ago
I believe this is a kind of survivorship bias. It's very rare that RTOSes have to handle allocating GBs of data or creating thousands of processes. I think if current RTOSes ran the same applications, there would be no noticeable difference compared to a mainstream OS (it could even be worse, because the RTOS is not designed for that kind of use case).
[−] PunchyHamster 35d ago

>what is my point? well, i think a web stack ran under an RTOS (and sized appropriately) might be a much more pleasurable experience. Get rid of all those lags, and intermittent hangs and calls for more GB of memory.

... it's not the OS that's the source of the majority of lag.

Click around in this demo https://tracy.nereid.pl/ Note how basically any lag added is just some fancy animations in places and most of everything changes near instantly on user interaction (with biggest "lag" being acting on mouse key release as is tradition, not click, for some stuff like buttons).

This is still just a browser, but running code and displaying it directly instead of going through all the JS and DOM mess.

[−] sigbottle 35d ago

> making this type of deterministic and WCET (worst case execution time) a dominant computing paradigm.

Oh wow, really? I never knew that. huh.

I feel like as I grow older, the more I start to appreciate history. Curse my naive younger self! (Well, to be fair, I don't know if I would've learned history like that in school...)

[−] therobots927 35d ago
Contrary to propaganda from the likes of Ludwig von Mises, the free market is not some kind of optimal solution to all of our problems. And it certainly does not produce excellent software.
[−] ggm 35d ago
Time triggered Ethernet is part of aircraft certified data bus and has a deep, decades long history. I believe INRIA did work on this, feeding Airbus maybe. It makes perfect sense when you can design for it. An aircraft is a bounded problem space of inputs and outputs which can have deterministic required minima and then you can build for it, and hopefully even have headroom for extras.

Ethernet is such a misnomer for something which now is innately about a switching core ASIC or special purpose hardware, and direct (optical even) connects to a device.

I'm sure there are also buses, dual redundant, master/slave failover, you name it. And given it's air or space probably a clockwork backup with a squirrel.

[−] arduanika 35d ago
You could even say that part of the value of Artemis is that we're remembering how to do some very hard things, including the software side. This is something that you can't fake. In a world where one of the more plausible threats of AI is the atrophy of real human skills -- the goose that lays the golden eggs that trains the models -- this is a software feat where I'd claim you couldn't rely on vibe code, at least not fully.

That alone is worth my tax dollars.

[−] dyauspitr 35d ago
Agile is not meant to make solid, robust products. It’s so you can make product fragments/iterations quickly, with okay quality and out to the customer asap to maximize profits.
[−] iknowstuff 35d ago
Tesla’s Cybertruck uses that in its ethernet as well!
[−] anymouse123456 35d ago
Some of us still work on embedded systems with real-time guarantees.

Believe it or not, at least some of those modern practices (unit testing, CI, etc) do make a big (positive) difference there.

[−] globnomulous 33d ago
Microsoft fired all their QA people ten or fifteen years ago. I'd imagine it's a similar story: boxed software needed much higher guarantees of correctness. Digital delivery leaves much more room for error, because it leaves room for easier, cheaper fixes.
[−] tayk47999 35d ago

> “Modern Agile and DevOps approaches prioritize iteration, which can challenge architectural discipline,” Riley explained. “As a result, technical debt accumulates, and maintainability and system resiliency suffer.”

Not sure I agree with the premise that "doing agile" implies decision making at odds with architecture: you can still iterate on architecture. Terraform etc. make that very easy. Sure, tech debt accumulates naturally as a byproduct, but every team I've been on regularly does dedicated tech debt sprints.

I don't think the average CRUD API or app needs "perfect determinism", as long as modifications are idempotent.

[−] georgehm 35d ago

> Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.

> “A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. This approach simplifies the complex task of the triplex “voting” mechanism that compares results. Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven’t failed silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.

One part that seems omitted from the explanation: what happens if both CPUs in a pair perform an erroneous calculation and their results match? How would that source be silenced without comparing its results against the other sources?
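The selection logic the article describes is simple enough to sketch. This is my own illustration of the idea (names and structure are made up, not NASA's code); `None` stands for a channel that has failed silent:

```python
# Priority-ordered source selection among fail-silent channels:
# take the output of the first channel that is still talking.
def select_source(channel_outputs):
    """channel_outputs: list ordered by priority; None means failed-silent."""
    for output in channel_outputs:
        if output is not None:
            return output
    raise RuntimeError("all flight computer modules failed silent")

# FCM 1 has gone silent, so the selector falls through to FCM 2.
assert select_source([None, 42.0, 42.0, 42.0]) == 42.0
```

Which also makes the question above concrete: the selector never compares values across channels, so a pair that agrees on a *wrong* answer would sail straight through - the whole scheme rests on the self-checking pair catching the fault internally.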

[−] __d 35d ago
Does anyone have pointers to some real information about this system? CPUs, RAM, storage, the networking, what OS, what language used for the software, etc etc?

I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.

[−] TonyAlicea10 35d ago
When I was first starting out as a professional developer 25 years ago doing web development, I had a friend who had retired from NASA and had worked on Apollo.

I asked him “how did you deal with bugs”? He chuckled and said “we didn’t have them”.

The average modern AI-prompting, React-using web developer could not fathom making software that killed people if it failed. We’ve normalized things not working well.

[−] y1n0 35d ago
NASA didn't build this, Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does.
[−] geomark 35d ago
I sure wish they would talk about the hardware. I spent a few years developing a radiation hardened fault tolerant computer back in the day. Adding redundancy at multiple levels was the usual solution. But there is another clever check on transient errors during process execution that we implemented that didn't involve any redundancy. Doesn't seem like they did anything like that. But can't tell since they don't mention the processor(s) they used.
[−] eggy 35d ago
Some related good books I have been studying the past few years or so. The Spark book is written by people who've worked on Cube sats:

  * Logical Foundations of Cyber-Physical Systems

  * Building High Integrity Applications with SPARK 

  * Analysable Real-Time Systems: Programmed in Ada

  * Control Systems Safety Evaluation and Reliability (William M. Goble)

I am developing a high-integrity controls system for a prototype hoist to be certified for overhead hoisting with the highest safety standards, targeting aerospace, construction, entertainment, and defense.
[−] albertzeyer 35d ago
I'm curious: In the current moon flyby, how often did some of these fallback methods get active? Was the BFS ever in control at any point? How many bitflips were there during the flight so far?
[−] nickpsecurity 35d ago
The ARINC scheduler, RTOS, and redundancy have been used in safety-critical systems for decades; ARINC goes back to the '90s. Most safety-critical microkernels, like INTEGRITY-178B and LynxOS-178B, came with a layer for that.

Their redundancy architecture is interesting. I'd be curious what innovations went into rad-hard fabrication, too. The Sandia Secure Processor (aka Score) was a neat example of a rad-hard, secure processor.

Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.

[−] rurban 34d ago
Raft consensus with pairs? I smell bulls*t there. Even when they say it's 8, it boils down to pair-wise checks, without any consensus. Just the consensus of wrong.

Also https://en.wikipedia.org/wiki/TTEthernet looks like bolting time-guaranteed switching networks onto randomizing ethernet hardware. Sounds incredibly cheap and stupid. Either stay with guaranteed real-time switching, or give up on hard real-time guarantees and favor performance, simplicity and cheap stock hardware.

Monkeys in space.

[−] dojopico 35d ago
I did VOS and database performance stuff at Stratus from 1989-95. Stratus was the hardware fault tolerant company. Tandem, our arch rivals, did software fault tolerance. Our architecture was “pair and spare”. Each board had redundant everything and was paired with a second board. Every pin out was compared on every tick. Boards that could not reset called home. The switch from Motorola 68K to Intel was a nightmare for the hardware group because some instructions had unused pins that could float.
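The "pair and spare" idea is easy to model in a few lines. This is a toy sketch of the concept, not Stratus's implementation: two lockstep CPUs per board compare outputs every tick, any mismatch makes the board go silent, and the paired spare takes over:

```python
# Toy "pair and spare": a board is two lockstep CPUs; disagreement
# between them makes the board fail-stop (output None) instead of lying.
def board_output(cpu_a, cpu_b):
    """Return the board's output, or None (silent) if the pair disagrees."""
    return cpu_a if cpu_a == cpu_b else None

def system_output(primary, spare):
    """Prefer the primary board; fall back to the spare if it went silent."""
    out = board_output(*primary)
    if out is None:
        out = board_output(*spare)
    return out

# A bit flip in one CPU of the primary pair: the primary goes silent
# and the spare board serves the answer.
assert system_output(primary=(99, 100), spare=(100, 100)) == 100
```

The appeal is that fault *detection* needs no knowledge of what the right answer is - only that two supposedly identical computations diverged.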
[−] gambiting 35d ago
So honest and perhaps a bit stupid question.

Astronauts have actual phones with them - iPhone 17s, I think? And a regular ThinkPad that they use to upload photos from the cameras. How does all of that equipment work fine with all the cosmic radiation floating about? With the iPhone's CPU in particular, shouldn't random bit flips be causing constant crashes due to errors? Or do these errors happen but nothing detects them, so execution continues unhindered?

[−] jbritton 35d ago
I wonder how often problems happen that the redundancy solves. Is radiation actually flipping bits, and at what frequency? Can a solar flare cause all the computers to go haywire?
[−] starkparker 36d ago
Headline needs its how-dectomy reverted to make sense
[−] Schlagbohrer 35d ago
"High-performance supercomputers are used for large-scale fault injection, emulating entire flight timelines where catastrophic hardware failures are introduced to see if the software can successfully ‘fail silent’ and recover."

I assume this means they are using a digital twin simulation inside the HPC?
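Presumably something like that - at minimum, fault injection means corrupting state mid-run and checking that the fail-silent machinery notices. A minimal toy flavor of the idea (my own illustration, nothing like the real HPC campaigns):

```python
import random

# Toy fault injection: run a computation twice, flip a random bit in one
# copy, and verify the mismatch is caught ("fail silent" -> None).
def compute_with_check(x, inject_fault=False):
    """Duplicate the computation; any disagreement means go silent (None)."""
    a = x * x
    b = x * x
    if inject_fault:
        a ^= 1 << random.randrange(16)  # simulate a radiation bit flip in one copy
    return a if a == b else None  # disagree -> silent, never a wrong answer

assert compute_with_check(7) == 49
assert compute_with_check(7, inject_fault=True) is None
```

Scale that loop up to injected failures across entire emulated flight timelines and you need the supercomputers the article mentions.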

[−] dom111 35d ago
I always wondered if the "radiation hardening" approaches of the challenges like this https://codegolf.stackexchange.com/questions/57257/radiation... (see the tag for more https://codegolf.stackexchange.com/questions/tagged/radiatio...) would be of any practical use... I assume not, as the problem is on too many levels, but still, seems at least tangentially relevant!
[−] kev009 35d ago
Some people are claiming it's the good old RAD750 variant. Is there anything that talks about the actual computer architecture? The linked article is desperately devoid of technical details.
[−] object-a 35d ago
How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything
[−] JumpCrisscross 35d ago
Does anyone know how this compares to Crew Dragon or HLS?
[−] vhiremath4 35d ago

> “Along with physically redundant wires, we have logically redundant network planes. We have redundant flight computers. All this is in place to cover for a hardware failure.”

It would be really cool to see a visualization of redundancy measures/utilization over the course of the trip to get a more tangible feel for its importance. I'm hoping a bunch of interesting data is made public after this mission!

[−] guenthert 35d ago
Multiple and dissimilar redundancy is nice and all, but is there a manual override? Apollo could be flown manually (and at least on Apollo 11 and 13 it had to be), but is this still possible and feasible? I'd guess so, as it's still manned by (former) test pilots, much like Apollo.
[−] lrvick 35d ago
NASA describes some impressive work for runtime integrity, but the lack of mention of build-time security is surprising.

I would expect to see multi-party-signed deterministic builds etc. Anyone have any insight here?

[−] ck2 35d ago
if I remember correctly, the space shuttle had four computers that all did the same processing, and a fifth that decided what the correct answer was if they didn't all match or some went down

can't find a wikipedia article on it but the times had an article in 1981

https://www.nytimes.com/1981/04/10/us/computers-to-have-the-...

apparently the 5th was standby, not the decider
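For the four voting machines, majority voting is straightforward to sketch. A simplified illustration of the idea (not the actual flight code - the real fifth computer ran dissimilar backup software and stood by):

```python
from collections import Counter

# Majority voting among redundant computers: accept an answer only if a
# strict majority of the channels agree on it.
def vote(outputs):
    """Return the majority answer, or None if no strict majority exists."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

assert vote([1, 1, 1, 2]) == 1     # one faulty computer is outvoted
assert vote([1, 1, 2, 2]) is None  # split vote: no safe answer
```

The split-vote case is exactly what Artemis's fail-silent design sidesteps: if faulty channels silence themselves, there is never a tie to break.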

[−] bharat1010 35d ago
The part about triple-redundant voting systems genuinely blew my mind — it's such a different world from how most of us write software day to day, and honestly kind of humbling.
[−] PunchyHamster 35d ago
I wonder how they made the voted-answer-picker fail-resistant
[−] spaceman123 35d ago
Probably same way they’ve built fault-tolerant toilet.
[−] stevepotter 35d ago
It would be nice to see some of the software source. I’m super interested and i think I helped pay for it
[−] SeanAnderson 35d ago
Typo in the first sentence of the first paragraph is oddly comforting since AI wouldn't make such a typo, heh.

Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little.

[−] pbronez 35d ago
The Artemis computer handles way more flight functions than Apollo did. What are the practical benefits of that?

This electrify-and-integrate playbook has brought benefits to many industries, usually where better coordination unlocks efficiencies. Sometimes the smarts just add new failure modes and predatory vendor relationships. In space it's showing up as more modular spacecraft, lower costs, and more mission flexibility. But how is this playing out in manned spacecraft?

[−] 0xblinq 35d ago
They should have also built a fault-tolerant toilet.
[−] RobRivera 35d ago
2 outlooks.

2.

Two.

[−] hulitu 35d ago
They run 2 Outlook instances. For redundancy. /s
[−] GautamB13 35d ago
It's kinda crazy how this mission didn't hit mainstream media until recently.