The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now.
During the time of the first Apollo missions, a dominant portion of computing research was funded by the defense department and related arms of government, making this type of deterministic and WCET (worst case execution time) a dominant computing paradigm. Now that we have a huge free market for things like online shopping and social media, this is a bit of a neglected field and suffers from poor investment and mindshare, but I think it's still a fascinating field with some really interesting algorithms -- check out the work of Frank Mueller or Johann Blieberger.
It still lives on as a bit of a hard skill in automotive/robotics. As someone who crosses the divide between enterprise web software, and hacking about with embedded automotive bits, I don't really lament that we're not using WCET and Real Time OSes in web applications!
I suppose the rough edges of RTOSes are mostly due to that mainstream neglect: they are specific tools for seasoned professionals whose own edges have been dented into shapes compatible with the existing RTOSes.
I’ve been working in automotive since it was only wires and never saw (or noticed) that happening, especially since body and powertrain usually sit on separate buses tied together through a gateway. The crazy stuff happens when people start treating the bus (especially the higher-speed ones) like a 12 V line, or worse.
I didn't experience that but the commercial stuff I worked on was in a heavy industry on J1939, and our bus was isolated from the vehicle to some regard.
Then the stuff I mess with at home is 90s era CAN and it's basically all diagnostics, actually I think these particular cars don't do any control over the bus.
ever use wordstar on Z80 system with a 5 MB hard drive?
responsive. everything dealing with user interaction is fast. sure, reading a 1 MB document took time, but 'up 4 lines' was bam!.
linux ought to be this good, but the I/O subsystem slows down responsiveness. it should be possible to copy a file to a USB drive, and not impact good response from typing, but it is not. real time patches used to improve it.
windows has always been terrible.
what is my point? well, i think a web stack ran under an RTOS (and sized appropriately) might be a much more pleasurable experience. Get rid of all those lags, and intermittent hangs and calls for more GB of memory.
QNX is also a good example of an RTOS that can be used as a desktop. Although an example with a lot of political and business problems.
Every single hardware subsystem adds lag. Double buffering adds a frame of lag; some do triple-buffering. USB adds ~8ms worst-case. LCD TVs add their own multi-frame lag-inducing processing, but even the ones that don't still have to load the entire frame before any of it shows, which can be a substantial fraction of the time between frames.
Those old systems were "racing the beam", generating every pixel as it was being displayed. Minimum lag was microseconds. With LCDs you can't get under milliseconds. Luckily human visual perception isn't /that/ great so single-digit milliseconds could be as instantaneous, if you run at 100 Hz without double-buffering (is that even possible anymore!?) and use a low-latency keyboard (IIRC you can schedule more frequent USB frames at higher speeds) and only debounce on key release.
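To put rough numbers on the lag sources mentioned above, here is a back-of-the-envelope sketch in Python. The function names and the simple additive model (input polling delay, plus whole buffered frames, plus the time to deliver one frame to the panel) are assumptions made for illustration, not measurements of any real system.

```python
def frame_ms(refresh_hz: float) -> float:
    """Duration of one display frame in milliseconds."""
    return 1000.0 / refresh_hz


def worst_case_latency_ms(refresh_hz: float,
                          buffered_frames: int,
                          input_poll_ms: float,
                          scanout_fraction: float = 1.0) -> float:
    """Crude worst-case input-to-photon estimate (assumed additive model).

    buffered_frames  -- extra whole frames queued before display
                        (1 for double buffering, 2 for triple buffering)
    input_poll_ms    -- worst-case wait for the next input poll
    scanout_fraction -- share of a frame spent loading it into the panel
    """
    frame = frame_ms(refresh_hz)
    return input_poll_ms + buffered_frames * frame + scanout_fraction * frame


if __name__ == "__main__":
    # 60 Hz, double-buffered, ~8 ms worst-case USB wait: about 41.3 ms
    print(round(worst_case_latency_ms(60, 1, 8.0), 1))
    # 100 Hz, no extra buffering, 1 ms input poll: 11.0 ms
    print(worst_case_latency_ms(100, 0, 1.0))
```

Even this crude model shows that buffering and frame time dominate once input polling is fast, which is why higher refresh rates help so much.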
An 8 kHz polling rate mouse and keyboard plus a 240 Hz 4K monitor (preferably OLED to reduce smearing, which otherwise becomes very noticeable), 360 Hz 1440p, or 480 Hz 1080p is the current state of the art. You need a decent processor and GPU to run all this (especially for the high refresh rate monitors, as you’re pushing a huge amount of data to your display and only the newest GPUs support the newest DisplayPort standard), but my Windows desktop is a joy to use because of all of this. Everything is super snappy. Alternatively, buying an iPad Pro is another excellent way to get very low latencies out of the box.
I believe this is a kind of survivorship bias. It's very rare that RTOSes have to handle allocating GBs of data or creating thousands of processes. I think if current RTOSes ran the same applications, there would be no noticeable difference compared to a mainstream OS (it could even be worse, because the OS is not designed for that kind of use case).
>what is my point? well, i think a web stack ran under an RTOS (and sized appropriately) might be a much more pleasurable experience. Get rid of all those lags, and intermittent hangs and calls for more GB of memory.
... it's not the OS that's source of majority of lag
Click around in this demo https://tracy.nereid.pl/
Note how basically any lag added is just some fancy animations in places and most of everything changes near instantly on user interaction (with biggest "lag" being acting on mouse key release as is tradition, not click, for some stuff like buttons).
This is still just browser, but running code and displaying it directly instead of going thru all the JS and DOM mess
> making this type of deterministic and WCET (worst case execution time) a dominant computing paradigm.
Oh wow, really? I never knew that. huh.
I feel like as I grow older, the more I start to appreciate history. Curse my naive younger self! (Well, to be fair, I don't know if I would've learned history like that in school...)
Contrary to propaganda from the likes of Ludwig von Mises, the free market is not some kind of optimal solution to all of our problems. And it certainly does not produce excellent software.
Time triggered Ethernet is part of aircraft certified data bus and has a deep, decades long history. I believe INRIA did work on this, feeding Airbus maybe. It makes perfect sense when you can design for it. An aircraft is a bounded problem space of inputs and outputs which can have deterministic required minima and then you can build for it, and hopefully even have headroom for extras.
Ethernet is such a misnomer for something which now is innately about a switching core ASIC or special purpose hardware, and direct (optical even) connects to a device.
I'm sure there are also buses, dual redundant, master/slave failover, you name it. And given it's air or space probably a clockwork backup with a squirrel.
You could even say that part of the value of Artemis is that we're remembering how to do some very hard things, including the software side. This is something that you can't fake. In a world where one of the more plausible threats of AI is the atrophy of real human skills -- the goose that lays the golden eggs that trains the models -- this is a software feat where I'd claim you couldn't rely on vibe code, at least not fully.
Agile is not meant to make solid, robust products. It’s so you can make product fragments/iterations quickly, with okay quality and out to the customer asap to maximize profits.
Microsoft fired all QA people ten or fifteen years ago. I'd imagine it's a similar story: boxed software needed much higher guarantees of correctness. Digital delivery leaves much more room for error, because it leaves room for easier, cheaper fixes.
> “Modern Agile and DevOps approaches prioritize iteration, which can challenge architectural discipline,” Riley explained. “As a result, technical debt accumulates, and maintainability and system resiliency suffer.”
Not sure i agree with the premise that "doing agile" implies decision making at odds with architecture: you can still iterate on architecture. Terraform etc make that very easy. Sure, tech debt accumulates naturally as a byproduct, but every team i've been on regularly does dedicated tech debt sprints.
I don't think the average CRUD API or app needs "perfect determinism", as long as modifications are idempotent.
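As a minimal sketch of what "idempotent modifications" can mean for a CRUD-style service: the `Store` class, its method names, and the request-id scheme below are all hypothetical, chosen only to illustrate the idea that replaying the same request leaves the state unchanged.

```python
class Store:
    """Toy in-memory store (hypothetical) with idempotent writes."""

    def __init__(self) -> None:
        self.balances: dict[str, int] = {}
        self.seen_requests: set[str] = set()

    def set_balance(self, account: str, amount: int) -> None:
        # Naturally idempotent: repeating the call yields the same state.
        self.balances[account] = amount

    def deposit(self, account: str, amount: int, request_id: str) -> None:
        # An increment is not idempotent on its own, so replays are
        # filtered by a client-supplied request id (an assumed convention).
        if request_id in self.seen_requests:
            return
        self.seen_requests.add(request_id)
        self.balances[account] = self.balances.get(account, 0) + amount


if __name__ == "__main__":
    store = Store()
    store.deposit("alice", 10, request_id="r1")
    store.deposit("alice", 10, request_id="r1")  # retried request, ignored
    print(store.balances["alice"])
```

With this in place a client can safely retry after a timeout without knowing whether the first attempt landed, which is a much weaker requirement than the timing determinism discussed elsewhere in the thread.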
>Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.
>“A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. This approach simplifies the complex task of the triplex “voting” mechanism that compares results.
>Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven’t failed-silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.
One part that seems omitted from the explanation is what happens if both CPUs in a pair, for whatever reason, perform an erroneous calculation and their results match. How will that source be silenced without comparing its results with other sources?
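The scheme in the quoted passage can be sketched roughly as follows. This is a toy model based only on the article's description, not the actual flight software: `SelfCheckingPair`, `select_output`, and the lane functions are hypothetical names, and comparing two Python callables stands in for lockstep hardware comparison.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class SelfCheckingPair:
    """One flight computer module modelled as two lanes (assumed structure)."""
    name: str
    lane_a: Callable[[float], float]
    lane_b: Callable[[float], float]

    def output(self, x: float) -> Optional[float]:
        a = self.lane_a(x)
        b = self.lane_b(x)
        # Fail silent: on any disagreement, emit nothing rather than guess.
        return a if a == b else None


def select_output(fcms: Sequence[SelfCheckingPair],
                  x: float) -> Optional[Tuple[str, float]]:
    """Take the first non-silent module in priority order, with no voting."""
    for fcm in fcms:
        result = fcm.output(x)
        if result is not None:
            return fcm.name, result
    return None  # every module has gone silent


if __name__ == "__main__":
    good = lambda x: 2 * x
    bad = lambda x: 2 * x + 1  # stands in for a radiation-induced error

    fcms = [SelfCheckingPair("FCM1", good, bad),   # lanes disagree: silent
            SelfCheckingPair("FCM2", good, good)]
    print(select_output(fcms, 3.0))

    # The case raised above: both lanes wrong in the same way still match,
    # so this sketch passes the wrong answer through undetected.
    print(select_output([SelfCheckingPair("FCM1", bad, bad)], 3.0))
```

In this toy model a matching common-mode error is invisible to the pair, so the second call returns the wrong value. Whether and how the real system guards against that is not stated in the article.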
Does anyone have pointers to some real information about this system? CPUs, RAM, storage, the networking, what OS, what language used for the software, etc etc?
I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.
When I was first starting out as a professional developer 25 years ago doing web development, I had a friend who had retired from NASA and had worked on Apollo.
I asked him “how did you deal with bugs”? He chuckled and said “we didn’t have them”.
The average modern AI-prompting, React-using web developer could not fathom making software that killed people if it failed. We’ve normalized things not working well.
NASA didn't build this, Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does.
I sure wish they would talk about the hardware. I spent a few years developing a radiation hardened fault tolerant computer back in the day. Adding redundancy at multiple levels was the usual solution. But there is another clever check on transient errors during process execution that we implemented that didn't involve any redundancy. Doesn't seem like they did anything like that. But can't tell since they don't mention the processor(s) they used.
Some related good books I have been studying over the past few years. The SPARK book is written by people who've worked on CubeSats:
* Logical Foundations of Cyber-Physical Systems
* Building High Integrity Applications with SPARK
* Analysable Real-Time Systems: Programmed in Ada
* Control Systems Safety Evaluation and Reliability (William M. Goble)
I am developing a high-integrity controls system for a prototype hoist to be certified for overhead hoisting with the highest safety standards and targeting aerospace, construction, entertainment, and defense.
I'm curious: In the current moon flyby, how often did some of these fallback methods get active? Was the BFS ever in control at any point? How many bitflips were there during the flight so far?
The ARINC scheduler, RTOS, and redundancy have been used in safety-critical systems for decades; the ARINC scheduler goes back to the '90s. Most safety-critical microkernels, like INTEGRITY-178B and LynxOS-178B, came with a layer for that.
Their redundancy architecture is interesting. I'd be curious about what innovations went into rad-hard fabrication, too. Sandia Secure Processor (aka Score) was a neat example of rad-hard, secure processors.
Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.
Raft consensus with pairs? I smell bulls*t there. Even when they say it's 8, it boils down to pair-wise checks, without any consensus. Just the consensus of wrong.
Also https://en.wikipedia.org/wiki/TTEthernet looks like bolting time-guaranteed switching networks onto randomizing ethernet hardware. Sounds incredibly cheap and stupid. Either stay with guaranteed real-time switching, or give up on hard real-time guarantees and favor performance, simplicity and cheap stock hardware.
I did VOS and database performance stuff at Stratus from 1989-95. Stratus was the hardware fault tolerant company. Tandem, our arch rivals, did software fault tolerance. Our architecture was “pair and spare”. Each board had redundant everything and was paired with a second board. Every pin out was compared on every tick. Boards that could not reset called home. The switch from Motorola 68K to Intel was a nightmare for the hardware group because some instructions had unused pins that could float.
Astronauts have actual phones with them - iPhone 17s I think? And a regular Thinkpad that they use to upload photos from the cameras. How does all of that equipment work fine with all the cosmic radiation floating about? With the iPhone's CPU in particular, shouldn't random bit flips be causing constant crashes due to errors? Or is it simply that these errors happen but nothing really detects them so the execution continues unhindered?
I wonder how often problems happen that the redundancy solves. Is radiation actually flipping bits, and at what frequency? Can a solar flare cause all the computers to go haywire?
"High-performance supercomputers are used for large-scale fault injection, emulating entire flight timelines where catastrophic hardware failures are introduced to see if the software can successfully ‘fail silent’ and recover."
I assume this means they are using a digital twin simulation inside the HPC?
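For anyone wondering what fault injection looks like at toy scale, here is a small Python sketch: it flips one random bit in the result of one lane of a redundant pair and counts how often a bitwise comparison would catch it. The names `flip_bit` and `run_campaign` are hypothetical, and this says nothing about how the HPC campaigns in the article are actually built.

```python
import random
import struct
from typing import Callable


def float_bits(value: float) -> int:
    """Raw IEEE 754 bit pattern of a double, as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit (0..63) of its bit pattern inverted."""
    raw = float_bits(value) ^ (1 << bit)
    return struct.unpack("<d", struct.pack("<Q", raw))[0]


def run_campaign(compute: Callable[[float], float], x: float,
                 trials: int, seed: int = 0) -> int:
    """Inject one bit flip per trial into lane B and count the trials in
    which the pair's comparison would have made the module go silent."""
    rng = random.Random(seed)
    went_silent = 0
    for _ in range(trials):
        lane_a = compute(x)
        lane_b = flip_bit(compute(x), rng.randrange(64))
        # Compare bit patterns so that sign flips on zero are caught too.
        if float_bits(lane_a) != float_bits(lane_b):
            went_silent += 1
    return went_silent


if __name__ == "__main__":
    # Any single-bit upset in one lane changes the bit pattern, so every
    # trial in this toy setup is detected.
    print(run_campaign(lambda v: 2.0 * v, 3.0, trials=1000))
```

A realistic campaign would inject faults into memory, registers, and buses over an entire simulated flight timeline, which is presumably where the supercomputers come in.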
Some people are claiming it's the good old RAD750 variant. Is there anything that talks about the actual computer architecture? The linked article is desperately void of technical details.
How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything
> “Along with physically redundant wires, we have logically redundant network planes. We have redundant flight computers. All this is in place to cover for a hardware failure.”
It would be really cool to see a visualization of redundancy measures/utilization over the course of the trip to get a more tangible feel for its importance. I'm hoping a bunch of interesting data is made public after this mission!
Multiple and dissimilar redundancy is nice and all that, but is there a manual override? Apollo could be flown manually (and at least on Apollo 11 and 13 it had to be), but is this still possible and feasible? I'd guess so, as it's still manned by (former) test pilots, much like Apollo.
if I remember correctly the space shuttle had four computers that all did the same processing and a fifth that decided what was the correct answer if they all didn't match or some went down
can't find a wikipedia article on it but the times had an article in 1981
The part about triple-redundant voting systems genuinely blew my mind — it's such a different world from how most of us write software day to day, and honestly kind of humbling.
The Artemis computer handles way more flight functions than Apollo did. What are the practical benefits of that?
This electrify & integrate playbook has brought benefits to many industries, usually where better coordination unlocks efficiencies. Sometimes the smarts just add new failure modes and predatory vendor relationships. It’s showing up in space as more modular spacecraft, lower costs and more mission flexibility. But how is this playing out in manned space craft?
since CAN all reliability and predictive nature was out. we now have redundancy everywhere with everything just rebooting all the time.
install an aftermarket radio and your ecu will probably reboot every time you press play or something. and that's just "normal".
I really love this blog post from Dan Luu about latency. https://danluu.com/input-lag/
That alone is worth my tax dollars.
Believe it or not, at least some of those modern practices (unit testing, CI, etc) do make a big (positive) difference there.
Monkeys in space.
I would expect to see multi-party-signed deterministic builds etc. Anyone have any insight here?
https://www.nytimes.com/1981/04/10/us/computers-to-have-the-...
apparently the 5th was standby, not the decider
Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little.
2.
Two.