Simon Tatham, author of PuTTY, has quite a detailed blog post [0] on using C++20's coroutine system. And yep, it's a lot to do on your own; C++26 really ought to give us some pre-built templates/patterns/scaffolds.
I find C++ coroutines to be well-designed. Most of the complexity is intrinsic because it tries to be un-opinionated. It allows precise control and customization of almost every conceivable coroutine behavior while still adhering to the principle of zero-cost abstractions.
Most people would prefer opinionated libraries that allow them to not think about the design tradeoffs. The core implementation is targeted at efficient creation of opinionated abstractions rather than providing one. This is the right choice. Every opinionated abstraction is going to be poor for some applications.
C++ standards follow a tick-tock schedule for complex features.
For the tick, the core language gets an un-opinionated iteration of the feature that is meant for compiler developers and library writers to play with. (This is why we sometimes see production compilers lagging behind in features).
For the tock, we try to get the standard library improved with these features to a realistic extent, and also fix wrinkles in the primary idea.
This avoids the standard library having to rely on any compiler magic (languages like Swift are notorious for this), so in practice all libraries can leverage the language to the same extent.
This pattern has been broken in a few instances (std::initializer_list), and those have been widely considered to have been missteps.
Regarding your mention of compiler magic and Swift, I don’t know much about the language, but I have read a handful of discussions/blogs about the compiler and the techniques used for its implementation. One of the purported benefits/points of pride for Swift that stood out to me and that I still remember was something to the effect of Swift being fundamentally against features/abstractions/‘things’ being built in. In particular, they claimed that Swift does not have any literal types (ints, sized ints, bools, etc.) “built in” to the compiler; instead they are defined in the language.
I don’t doubt your point (I know enough about Swift’s generic resolution crapshow during semantic analysis to be justified in assuming the worst), but can you think of any areas worth looking into to expand on the compiler magic issues?
I have a near reflexive revulsion for the kinds of non-composability and destruction of principled, theoretically sound language design that tend to come from compiler magic and shortcuts, so I'm always looking for more reading to enrage myself.
I don’t know if the language is yours, but I think the wording and its intended meaning (the sentence starting with ‘The core implementation…’) may be one of the most concise statements of my personal programming language design ethos. I’m jealous that I didn’t come up with it. I will certainly credit you when I steal it for my WIP language.
I will be adding the following to my “Primary Design Criteria” list: The core design and implementation of any language feature is explicitly targeted at the efficient creation of opinionated, composable abstractions rather than providing those abstractions at the language level.
async is simply a difficult problem, and I think we'll find irreducible complexity there. Sometimes you are just doing 2 or 3 things at once and you need a hand-written state machine with good unit tests around it. Sometimes you can't just glue 3 happy paths together into CSP and call it a day.
Using structured concurrency [1] as introduced in Python Trio [2] genuinely does help write much simpler concurrent code.
Also, as noted in that Simon Tatham article, Python makes choices at the language level that you have to fuss over yourself in C++. Given how different Trio is from asyncio (the async library in Python's standard library), it seems to me that making some of those basic choices wasn't actually that restrictive, so I'd guess that a lot of C++'s async complexity isn't that necessary for the problem.
After I wrote the comment below I realized that it really is just an ‘um, actually…’ about discussing using concurrency vs implementing it. It’s probably not needed, but I do like my wording so I’m posting it for personal posterity.
In the context of an article about C++’s coroutines for building concurrency, I think structured concurrency is out of scope. Structured concurrency is an effective and reasonably efficient idiom for handling a substantial percentage of concurrent workloads (which, in light of your parent’s comment, is probably why you brought it up as a solution); however, C++ coroutines are pitched several levels of abstraction below where structured concurrency is implemented.
Additionally, there are the implementation requirements for Trio-style structured concurrency to function. I’m almost certain a garbage collector is not required, so that probably isn’t an issue, but the nurseries and the associated memory management are independent implementations that C++ will almost certainly never impose as a base requirement for concurrency. There are also some pretty effective cancelation strategies presumed in Trio which would also have to be positioned as requirements.
Not really a critique of the idiom, but I think it’s worth mentioning that a higher-level solution is not always applicable given a lower-level language feature’s expected usage, particularly when the feature is about implementing concurrency, as with C++ coroutines, rather than using it, as with Trio.
Good point. I did carefully say that Trio "introduced" structured concurrency, partly due to this (and also other languages that now use it e.g. Swift, Kotlin).
I will say that it's still not as nice as using Trio. Partly that's because it has edge-triggered cancellation (calling task.cancel() injects a single cancellation exception) rather than Trio's level-triggered cancellation (once a scope is cancelled, including the scope implicit in a nursery, it stays cancelled so future async calls all throw Cancelled unless shielded). The interaction between asyncio TaskGroup and its older task API is also really awkward (how do I update the task's cancelled count if an unrelated task I'm waiting on throws Cancelled?). But it's a huge improvement if you're forced to use asyncio.
C++ is great, coroutines are not. Neither of these are good ways to handle concurrency. You really need a more generalized graph and to minimize threads and context switching. You can't do more than the number of logical cores on a CPU anyway.
Not really: C++'s unsafe-first approach means that workarounds like Pin aren't required.
Additionally, for those with .NET background, C++ co-routines are pretty much inspired by how they work in .NET/C#, naturally with the added hurdle there isn't a GC, and there is some memory management to take into account.
Also, even if it takes some time across ISO working processes, there is still a goal to have some capabilities in the standard library, whereas in Rust's case that means "use tokio" instead.
For a layperson it's clear that it's either "Writings" and "Talks", or "Readings" and "Listenings", but CPP proficiency is in an inverse relation with being apt in taxonomy, it looks like.
You can roll stackful coroutines in C++ (or C) with 50-ish lines of Assembly. It's a matter of saving a few registers and switching the stack pointer, minicoro [1] is a pretty good C library that does it. I like this model a lot more than C++20 coroutines:
1. C++20 coros are stackless, in the general case every async "function call" heap allocates.
2. If you do your own stackful coroutines, every function can suspend/resume, you don't have to deal with colored functions.
3. (opinion) C++20 coros are very tasteless and "C++-design-committee pilled". They're very hard to understand and implement, they require the STL, they're very heavy in debug builds, and you'll end up with template hell to do something as simple as Promise.all
> You can roll stackful coroutines in C++ (or C) with 50-ish lines of Assembly
I'm not normally keen to "well actually" people with the C standard, but .. if you're writing in assembly, you're not writing in C. And the obvious consequence is that it stops being portable. Minicoro only supports three architectures. Granted, those are the three most popular ones, but other architectures exist.
(just double checked and it doesn't do Windows/ARM, for example. Not that I'm expecting Microsoft to ship full conformance for C++23 any time soon, but they have at least some of it)
Hmm. I'm fairly certain that most of that assembly code for saving/restoring registers can be replaced with setjmp/longjmp, and only control transfer itself would require actual assembly. But maybe not.
That's the problem with register machines, I guess. Interestingly enough, BCPL, its main implementation being a p-code interpreter of sorts, has pretty trivially supported coroutines in its "standard" library since the late seventies — as you say, all you need to save is the current stack pointer and the code pointer.
C++ destructors and exception safety will likely wreak havoc with any "simple" assembly/longjmp-based solution, unless severely constraining what types you can use within the coroutines.
That it has to heap-allocate if non-inlined is a misconception. This is only the default behavior.
One can define:
void *operator new(size_t sz, Foo &foo)
in the coro's promise type, and this:
- removes the implicitly-defined operator new
- forces the coro's signature to be CoroType f(Foo &foo), and forwards the coroutine's arguments to the operator new defined above
Therefore, it's pretty trivial to support coroutines even when heap cannot be used, especially in the non-recursive case.
Yes, green threads ("stackful coroutines") are more straightforward to use, however:
- they can't be arbitrarily destroyed when suspended (this would require stack unwinding support and/or active support from the green thread runtime)
- they are very ABI dependent. Among the "few registers" one has to save are the FPU registers, which gets complicated on older Arm architectures and with codegen options similar to -mgeneral-regs-only (for code that runs "below" userspace). Said FPU registers also take a lot of space in the stack frame
Really, stackless coros are just FSM generators (which is obvious if one looks at disasm)
Not an expert in game development, but I'd say the issue with C++ coroutines (and 'colored' async functions in general) is that the whole call stack must be written to support that. From a practical perspective, that must in turn be backed by a multithreaded event loop to be useful, which is very difficult to write performantly and correctly. Hence, most people end up using coroutines with something like boost::asio, but you can do that only if your repo allows a 'kitchen sink' library like Boost in the first place.
More broadly the dimension of time is always a problem in gamedev, where you're partially inching everything forward each frame and having to keep it all coherent across them.
It easily can, and often does, lead to messy Rube Goldberg machines.
There was a game AI talk a while back, I forget the name unfortunately, but as I recall the guy was pointing out this friction and suggesting additions we could make at the programming language level to better support that kind of time spanning logic.
As the author lays out, the thing that made coroutines click for me was the isomorphism with state machine-driven control flow.
That’s similar to most of what makes C++ tick: There’s no deep magic, it’s “just” type-checked syntactic sugar for code patterns you could already implement in C.
(Occurs to me that the exceptions to this … like exceptions, overloads, and context-dependent lookup … are where C++ has struggled to manage its own complexity.)
This is one reason why I built coroutines into my game programming language Easel (https://easel.games). I think they let you keep the flow of the code matching the flow of your logic (top-to-bottom), rather than jumping around, and so I think they are a great tool for high-level programming. The main thing is stopping the coroutines when the entity dies, and in Easel that is done by implying ownership from the context they are created in. It is quite a cool way of coding I think: it avoids the state machines the OP mentioned and keeps everything straightforward and step-by-step, so all the code feels more natural in my opinion. In Easel they are called behaviors if anyone is interested in more detail: https://easel.games/docs/learn/language/behaviors
I don't know, I'm not convinced by this argument.
The "ugly" version with the switch seems much preferable to me.
It's simple, works, has way less moving parts and does not require complex machinery to be built into the language. I'm open to being convinced otherwise but as it stands I'm not seeing any horrible problems with it.
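For reference, a minimal sketch of the kind of switch-based version being argued for here. Countdown is a made-up example, not the code from the article: it yields n, n-1, ..., 1, then 0, then reports that it is finished, and all of the state that a coroutine frame would hold is spelled out as members.

```cpp
#include <cassert>
#include <optional>

// Hypothetical hand-written state machine; each call to next() resumes
// from wherever the previous call left off.
class Countdown {
    enum class State { Start, Counting, Done };

    State state_ = State::Start;
    int i_ = 0;  // loop variable, kept alive between calls
    int n_;

public:
    explicit Countdown(int n) : n_(n) {}

    // Returns the next value, or std::nullopt once the sequence is exhausted.
    std::optional<int> next() {
        switch (state_) {
        case State::Start:
            i_ = n_;
            state_ = State::Counting;
            [[fallthrough]];
        case State::Counting:
            if (i_ > 0) return i_--;  // "yield" the current value
            state_ = State::Done;
            return 0;                 // final "yield"
        case State::Done:
            break;
        }
        return std::nullopt;
    }
};
```

For something this small the switch is arguably fine. The usual complaint is about what happens once there are nested loops or several locals that have to survive across calls, since each of them has to be hoisted into a member and each resume point needs its own case.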
Coroutines generally imply some sort of magic to me.
I would just go straight to tbb and concurrent_unordered_map!
The challenge of parallelism does not come from how to make things parallel, but how you share memory:
How you avoid cache misses, make sure threads don't trample each other and design the higher level abstraction so that all layers can benefit from the performance without suffering turnaround problems.
My challenge right now is how do I make the JVM fast on native memory:
1) Rewrite my own JVM.
2) Use the buffer and offset structure Oracle still has but has deprecated and is encouraging people to not use.
We need Java/C# (already has it but is terrible to write native/VM code for?) with bottlenecks at native performance and one way or the other somebody is going to have to write it?
I've been doing a lot of work with ECS/Dots recently and once I wrapped my head around it - amazing.
I recall working on a few VR projects - where it's imperative that you keep that framerate solid or risk making the user physically sick - and this is where I really began using coroutines for instantiating large volumes of objects and so on (and avoiding framerate stutter).
ECS/Dots & the burst compiler makes all of this unnecessary and the performance is nothing short of incredible.
This is quite understandable when you know the history behind how C++ coroutines came to be.
They were initially proposed by Microsoft, based on a C++/CX extension, that was inspired by .NET async/await implementation, as the WinRT runtime was designed to only support asynchronous code.
Thus if one knows how the .NET compiler and runtime magic works, including custom awaitable types, there will be some common bridges to how C++ coroutines ended up looking.
Stackless coroutines in C when? As an embedded dev, I miss them deeply. Certainly not enough RAM to give a separate stack for everything and rewriting every async call as a callback sequence sucks.
The 'primitive' SCUMM language used for writing Adventure Games like Maniac Mansion had coroutines - an ill fated attempt to convert to using Python was hampered by Python (at the time) having no support for yield.
I do not find so called "green threads" useful at all. In my opinion except some very esoteric cases they serve no purpose in "native" languages that have full access to all OS threading and IO facilities. Useful only in "deficient" environments like inherently single threaded request handlers like NodeJS.
[0] https://web.archive.org/web/20260105235513/https://www.chiar...
> literal types (ints, sized ints, bools, etc) “built in” to the compiler but were defined in the language.
This is actually a good example by itself.
Int is defined in Swift with Builtin.Int64 IIRC. That is not part of the Swift language.
[1] https://vorpus.org/blog/notes-on-structured-concurrency-or-g...
[2] https://trio.readthedocs.io/en/stable/
[1] https://docs.python.org/3/library/asyncio-task.html#id6
[2] https://github.com/python/cpython/issues/90908
Thanks for the list.
[1] https://github.com/edubart/minicoro
> every async "function call" heap allocates.
> require the STL
> turns it into some sort of ugly state machine
Why are people afraid of state machines? There's been sooo much effort spent on hiding them from the programmer...
I never understood the value. Just use lambdas/callbacks.
>> To misquote Kennedy, “we chose to focus coroutines on generator in C++23, not because it is hard, but because it is easy”.
Appreciate this humor -- absurd, tasteful.