Union types in C# 15 (devblogs.microsoft.com)

by 0x00C0FFEE • 211 comments • 225 points

[−] amluto 37d ago

Hmm, they seem to have chosen not to give names to the choices in the union, joining C++ variants and (sort of) TypeScript unions: unions are effectively just defined by a collection of types.

Other languages have unions with named choices, where each name selects a type and the types are not necessarily all different. Rust, Haskell, Lean4, and even plain C unions are in this category (although plain C unions are not discriminated at all, so they’re not nearly as convenient).

I personally much prefer the latter design.

[−] Metasyntactic 37d ago

Hi there! One of the C# language designers here, working on unions. We're interested in both forms! We decided to go with this first as we felt there was the most value here, and we could build the named form on top of this. In no way are we thinking the feature is done in C# 15. But it's part of our ongoing evolution.

If you're interested, i can point you to specs i'm writing that address the area you care about :)

[−] novaleaf 36d ago

I think it would be a considerable improvement to allow duck-typing over the top. implicitly defined interface that includes exact member matches, something like that.

[−] roflcopter69 36d ago

Hi! Love to see that C# language designers are here on HN :)

Just wanted to add another opinion on a few things.

I think many people already mentioned it, but I also don't feel too good about non-boxed unions not being the default. I'd personally like the path of least resistance to lead to not boxing. Having to opt-in like the current preview shows it looks like a PITA that I'd quickly become tired of.

Also, ad-hoc union types could be really nice. At least in Python those are really nice, stuff like def foo(x: str | int) is just very nice. If I had to first give this union type a name I'd enjoy it way less.

But I'm aware that you are trying your best to find a good trade-off and I'm sure I don't know all the implications of the things I wish you'd do. But I just wanted to mention those to have another data point you can weigh into your decision process.

[−] Metasyntactic 36d ago

> Having to opt-in like the current preview shows it looks like a PITA that I'd quickly become tired of.

My belief is that we will have a union struct just like we have record and record struct. Where you can simply say you want value-type, non-boxing, behavior by adding the struct keyword. This feels very nice to me, and in line with how we've treated other similar types. You'll only have to state this on the decl point, so for each union type it's one and done.

[−] roflcopter69 35d ago

That actually sounds like a promising idea. Thanks for putting so much thought into this.

[−] jsmith45 36d ago

> I think many people already mentioned it, but I also don't feel too good about non-boxed unions not being the default. I'd personally like the path of least resistance to lead to not boxing. Having to opt-in like the current preview shows it looks like a PITA that I'd quickly become tired of.

The problem is that the only safe way for the compiler to generate non-boxed unions would require non-overlapping fields for most value types.

Specifically the CLR has a hard rule that it must know with certainty where all managed pointers are at all times, so that the GC can update them if it moves the referenced object. This means you can only overlap value types if the locations of all managed pointers line up perfectly. So sure, you can safely overlap "unmanaged" structs (those that recursively don't contain any managed pointers), but even for those, you need to know the size of the largest one.

The big problem with the compiler attempting to overlap value types is that the value types as defined at compile time may not match the definitions at runtime, especially for types defined in another assembly. A new library version can add more fields. This may mean one unmanaged struct has become too big to fit in the field, or that two types that were previously overlap-compatible no longer are.

Making the C# compiler jump through a bunch of hoops to determine whether overlapping is safe, while still leaving room for an updated library to crash the whole thing at runtime, means that the compiler will probably never even try. I guess the primitive numeric types could be special-cased, as their size is known and will never change.

[−] Shalomboy 37d ago

Not OP but I would love to check that out

[−] Metasyntactic 37d ago

Sure, as an example: https://github.com/dotnet/csharplang/blob/main/meetings/work...

Again, very rough. We go round and round on things. Loving to decompose, debate and determine how we want to tackle all these large and interesting areas :)

[−] amluto 37d ago

Nice! (Well, nice the first time I loaded the page, but GitHub appears to be rocking maybe 90% uptime today and I can’t see it anymore.)

I admit that I haven’t actually used C# in 20 years or so.

[−] moi2388 36d ago

Oh I’m very happy to hear this is being worked on!

[−] gf000 36d ago

Careful not to mix unions with sum types, though. The key distinction is that the latter are disjoint sets: even if you "sum" together the same type twice, you can always tell which "way" you went.

An example that may show the difference: if you have a language with nullable types, then you basically have a language with union types like String|Null, where the Null type has a single value called null and String can not be null.

Now if you pass this through a function that itself may return null, then your type still coalesces to String|Null (you still get a nullable string; there is no doubly nullable). This is not true for Maybe/Option types, or whatever you call them, where Some(None) (or Optional.of(Optional.empty())) is distinct from a plain None.
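
A minimal Rust sketch of that difference, using `Option` as the sum type. The `lookup` function and its keys are made up for illustration; the point is only that the nested form keeps two kinds of "nothing" apart, while flattening it (which is what String|Null does implicitly) merges them.

```rust
/// Hypothetical lookup: the outer Option says whether the key exists,
/// the inner Option says whether the stored value is set.
fn lookup(key: &str) -> Option<Option<&'static str>> {
    match key {
        "name" => Some(Some("Ada")),
        "nickname" => Some(None), // key exists, value is absent
        _ => None,                // key does not exist
    }
}

fn main() {
    // With a sum type the two "empty" cases stay distinguishable.
    assert_ne!(lookup("nickname"), lookup("missing"));

    // Flattening collapses them into a single "no value" case.
    assert_eq!(lookup("nickname").flatten(), lookup("missing").flatten());

    println!("{:?} vs {:?}", lookup("nickname"), lookup("missing"));
}
```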

Rich Hickey once made a case that sort of became controversial in some FP circles, that the former can sometimes be preferred (e.g. at public API surfaces), as in for a parameter you take a non-nullable String but for returns you return a String|Null. In this case you can have an API-compatible change widening the input parameters' type, or restricting the return type - meanwhile with sum types you would have to do a refactor because Maybe String is not API compatible with String.

[−] Zecc 36d ago

> Careful not to mix unions with sum types, though. The key distinction is that the latter are disjoint sets: even if you "sum" together the same type twice, you can always tell which "way" you went.

This is a really good point. I'd love to be able to have a sum type of two strings ("escaped" and "unescaped"), or really any two kinds of the same type, where one has already passed some sort of validation and the other one hasn't.

Edit to add: I figure what I want is for enums to be extended such that different branches are able to carry different properties.

Edit again (I should learn to think things through before posting. sorry): I suppose it can be faked using a union of different wrapper types, and in fact it might be the best way to do it so then methods can take just one of the types in arguments and maybe even provide different overloads.
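
For what it's worth, here is roughly what the wrapper-type approach could look like, sketched in Rust rather than C#. All names (`Escaped`, `Unescaped`, `Text`, `escape`, `ensure_escaped`) are hypothetical, and the escaping itself is deliberately simplistic.

```rust
/// Hypothetical wrapper types: both hold a String, but the type records
/// whether escaping has already happened.
#[derive(Debug, PartialEq)]
struct Escaped(String);

#[derive(Debug, PartialEq)]
struct Unescaped(String);

/// The sum of the two kinds; each variant names which "way" we went.
#[derive(Debug, PartialEq)]
enum Text {
    Escaped(Escaped),
    Unescaped(Unescaped),
}

/// Only raw text can be escaped; the return type proves it was done.
fn escape(raw: Unescaped) -> Escaped {
    Escaped(raw.0.replace('&', "&amp;").replace('<', "&lt;"))
}

/// Accepts either kind and normalizes it to escaped text.
fn ensure_escaped(text: Text) -> Escaped {
    match text {
        Text::Escaped(already) => already,
        Text::Unescaped(raw) => escape(raw),
    }
}

fn main() {
    let raw = Text::Unescaped(Unescaped("a < b".to_string()));
    let done = Text::Escaped(Escaped("a &lt; b".to_string()));
    assert_eq!(ensure_escaped(raw), ensure_escaped(done));
}
```

Functions that only make sense for one kind, like `escape`, take the wrapper directly, so passing already-escaped text is a compile error instead of a double-escaping bug.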

[−] repelsteeltje 37d ago

Not sure, but I think C++ actually does allow std::variant with multiple choices using the same type. You might not be able to distinguish between them by type (using get<T>()), but you can by position (get<0>(), get<1>(), ...)

[−] amluto 37d ago

I haven’t tried this, and I don’t intend to, because visitors and similar won’t work (how could they?) and I don’t want to have to think about which is choice 2 and which is choice 7.

[−] NooneAtAll3 37d ago

I think GP is talking about name-of-field access, not index access or name-of-type

[−] tialaramex 37d ago

The C and Rust union types are extremely sharp blades, enough so that I expect the average Rust beginner doesn't even know Rust has unions (and I assume you were thinking of Rust's enum not union)

I've seen exactly one Rust type which is actually a union, and it's a pretty good justification for the existence of this feature, but one isn't really enough. That type is MaybeUninit which is a union of a T and the empty tuple. Very, very, clever and valuable, but I didn't run into any similarly good uses outside that.
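
A small sketch of what MaybeUninit buys you, assuming nothing beyond the standard library. The `squares` function is a made-up example: it fills an array slot by slot without first writing a dummy value into every element.

```rust
use std::mem::MaybeUninit;

/// Builds [0, 1, 4, 9] without initializing the array twice.
fn squares() -> [u32; 4] {
    // Each slot starts out uninitialized; reading one now would be UB.
    let mut slots: [MaybeUninit<u32>; 4] = [MaybeUninit::uninit(); 4];

    for (i, slot) in slots.iter_mut().enumerate() {
        slot.write((i * i) as u32);
    }

    // Sound only because the loop above wrote every slot.
    slots.map(|slot| unsafe { slot.assume_init() })
}

fn main() {
    println!("{:?}", squares());
}
```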

[−] masklinn 37d ago

Unions can be used as a somewhat safer (not safe by any means but safer), more flexible, and less error-prone form of transmute. Notably you can use unions to transmute between a large type and a smaller type.

That is essentially the motivation, primarily in the context of FFI where matching C's union behaviour using transmute is tricky and error-prone.
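
A rough sketch of the large-to-small case in Rust. The `Word` union and `first_byte_of` are hypothetical names; `#[repr(C)]` is used so that both fields are guaranteed to start at offset 0.

```rust
/// Hypothetical FFI-style union: both fields start at offset 0.
#[repr(C)]
union Word {
    wide: u64,
    first_byte: u8,
}

/// Reads the first byte in memory of a u64 through the union.
/// `std::mem::transmute::<u64, u8>` would be rejected at compile time
/// because the two types have different sizes.
fn first_byte_of(value: u64) -> u8 {
    let word = Word { wide: value };
    // Unsafe: the compiler cannot check which field was last written.
    unsafe { word.first_byte }
}

fn main() {
    // Build the u64 from bytes so the result does not depend on endianness.
    let value = u64::from_ne_bytes([0xAB, 1, 2, 3, 4, 5, 6, 7]);
    assert_eq!(first_byte_of(value), 0xAB);
}
```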

[−] randomNumber7 37d ago

There are rare cases where all attributes of the C union are valid at the same time. Say you have a 32-bit RGBA color value and you want to access the individual 8 bit values. You can make a union of an 32 bit int and a struct that contains 4x 8 bit integers.

Also you can manually tag them and get something more like other high-level languages. It will just look ugly.
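
The RGBA case can be sketched in Rust like this (the same layout idea as a C union; the `Rgba`, `Color`, and `split` names are invented here). Which channel lands in which bits of the integer depends on endianness, so the example builds the integer from bytes in memory order.

```rust
/// Four 8-bit channels laid out back to back (hypothetical names).
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rgba {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

/// The same 32 bits viewed either as one integer or as four channels.
#[repr(C)]
union Color {
    packed: u32,
    channels: Rgba,
}

/// Splits a packed pixel into channels by reading the other union field.
fn split(packed: u32) -> Rgba {
    let color = Color { packed };
    // Every bit pattern is valid for both fields, so this read is sound.
    unsafe { color.channels }
}

fn main() {
    let pixel = split(u32::from_ne_bytes([0x10, 0x20, 0x30, 0xFF]));
    assert_eq!(pixel, Rgba { r: 0x10, g: 0x20, b: 0x30, a: 0xFF });
}
```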

[−] Animats 37d ago

Yes. I once wanted C unions limited to fully mapped type conversions, where any bit pattern in either type is a valid bit pattern in the other. Then you can map two "char" to "int". Even "float". But pointer types must match exactly.
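
As a point of comparison, Rust's standard library exposes these "every bit pattern is valid on both sides" conversions as safe functions, no union needed. A small sketch, assuming a 16-bit integer for the two-byte case; the wrapper function names are only for illustration.

```rust
/// Two bytes to one 16-bit integer: total in both directions.
fn bytes_to_u16(bytes: [u8; 2]) -> u16 {
    u16::from_le_bytes(bytes)
}

/// 32 raw bits to a float: also total, every pattern is some f32.
fn bits_to_f32(bits: u32) -> f32 {
    f32::from_bits(bits)
}

fn main() {
    let n = bytes_to_u16([0x34, 0x12]);
    assert_eq!(n, 0x1234);
    assert_eq!(n.to_le_bytes(), [0x34, 0x12]);

    let one = bits_to_f32(0x3F80_0000);
    assert_eq!(one, 1.0);
    assert_eq!(one.to_bits(), 0x3F80_0000);
}
```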

If you want disjoint types, something like Pascal's discriminated variants or Rust's enums is the way to go. It's embarrassing that C never had this.

Many bad design decisions in C come from the fact that originally, separate compilation was really dumb, so the compiler would fit in small machines.

[−] pjmlp 36d ago

In Rust's case, union types should only be used for FFI with a C ABI.

As for C, it is a sharp blade on its own.

[−] tialaramex 36d ago

I do not agree. MaybeUninit is without any doubt more valuable than the C FFI use.

I can't even think of any prominent C FFI problems where I'd reach for the union's C representation. Too many languages can't handle that so it seems less useful at an FFI edge.

[−] pjmlp 36d ago

OS APIs for one, at least there are some Win32 calls that take unions if I remember correctly.

One of the reasons .NET had Managed C++, replaced by C++/CLI (nowadays C++20 compliant, minus modules), is exactly that P/Invoke (and RCW/CCW) cannot represent everything.

These are things they don't want to expose in the .NET type system directly.

[−] SkiFire13 36d ago

FYI small string optimizations are generally implemented using unions.

[−] tialaramex 36d ago

In Rust? The two I'm big fans of, CompactString and ColdString, do not use unions, although historically CompactString did, and it still has a dependency on smallvec's union feature.

ColdString is easier to explain, the whole trick here is the "Maybe this isn't a pointer?" trick, ColdString might be a single raw pointer onto your heap with the rest of the data structure at the far end of the pointer, this case is expensive because nothing about the text lives inline, but... the other case is that your entire text was hidden in the pointer, on modern hardware that's 8 bytes of text, at no overhead, awesome.

CompactString is more like a drop-in replacement, it's much bigger, the same size as String, so 24 bytes on modern hardware, but that's all SSO, so text like "This will all fit nicely" fits inline, yet the out-of-line case has the usual affordances such as capacity and length in the data structure. This isn't doing the "Maybe this isn't a pointer?" trick but is instead relying on knowing that the last byte of a UTF-8 string can't have certain values by definition.
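
The UTF-8 property being relied on can be checked with a few lines of Rust. This is only a sketch of the property, not of CompactString's actual encoding (which isn't described here): the last byte of valid UTF-8 is either ASCII or a continuation byte, so 0xC0 through 0xFF are free to mean something else. The helper name is made up.

```rust
/// Returns true if `byte` can be the last byte of valid, non-empty UTF-8.
/// The last byte is either ASCII (0x00..=0x7F) or a continuation byte
/// (0x80..=0xBF), so 0xC0..=0xFF never occur in that position.
fn can_end_utf8(byte: u8) -> bool {
    byte < 0xC0
}

fn main() {
    for text in ["This will all fit nicely", "naïve", "日本語", "🦀"] {
        let last = *text.as_bytes().last().unwrap();
        assert!(can_end_utf8(last));
    }

    // A lead byte in final position makes the buffer invalid UTF-8.
    assert!(std::str::from_utf8(&[b'a', 0xC3]).is_err());
    assert!(!can_end_utf8(0xC3));

    // The sample text from the comment is exactly 24 bytes long.
    assert_eq!("This will all fit nicely".len(), 24);
}
```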

[−] SkiFire13 34d ago

My bad I thought it still used unions. I guess I'm not very up to date

[−] tialaramex 36d ago

I realise that I don't do the best job of explaining ColdString here. After all most 8 byte strings of UTF-8 text could equally be a pointer so, why can this work?

All ColdStrings which look like 8 bytes of UTF-8 text really are 8 bytes of UTF-8 text; the type label on those 8 bytes just isn't "[u8; 8]", an array of 8 bytes, but instead "*mut u8", a raw pointer. "Validate" for example is 8 bytes of ASCII, thus UTF-8, and Rust is OK with us just saying we want a pointer on a 64-bit machine with those bytes. It's not a valid pointer, but it is a pointer and Rust is OK with that; we just need to be careful never to [unsafely] dereference the pointer, because it's invalid.

OK, so there are two cases left: First, what if there are fewer bytes of text? Zero even?

Since there are fewer than 8 bytes of text we can use the whole first byte to signal how many of the remainder are text, we use the UTF-8 over-long prefix indicator in which the top five bits of the byte are all set, bytes 0xF8 through 0xFF for this, there are eight of these bytes corresponding to our 8 lengths 0 through 7 inclusive. Because it's over-long this indicator isn't itself a valid UTF-8 prefix. Again we can pretend this is a pointer while knowing it's invalid.
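
To make the short-text case concrete, here is a sketch that follows the description above on a plain 8-byte array. It is not the cold-string crate's code, and `encode_short` / `decode_short` are invented names; the real layout may differ in its details.

```rust
/// Hypothetical inline encoding for texts of 0..=7 bytes.
fn encode_short(text: &str) -> Option<[u8; 8]> {
    let bytes = text.as_bytes();
    if bytes.len() > 7 {
        return None; // needs one of the other two representations
    }
    let mut word = [0u8; 8];
    // 0xF8..=0xFF: top five bits set, low three bits hold the length.
    word[0] = 0xF8 | (bytes.len() as u8);
    word[1..1 + bytes.len()].copy_from_slice(bytes);
    Some(word)
}

/// Recovers the text if the first byte carries the length tag.
fn decode_short(word: &[u8; 8]) -> Option<&str> {
    if (word[0] & 0xF8) != 0xF8 {
        return None;
    }
    let len = (word[0] & 0x07) as usize;
    std::str::from_utf8(&word[1..1 + len]).ok()
}

fn main() {
    let word = encode_short("cold").unwrap();
    assert_eq!(word[0], 0xFC); // 0xF8 | 4
    assert_eq!(decode_short(&word), Some("cold"));

    // Eight bytes of text do not fit behind a one-byte tag.
    assert!(encode_short("Validate").is_none());
}
```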

Lastly, the seemingly trickiest problem, what if the string didn't fit inline? We use a heap allocation to store the text prefixed by a variable size integer length and we insist this allocation is aligned to 4 bytes. This means a valid pointer to our allocation has zeroes for the bottom two bits, then we rotate that pointer so those bottom two bits are at the top of the first byte position (depending on machine word layout) and we set the top bit. This is now always invalid UTF-8 because it has the continuation marker - the top bit is set but the next is not, which cannot happen in the first byte of any UTF-8 text, and so our code can detect this and reverse the transformation to get back a valid pointer using the strict provenance APIs if this marker is present.
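
The heap case can be sketched with ordinary integer arithmetic. This version works on a `u64` standing in for the address, treats the little-endian low byte as the "first byte", and ignores the strict provenance APIs entirely, so it only illustrates the bit manipulation; the function names are made up and the crate's real code may differ.

```rust
/// Packs a 4-byte-aligned address so that its first byte (little-endian)
/// looks like a UTF-8 continuation byte, 0b10xx_xxxx.
fn tag_address(addr: u64) -> u64 {
    assert_eq!(addr & 0b11, 0, "address must be 4-byte aligned");
    // Rotating left by 6 moves the two zero low bits to bits 6 and 7
    // of the low byte; setting bit 7 then yields the 10 prefix.
    addr.rotate_left(6) | 0x80
}

/// True if the first byte has the continuation-byte prefix.
fn is_heap_marker(word: u64) -> bool {
    (word.to_le_bytes()[0] & 0xC0) == 0x80
}

/// Reverses `tag_address`.
fn untag_address(word: u64) -> u64 {
    (word & !0x80).rotate_right(6)
}

fn main() {
    let addr: u64 = 0x0000_7F12_3456_7890;
    let tagged = tag_address(addr);

    assert!(is_heap_marker(tagged));
    // No valid UTF-8 text starts with a continuation byte.
    assert!(std::str::from_utf8(&tagged.to_le_bytes()).is_err());
    assert_eq!(untag_address(tagged), addr);
}
```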

This type is tomtomwombat's idea, credit to them:

https://github.com/tomtomwombat/cold-string

[−] amluto 37d ago

> The C and Rust union types are extremely sharp blades

Sure, but the comparable Rust feature is enum, not union.

[−] ks2048 37d ago

I would love to see a page that has cross-language comparisons of how different structures work. Eg “unions: differences between langs. Enums: …” while grouping together the different design choices, as you do in this one case.

I suppose an LLM could do a pretty good job at this.

[−] dgb23 36d ago

I think Zig illustrates this distinction very well. They separate:

- enum: essentially just a typed uint tag collection

- union: the plain data structure that can contain any of several types

- tagged union: combines enums and unions, so you can dispatch on its tag to get one of the union types

Read from this section and they appear in order:

https://ziglang.org/documentation/master/#enum
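
The same three-way split can be mimicked in Rust, as a rough sketch rather than a translation of any Zig code. `Kind`, `Payload`, and `Number` are hypothetical names; Rust's own `enum` with payloads already does the third step for you, so the hand-built version here is purely illustrative.

```rust
/// enum: just a typed integer tag.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Int,
    Float,
}

/// union: overlapping storage with no record of which field is live.
#[repr(C)]
#[derive(Clone, Copy)]
union Payload {
    int: i32,
    float: f32,
}

/// tagged union, built by hand: the tag says which payload field to read.
#[derive(Clone, Copy)]
struct Number {
    kind: Kind,
    payload: Payload,
}

impl Number {
    fn from_int(value: i32) -> Self {
        Number { kind: Kind::Int, payload: Payload { int: value } }
    }

    fn from_float(value: f32) -> Self {
        Number { kind: Kind::Float, payload: Payload { float: value } }
    }

    /// Dispatches on the tag. Sound only because the constructors
    /// keep the tag and the payload in sync.
    fn as_f64(self) -> f64 {
        match self.kind {
            Kind::Int => f64::from(unsafe { self.payload.int }),
            Kind::Float => f64::from(unsafe { self.payload.float }),
        }
    }
}

fn main() {
    assert_eq!(Number::from_int(3).as_f64(), 3.0);
    assert_eq!(Number::from_float(1.5).as_f64(), 1.5);
}
```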

[−] tialaramex 37d ago

I don't love OneOrMore

It's trying to generalize - we might have exactly one T, fine, or a collection of T, and that's more T... except no, the collection might be zero of them, not at least one and so our type is really "OneOrMoreOrNone" and wow, that's just maybe some T.

[−] merb 37d ago

OneOrMore is more or less an example from the functional world. i.e.:

https://hackage.haskell.org/package/oneormore or scala: https://typelevel.org/cats/datatypes/nel.html

it's for type purists, because sometimes you want the first element of the list but if you do that you will get T? which is stupid if you know that the list always holds an element, because now you need to have an unnecessary assertion to "fix" the type.

[−] dasyatidprime 37d ago

The NonEmptyList in Cats is a product (struct/tuple) type, though; I assume the Haskell version is the same. The type shown in the blog post is a sum (union) type which can contain an empty enumerable, which contradicts the name OneOrMore. The use described for the type in the post (basically a convenience conversion funnel) is different and makes sense in its own right (though it feels like kind of a weak use case). I'm not sure what a good name would've been illustratively that would've been both accurate and not distracting, though.
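
To put the two shapes side by side, here is a hedged Rust rendering. Neither type is taken from the blog post or from Cats; `OneOrMore` and `NonEmpty` are simplified stand-ins. The sum-shaped type cannot rule out an empty collection, so `first` has to return an `Option`; the product-shaped type always has a head.

```rust
/// Sum-shaped: either a single item or a collection of items.
/// Nothing stops `Many` from holding zero items.
enum OneOrMore<T> {
    One(T),
    Many(Vec<T>),
}

impl<T> OneOrMore<T> {
    /// Has to be optional, because `Many(vec![])` is representable.
    fn first(&self) -> Option<&T> {
        match self {
            OneOrMore::One(item) => Some(item),
            OneOrMore::Many(items) => items.first(),
        }
    }
}

/// Product-shaped non-empty list: a head that always exists plus a tail.
struct NonEmpty<T> {
    head: T,
    tail: Vec<T>,
}

impl<T> NonEmpty<T> {
    /// No Option needed: the head is part of the structure.
    fn first(&self) -> &T {
        &self.head
    }

    fn len(&self) -> usize {
        1 + self.tail.len()
    }
}

fn main() {
    let empty: OneOrMore<i32> = OneOrMore::Many(Vec::new());
    assert!(empty.first().is_none());
    assert_eq!(OneOrMore::One(5).first(), Some(&5));

    let list = NonEmpty { head: 1, tail: vec![2, 3] };
    assert_eq!(*list.first(), 1);
    assert_eq!(list.len(), 3);
}
```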

[−] merb 37d ago

Well, you are right of course. I just wanted to explain what they wanted to show. Of course the type would be wrong if the second entry is itself an empty list. I just wanted to explain the reasoning behind what they tried to accomplish.

They could’ve done the Either type which would’ve been more correct or maybe EitherT (if the latter is even possible)

[−] dasyatidprime 37d ago

I don't think they were trying to accomplish the same thing as the Scala/Haskell version; these are just two completely different things that happen to share a name because the blog post gave the example a name that is confusing when read literally. The purpose of the Cats version is “there is always a head element”. The purpose of the union in the blog post is more like “this can be a collection, but many callers will be thinking of it as a single element, so don't put the burden on them to convert it”. I do think it's a weak case for them in a type theory sense (I would tend to position that kind of implicit conversion elsewhere in the language), but I can also see it being motivating to a large class of developers…

… wait, I've made a different mistake here while trying to explain the difference, haven't I? I was describing it as a sum type, but it's not really a sum type, it's really just set-theoretic union, right?

Which also means OneOrMore is unsound in a different way because it doesn't guarantee that T and IEnumerable are disjoint; OneOrMore