> You may wonder whether I tried asking an LLM for help or not. Well, I did. In fact it was very helpful in some tasks like summarizing kernel logs [^13] and extracting the gist of them. But when it came to debugging based on all the clues that were available, it concluded that my code didn't have any bugs, and that the CPU hardware was faulty.
This matches my experience whenever I do unconventional or deep work like the article describes. The engineers comfortable with this type of work will multiply their worth.
Everybody seems to be missing the forest for the trees on this.
There is absolutely no "sign extension" in the C standard (go ahead, search it). "Sign extension" is a feature of some assembly instructions on some architectures, but C has nothing to do with it.
Citing integer promotion from the standard is justified, but it's just one part (perhaps even the smaller part) of the picture. The crucial bit is not quoted in the article: the specification of "Bitwise shift operators". Namely:
> The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. [...]
> The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1×2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1×2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.
What happens here is that "base2" (of type uint8_t, which is "unsigned char" in this environment) gets promoted to "int", and then left-shifted by 24 bits. You get undefined behavior because, while "base2" (after promotion) has a signed type ("int") and nonnegative value, E1×2^E2 (i.e., base2 × 2^24) is NOT representable in the result type ("int").
What happens during the conversion to "uint64_t" afterwards is irrelevant; even the particulars of the sign bit of "int", and how you end up with a negative "int" from the shift, are irrelevant; you got your UB right inside the invalid left-shift. How said UB happens to materialize on this particular C implementation may perhaps be explained in terms of sign extension of the underlying ISA -- but do that separately; be absolutely clear about what is what.
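To make that concrete, here is a minimal sketch of the failure mode ("base2" is the name from the article; the surrounding scaffolding is hypothetical):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t base2 = 0x80;  /* any value with the top bit set triggers it */

    /* Buggy: base2 is promoted to (signed, typically 32-bit) int, and
     * 0x80 << 24 equals 2^31, which is not representable in int --
     * undefined behavior right inside the shift. On typical
     * implementations the int comes out negative, and the later
     * conversion to uint64_t then yields 0xffffffff80000000. */
    uint64_t bad = (uint64_t)(base2 << 24);

    /* Fixed: widen to uint64_t *before* shifting, so the shift happens
     * in an unsigned 64-bit type and is fully defined. */
    uint64_t good = (uint64_t)base2 << 24;

    printf("bad  = 0x%016llx\n", (unsigned long long)bad);
    printf("good = 0x%016llx\n", (unsigned long long)good);
    return 0;
}
```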
The article fails to mention the root cause (violating the rules for the bitwise left-shift operator) and fails to name the key consequence (undefined behavior); instead, it leads with not-a-thing ("sign-extension bug in C"). I'm displeased.
BTW this bug (invalid left shift of a signed integer) is common, sadly.
The root problem is actually that the C language allows implicit conversions from an unsigned type to a signed type and from a signed type to an unsigned type, and in certain contexts such implicit conversions are mandated by the standard, like in the buggy expression from the parent article.
It does not matter what the relationship between the sizes of such types is: there will always be values of the operand that cannot be represented in the result.
Saying that the behavior is sometimes undefined is not acceptable. Any implicit conversion of this kind must be an error. Whenever a conversion between signed and unsigned or unsigned and signed is desired, it must be explicit.
This may be the worst mistake that has ever been made in the design of the C language and it has not been corrected even after 50 years.
Making this an error would indeed produce a deluge of error messages in many carelessly written legacy programs, but converting such programs is trivial, and it is extremely likely that many of the cases where compilers currently signal no error can cause bugs in corner cases, like in the parent article.
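As a minimal illustration of one such mandated implicit conversion (the usual arithmetic conversions applied to a comparison; the snippet is illustrative, not from the article):

```c
#include <stdio.h>

int main(void) {
    int n = -1;
    unsigned int u = 1;

    /* The usual arithmetic conversions convert n to unsigned int here,
     * turning -1 into UINT_MAX, so the "obvious" comparison is false. */
    if (n < u)
        printf("-1 < 1u holds\n");
    else
        printf("-1 >= 1u: the implicit conversion at work\n");
    return 0;
}
```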
> It does not matter what the relationship between the sizes of such types is: there will always be values of the operand that cannot be represented in the result.
Hmm? Seems to me that unsigned -> larger signed works, although other conversions may not.
But yes, I generally agree that these are terrible conversions to do implicitly, given that the entire point of those types is to control the interpretation of memory at a bits-and-bytes level. Languages where implicit numeric conversions make sense are generally not languages that care so much about integer size, and the entire point of having unsigned types is to bake that range constraint in.
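For instance, a small sketch of the one direction that is always value-preserving, unsigned to a strictly larger signed type:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Every uint32_t value fits in int64_t, so this conversion can
     * never change the value. */
    uint32_t x = UINT32_MAX;
    int64_t  y = x;  /* exactly 4294967295 */
    printf("%lld\n", (long long)y);
    return 0;
}
```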
Obviously, that should always be used, as should the compiler options for checking integer overflow and out-of-bounds accesses.
However, this kind of implicit conversion must really be forbidden in the standard, because the correct program source is different from the one the standard currently permits.
When you activate most compiler options that detect undefined behaviors, the correct program source remains the same, even if the compiler now implements a better behavior for the translated program than the minimal behavior specified by the standard.
That happens because most undefined behaviors are detected at run time. Incorrect implicit conversions, on the other hand, are a property of the source code, which can always be detected during compilation, so such programs must be rejected.
Integer overflow and out-of-bounds accesses must be checked at run time, which makes the program slower. It looks like -Wsign-conversion can be checked at compile time, perhaps with a few false positives where the numbers are "always" small enough.
Does it also complain when the assigned variable is big enough to avoid the problem? Does the compiler generate slower code with the explicit conversions?
It looks like a nice task to compile major projects with -Wsign-conversion and send PRs fixing the warnings. (Assuming there are only a few, say 5. Sending an uninvited PR with a thousand changes will make the maintainers unhappy.)
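For reference, a tiny example of the kind of code -Wsign-conversion flags (the file name is made up, and the exact warning wording varies by compiler):

```c
/* sign_conv.c -- try: gcc -Wsign-conversion -c sign_conv.c */
#include <stddef.h>

size_t to_size(int n) {
    /* gcc and clang warn here: conversion from 'int' to 'size_t' may
     * change the sign of the result [-Wsign-conversion]. Silencing it
     * takes an explicit cast, ideally after a range check:
     *     return n < 0 ? 0 : (size_t)n;  */
    return n;
}
```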
The standard will not forbid anything that breaks billions of lines of code that are still used and maintained.
But it is easy enough to use modern tooling and coding styles to deal with signed overflow. Nowadays, silent unsigned wraparound causing logic errors is the more vexing issue, which suggests the undefined behavior actually helps rather than hurts when paired with good tooling.
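A hypothetical example of the kind of silent-wraparound logic error meant here, fully defined behavior and all:

```c
#include <stdio.h>
#include <stddef.h>

/* When buf_len is 0, buf_len - 1 wraps to SIZE_MAX, so the "bound
 * check" accepts every offset. No UB anywhere -- just wrong logic. */
static int in_bounds(size_t offset, size_t buf_len) {
    return offset <= buf_len - 1;  /* buggy when buf_len == 0 */
}

int main(void) {
    printf("%d\n", in_bounds(1000, 0));  /* prints 1: "in bounds" */
    return 0;
}
```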
> It does not matter what the relationship between the sizes of such types is: there will always be values of the operand that cannot be represented in the result.
It's not that bad actually; not "always". The only nontrivial case is when, as a part of the usual arithmetic conversions, you (perhaps unwittingly) convert a signed integer type to an unsigned integer type [*], and the original value was negative.
[*] This can happen in two cases (paraphrasing the standard):
- if the operand that has unsigned integer type has rank greater than or equal to the rank of the signed integer type of the other operand,
- if the operand that has signed integer type has rank greater than or equal to the rank of the unsigned integer type of the other operand, but the signed integer type cannot represent all values of the unsigned integer type.
Examples: (a) "unsigned int" vs. "signed int"; (b) "long signed int" vs. "unsigned int" in a POSIX ILP32 programming environment. Under (a), you get conversion to "unsigned int"; under (b), you get conversion (for both operands) to "long unsigned int".
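A small sketch of both cases (the outcome of case (b) depends on the data model, as described):

```c
#include <stdio.h>

int main(void) {
    /* Case (a): unsigned int vs. signed int -- equal rank, so the
     * signed operand converts to unsigned int and -1 becomes UINT_MAX. */
    signed int   s = -1;
    unsigned int u = 0;
    printf("(a): %s\n", (s < u) ? "-1 < 0u" : "-1 >= 0u");  /* >= */

    /* Case (b): signed long vs. unsigned int. On ILP32, long (32-bit)
     * cannot hold every unsigned int value, so BOTH operands convert
     * to unsigned long and -1L becomes ULONG_MAX. On LP64, long is
     * 64-bit and the unsigned int converts to long instead, so the
     * comparison behaves "normally". */
    long         l  = -1L;
    unsigned int u2 = 0;
    printf("(b): %s\n", (l < u2) ? "-1L < 0u" : "-1L >= 0u");
    return 0;
}
```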
Section "3.2 Conversions | 3.2.1 Arithmetic operands | 3.2.1.1 Characters, and integers" in the C89 Rationale <https://www.open-std.org/Jtc1/sc22/WG14/www/C89Rationale.pdf> is worth reading. (An updated version of the same section is included in the C99 Rationale <https://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.1...> under 6.3.1.1.)
It deals precisely with the problem highlighted in the blog post. I'll quote just the beginning and the end:
> Since the publication of K&R, a serious divergence has occurred among implementations of C in the evolution of integral promotion rules. Implementations fall into two major camps, which may be characterized as unsigned preserving and value preserving. [...]
> The unsigned preserving rules greatly increase the number of situations where unsigned int confronts signed int to yield a questionably signed result, whereas the value preserving rules minimize such confrontations. Thus, the value preserving rules were considered to be safer for the novice, or unwary, programmer. After much discussion, the Committee decided in favor of value preserving rules, despite the fact that the UNIX C compilers had evolved in the direction of unsigned preserving.
> QUIET CHANGE -- A program that depends upon unsigned preserving arithmetic conversions will behave differently, probably without complaint. This is considered the most serious semantic change made by the Committee to a widespread current practice.
It's incredibly common for people talking about C online or even in books (be that blog posts, side notes, tutorials, guides) to constantly make mistakes like these.
C seems to be one of those languages where people think they know it based on prior and adjacent experience. But it is not a language which can be learned based on experience alone. The language is full of cases where things will go badly wrong in a way which is neither obvious nor immediately evident. The negative side effects of what you did often only become evident long after you "learn" it as something you "can" do.
If you want to write C for anything where any security, safety, or reliability requirement needs to be met, you should commit to this strategy: Do not write any code which you are not absolutely certain you could justify the behaviour of by referencing the standard or (in the case of reliance on a specific definition of implementation defined, unspecified, or even (e.g. -ftrapv) undefined behaviour) the implementation documentation.
If you cannot commit to such a (rightfully mentally arduous) policy, you have no business writing C.
The same can actually be applied to C++ and Bash.
That's very interesting; I'm only familiar with the C++ standard, where bit shifts are defined in terms of multiplications and divisions by powers of 2: https://eel.is/c++draft/expr.shift
So it seems that, in regard to bit shifts, C++ behaves slightly differently from C (it seems to have less UB).
I don't believe there's anybody who can reason about them at code skimming speeds. It's probably the best place to hide underhanded code.
Sign extension bugs are the worst. Silent for ages then suddenly everything is on fire. Spent a lot of time in C doing low-level firmware work and ran into the same class of issue more than once. Nice writeup, congrats on the patch.
Great blog post. Using _BitInt typedefs for integers is a good option for anyone starting a fresh C project. It has worked well for me so far. _BitInt integers don't promote to signed automatically like regular integers in C.
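A minimal sketch of what that can look like, assuming a C23 toolchain; the typedef and function names are made up:

```c
/* Requires a C23 compiler (recent gcc or clang). */
typedef unsigned _BitInt(8)  u8;
typedef unsigned _BitInt(64) u64;

u64 make_addr(u8 base2) {
    /* Bit-precise integer types are exempt from the integer
     * promotions, so base2 never silently becomes a signed int.
     * Widths still matter, though: shifting a _BitInt(8) by 24
     * would exceed its own width, so widen explicitly first. */
    return (u64)base2 << 24;
}
```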
> Since virtualization is hardware assisted these days
I was running Xen with full-hardware virtualization on consumer hardware in... 2006. I mean: some of us here were running hardware virt before some of the commenters were born. Just to put the "these days" into perspective in case some would be thinking it's a new thing.
One thing that I am glad to have been taught early on in my career when it comes to debugging, especially anything involving HW, is to "make no assumptions". Bugs can be anywhere and everywhere.
I am just in awe after reading "The motherboard would be stuck in a zombie state", because that's EXACTLY what happened to me with normal KVM and QEMU (with modifications)! I kinda just pulled the plug and continued working, never to have it resurface again... until I continued reading... I thought I was doing something wrong in user land; turns out it was the UB sign shift all along.
One thing I noticed: The last footnote is missing.