When you have so few bits, does it really make sense to invent a meaning for the bit positions? Just use an index into a "palette" of pre-determined numbers.
As a bonus, any operation can be replaced with a lookup into an n×n table.
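As a rough sketch of the palette idea: the 4-bit code is just an index into a table of 16 pre-chosen values, and a binary operation becomes a precomputed 16×16 table of result codes. The palette below is made up for illustration, not any standardized FP4 format.

```python
# A made-up 16-entry palette; a real design would pick values to suit the data.
PALETTE = [0.0, 0.25, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0,
           -0.0, -0.25, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0]

def nearest_code(x):
    """Encode a real number as the index of the closest palette entry."""
    return min(range(16), key=lambda i: abs(PALETTE[i] - x))

# Any binary op becomes a 16x16 table of result codes; results outside
# the palette saturate to the nearest entry.
MUL_TABLE = [[nearest_code(PALETTE[a] * PALETTE[b]) for b in range(16)]
             for a in range(16)]

def mul(a, b):
    return MUL_TABLE[a][b]
```
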
In standard FP32, the infinities are represented with a sign bit, all exponent bits set to 1, and all mantissa bits set to 0; the NaNs have a sign bit, all exponent bits set to 1, and a non-zero mantissa. If you used that interpretation with FP4, you'd get the table below, which restricts the representable range to +/- 3, and that feels less useful to me. If you're using FP4 you're probably optimizing for space and don't want to waste a quarter of your possible combinations on things that aren't actually numbers; you'd likely focus your efforts on writing code that didn't need to represent inf and NaN.
You need it if you want the idea of total ordering over the extended Reals. There's +/- infinity--an affine closure, not projective (point at infinity)--so to make that math work you need to give 0 a sign.
That sounds pretty niche. What's a use case where you have less than 8 bits and that distinction is more important than having an extra finite value? I don't think AI is one.
For neural net gradient descent, automatic differentiation, etc., the widely used ReLU function has information-carrying derivatives at +0 and -0 if those are treated as infinitesimals.
Barely any information. After surviving ReLU, that signed zero is probably getting added to another value, and then oops, the information is gone. It sounds a lot worse than properly spaced values.
If you were looking at the entire number line, sign would roughly be the most important part.
But you still have all the other numbers carrying sign info. This is only the sign of denormals and that's way less valuable. Outside of particular equations it ends up added to something else and disappearing entirely. It would be way better to cut it and have either half the smallest existing positive value or double the largest existing value as a replacement. Or many other options.
For FP4, yes... sometimes... it depends. But newer Nvidia architectures, e.g. Blackwell with NVFP4, do not: they perform micro-block scaling in the core. On older architectures, low quants like FP4 are also often not done natively, and are instead inflated back to BF16, e.g. with BnB.
As explained in an article linked at the bottom of TFA, the weights of an LLM have a normal (Gaussian) distribution.
Because of that, when the weights are quantized to a few levels, the best compromise is to place the points encoded by the numeric format according to a Gaussian function, instead of placing them uniformly on a logarithmic scale as the usual floating-point formats do.
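One way to sketch that placement: put the levels at evenly spaced quantiles of a standard normal, so they are dense where Gaussian-distributed weights cluster. This is the rough idea behind NF4-style formats, not the exact recipe; the helper name is made up.

```python
from statistics import NormalDist

def gaussian_levels(n_levels):
    """Quantization levels at evenly spaced quantiles of N(0, 1)."""
    nd = NormalDist()
    # Probabilities strictly inside (0, 1), one per level.
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    return [nd.inv_cdf(p) for p in probs]

levels = gaussian_levels(16)   # 16 levels for a 4-bit code
```

The levels come out symmetric about zero and tightly packed near it, mirroring where most of the weight mass sits.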
There is a relevant Wikipedia page about minifloats [0]
> The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa.
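As a sketch, that 1-2-1 layout can be decoded with the ordinary IEEE rules: exponent bias 1 (the usual 2^(k-1)-1 for a 2-bit exponent), an implicit leading 1, subnormals when the exponent field is 0, and inf/NaN when it is all ones. The function name is made up.

```python
import math

def decode_fp4(code):
    """Decode a 4-bit minifloat: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0b11:                 # all-ones exponent: inf or NaN
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:                    # subnormal: 0.m * 2^(1 - bias)
        return sign * man * 0.5
    return sign * (1 + man * 0.5) * 2.0 ** (exp - 1)
```

The positive values come out as 0, 0.5, 1.0, 1.5, 2.0, 3.0, inf, NaN, which matches the +/- 3 finite range mentioned upthread.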
> In ancient times, floating point numbers were stored in 32 bits.
This was true only for cheap computers, typically after the mid sixties.
Most of the earliest computers with vacuum tubes used longer floating-point number formats, e.g. 48-bit, 60-bit or even weird sizes like 57-bit.
The 32-bit size has never been acceptable in scientific computing with complex computations where rounding errors accumulate. The early computers with floating-point hardware were oriented to scientific/technical computing, so bigger number sizes were preferred. The computers oriented to business applications usually preferred fixed-point numbers.
The IBM System/360 family definitively imposed the 32-bit single-precision and 64-bit double-precision sizes, where 32-bit is adequate for input and output data and can be sufficient for intermediate values when the input data passes through few computations; otherwise double precision must be used.
A few years after 1980, especially after 1985, the computers with coprocessors like Intel 8087 or Motorola 68881 became the most numerous computers with floating-point hardware, and for them the default FP size was 80-bit.
So the 1990s were long after the time when 32-bit FP numbers were normal. FP32 was revived only by GPUs, for graphic applications where precision matters much less.
As early as 1974, the C programming language made double precision the default FP size, not 32-bit single precision, for the same reason the Intel 8087 later introduced extended precision. Single-precision computations for traditional applications are suitable only for experts, not for ordinary computer users.
While before C the programming languages used single-precision 32-bit numbers as the default size, the recommendations were already to use only double-precision wherever complicated expressions were computed.
I started using computers by punching cards for a mainframe, but that was already at a time when 32-bit FP numbers were not normally used, only 64-bit FP numbers.
The best chance of seeing 32-bit single-precision numbers in use was in the decade from 1965 to 1975, among users of cheap mainframes or of minicomputers without hardware floating-point units, where floating point was emulated in software and emulating double precision was significantly slower.
Before the mid sixties, there were more chances to see 36-bit floating-point numbers as the smallest FP size.
Yeah. I know. I'm not disagreeing with your diagnosis, I'm just trying to gently rib you that your correction is misaimed. It's a joke, ya know?
>Single-precision computations for traditional applications are suitable only for experts, not for ordinary computer users.
Lots of ordinary computer users did compute in single precision! The reason I picked the 1990s as 'ancient' and not 1980 (when the 8087 was taped out) or 1985 (when IEEE754 was finally approved) was because those microprocessors were now in the hands of users who weren't under the supervision of 'experts'. That, along with the lack of fast 64 bit registers + the desire for high throughput at low fidelity led to a lot of 32 bit code!
And, frankly, if you want to get real technical, the ability of non-experts to program in FP in 64 bit is enforced NOT ONLY by the doubled bits but by the implicit ability (absent now in many implementations) to use the 80 bit extended precision format for intermediate calcs. It's the added bits in that format for scratch that let lots of 64 bit programs just work.
Even the latest CPUs have a 2:1 fp64:fp32 performance ratio - plus the effects of 2x the data size in cache and bandwidth use mean you can often get greater than a 2x difference.
If you're in a numeric heavy use case that's a massive difference. It's not some outdated "Ancient Lore" that causes languages that care about performance to default to fp32 :P
> Even the latest CPUs have a 2:1 fp64:fp32 performance ratio
Not completely - for basic operations (and ignoring byte size for things like cache hit ratios and memory bandwidth), if you look at, say, Agner Fog's optimisation PDFs of instruction latency, the basic SSE/AVX latency for add/sub/mult/div (yes, even divides these days) is almost always the same for float and double on the most recent AMD/Intel CPUs (and execution ports can normally do both now).
Where it differs is gather/scatter and some shuffle instructions (larger size to work on), and maths routines like transcendentals - sqrt(), sin(), etc, where the backing algorithms (whether on the processor in some cases or in libm or equivalent) obviously have to do more work (often more iterations of refinement) to calculate the value to greater precision for f64.
> the latency between float and double is almost always the same on the most recent AMD/Intel CPUs
If you are developing for ARM, some systems have hardware support for FP32 but use software emulation for FP64, with noticeable performance difference.
> ... if you look at (say Agner Fog's optimisation PDFs of instruction latency) ...
That.... doesn't seem true? At least for most architectures I looked at?
While it's true that ADDPS and ADDPD have the same latency, using the Zen 4 example at least, the double variant only calculates 4 fp64 values compared to single precision's 8 fp32. Which was my point? If each double-precision instruction processes a smaller number of inputs, it would need lower latency to keep the same operation rate.
And DIV also has significantly lower throughput for fp64 vs fp32 on Zen 4, 5 clk/op vs 3, while also processing half the values?
Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has lower throughput) - but then you're already leaving so much peak flops on the table that I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited you don't need to think about performance" - which has always been the case.
So yes, they do at least have a 2:1 difference in throughput on zen4 - even higher for DIV.
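For concreteness, the lane arithmetic behind that 2:1 figure, assuming a 256-bit AVX2 ymm register:

```python
# With equal per-instruction latency and issue rate, element throughput
# scales with how many lanes fit in a vector register.
REG_BITS = 256                   # AVX2 ymm register width (assumed)
lanes_f32 = REG_BITS // 32       # 8 fp32 lanes per instruction
lanes_f64 = REG_BITS // 64       # 4 fp64 lanes per instruction
ratio = lanes_f32 // lanes_f64   # fp32 processes 2x the elements
```
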
Well, maybe not all, admittedly, and I didn't look at AVX2/512, but it looks like _mm_div_ps and _mm_div_pd are identical for divide, at the 4-wide level, for the basics.
Obviously, the wider you go, the more constrained you are on infrastructure and how many ports there are.
My point was more it's very often the expensive transcendentals where the performance difference is felt between f32 and f64.
This depends largely on your operations. There is lots of performance critical code that doesn't vectorize smoothly, and for those operations, 64 bit is just as fast.
Yes, if you're not FP ALU limited (which is likely the case if not vectorized), or data cache/bandwidth/thermally limited from the increased cost of fp64, then it doesn't matter - but as I said that's true for every performance aspect that "doesn't matter".
That doesn't mean that there are no situations where it does matter today - which is what I feel is implied by calling it "Ancient".
But the "float" typename is generally fp32 - if we assume the "most generically named type" is the "default". Though this is a bit of an inconsistency with C - the type name "double" surely implies it's double the expected baseline while, as you mentioned, constants and much of libm default to 'double'.
The C keywords "float" and "double" are based on the tradition established a decade earlier by IBM System/360 of calling FP32 as "single-precision" and FP64 as "double-precision".
This IBM convention has been inherited by the IBM programming languages FORTRAN IV and PL/I and from these 2 languages it has spread everywhere.
The C language has taken several keywords and operators from IBM PL/I, which was one of the three main inspiration sources for C (which were CPL/BCPL, PL/I and ALGOL 68).
So "float" and "double" are really inherited by C from PL/I.
A feature that is specific to C is that it has changed the default format for constants and for intermediate values to double-precision, instead of the single-precision that was the default in earlier programming languages.
This was done with the intention of protecting naive users from making mistakes, because if you compute with FP32 it is very easy to obtain erroneous results, unless you analyze very carefully the propagation of errors. Except in applications where errors matter very little, e.g. graphics and ML/AI, the use of FP32 is more suitable for experts, while bigger formats are recommended for normal users.
> 9 years ago, I shared this as an April Fools joke here on HN.
That's fun.
> It seems that life is imitating art.
You didn't even beat Wikipedia to the punch. They've had a nice page about minifloats, using 6-8 bit sizes as examples, for about 20 years.
The 4 bit section is newer, but it actually follows IEEE rules. Your joke formats forgot there's an implied 1 bit in the fraction. And how exponents work.
Interesting! I have been using integers or f32 for that. What was the use case specifically? Did you write a software float for it? I remember writing a f16 type for an IC that used that was a pain!
0.0 + x = x
NaN + x = NaN
+1.0 + -1.0 = 0.0
+1.0 + +1.0 = NaN
-1.0 + -1.0 = NaN
-0.0 = 0.0
-(+1.0) = -1.0
-(-1.0) = +1.0
-NaN = NaN
x - y = x + (-y)
NaN * x = NaN
+1.0 * x = x
-1.0 * x = -x
0.0 * 0.0 = 0.0
/0.0 = NaN
/+1.0 = +1.0
/-1.0 = -1.0
/NaN = NaN
x / y = x * (/y)
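The tables above can be sketched directly in software. The two's-complement encoding here (00 = 0.0, 01 = +1.0, 11 = -1.0, spare pattern 10 = NaN) follows the suggestion further down the thread, but the code itself is my assumption, not the commenter's implementation.

```python
# 2-bit "FP2" with values {0.0, +1.0, -1.0, NaN}.
ZERO, ONE, NAN, NEG_ONE = 0b00, 0b01, 0b10, 0b11

_TO_INT = {ZERO: 0, ONE: 1, NEG_ONE: -1}

def _from_int(n):
    # Results outside {-1, 0, +1} overflow to NaN, matching the tables.
    return {0: ZERO, 1: ONE, -1: NEG_ONE}.get(n, NAN)

def fp2_add(a, b):
    if NAN in (a, b):
        return NAN                       # NaN + x = NaN
    return _from_int(_TO_INT[a] + _TO_INT[b])

def fp2_neg(a):
    if a in (NAN, ZERO):
        return a                         # -NaN = NaN, -0.0 = 0.0
    return NEG_ONE if a == ONE else ONE

def fp2_mul(a, b):
    if NAN in (a, b):
        return NAN                       # NaN * x = NaN
    return _from_int(_TO_INT[a] * _TO_INT[b])

def fp2_recip(a):
    return NAN if a in (NAN, ZERO) else a   # /0.0 = NaN, /+-1.0 = +-1.0

def fp2_sub(a, b):
    return fp2_add(a, fp2_neg(b))        # x - y = x + (-y)

def fp2_div(a, b):
    return fp2_mul(a, fp2_recip(b))      # x / y = x * (/y)
```
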
More interestingly, how to implement in logic gates. Addition with a 2's complement full adder and NaN detector. Negation with a 2's complement negation circuit. Reciprocal with a 0.0 detector.
Multiplication with a unique logic circuit (use a Karnaugh map):
I'll use custom notation =? <? ≤≥? ≤? for comparison, to distinguish from = < ≤.
x =? x = True
Otherwise, a =? b = False
NaN ≤≥? NaN = False
Otherwise, a ≤≥? b = a =? b
-1.0 <? 0.0 = True
-1.0 <? +1.0 = True
0.0 <? +1.0 = True
Otherwise, a <? b = False
a >? b = b <? a
a ≤? b = (a <? b | a ≤≥? b)
a ≥? b = (a >? b | a ≤≥? b)
In logic gates: For =?, bitwise equality. For ≤≥?, bitwise equality and a NaN detector. For <?, use:
ab <? cd = a&b&~c | ~a&~b&~c&d
I separate =? from ≤≥?. =? compares value, while ≤≥? compares order. NaN has no ordering, so it compares false. IEEE float only uses ≤≥? and names it ==.
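The <? gate formula can be sanity-checked by brute force against the intended order -1.0 < 0.0 < +1.0, with NaN unordered. The encoding follows the two's-complement suggestion upthread (00 = 0.0, 01 = +1.0, 11 = -1.0, 10 = NaN); a quick sketch:

```python
# Ordered (non-NaN) codes and their values.
VALUE = {0b00: 0.0, 0b01: 1.0, 0b11: -1.0}   # 0b10 (NaN) omitted

def lt_gate(x, y):
    """The thread's formula: ab <? cd = a&b&~c | ~a&~b&~c&d."""
    a, b = (x >> 1) & 1, x & 1
    c, d = (y >> 1) & 1, y & 1
    return bool(a & b & (1 - c) | (1 - a) & (1 - b) & (1 - c) & d)

# Exhaustive check over ordered values.
for x in VALUE:
    for y in VALUE:
        assert lt_gate(x, y) == (VALUE[x] < VALUE[y])

# NaN compares false in both directions.
for n in VALUE:
    assert not lt_gate(0b10, n) and not lt_gate(n, 0b10)
```
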
It's better to first show truth tables, then K-maps, and only then logical formulas.
But the main question is: does this FP2 have any real applications? Maybe it could be useful when only one operand is FP2? Especially for vectorized math.
I'm just having fun. I wrote out the full truth tables and Karnaugh maps on paper, but I trust that you get the idea and can recreate it yourself. (Or, I can write a more detailed blog post, if you'd find that interesting.)
If I had to guess, we could use this for a very compact output of the sign function. [-Inf,0) maps to -1.0, 0 maps to 0.0, (0,Inf] maps to +1.0, and NaN maps to NaN. I don't know what application would need the sign function, though. I haven't needed it yet in my programming experience.
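That mapping is easy to sketch. The 2-bit codes here follow the two's-complement encoding mentioned upthread and are an assumption on my part:

```python
import math

def sign_to_fp2(x):
    """Compress the sign of a float into 2 bits:
    00 = 0.0, 01 = +1.0, 11 = -1.0, 10 = NaN."""
    if math.isnan(x):
        return 0b10
    if x == 0:
        return 0b00
    return 0b01 if x > 0 else 0b11
```
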
There's an "Update:" note for a next post on the NF4 format. As far as I can tell this is neither NVFP4 nor MXFP4, which are commonly used in LLM model files. The thing with those formats is that shared scale information is stored per block of values, so it's not a format for individual numbers but for groups of them. I'd like to know more about these (but not enough to go research them myself).
I too want fewer bits of mantissa in my floating point!
But what I wish is that there had been fp64 encoding with a field for number of significant digits.
strtod() would encode this, fresh out of an instrument reading (serial). It would be passed along. It would be useful EVEN if it weren't updated by arithmetic with other such numbers.
Every day I get a query like "why does the datum have so many decimal digits? You can't possibly be saying that the instrument is that precise!"
Well, it's because of sprintf(buf, "%.16g", x) as the default to CYA.
Also sad is the complaint about "0.56000 ... 01" because someone did sprintf("%.16f").
I can't fix this in one class -- data travels between too many languages and communication buffers.
In short, I wish I had an fp64 double where the last 4 bits were ALWAYS left alone by the CPU.
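A rough software sketch of that wish, since the hardware won't carry the tag: a float subclass that remembers how many significant digits the instrument reading supplied. The class name and the digit-counting heuristic are made up, and unlike the wished-for fp64 field, the tag is lost on arithmetic.

```python
class Measured(float):
    """A float tagged with the significant digits the instrument reported."""
    def __new__(cls, text):
        self = super().__new__(cls, text)
        # Crude significant-digit count: strip sign, exponent, decimal
        # point, and leading zeros from the textual reading.
        mantissa = text.split("e")[0].split("E")[0]
        digits = mantissa.replace("-", "").replace("+", "").replace(".", "")
        self.sig = max(len(digits.lstrip("0")), 1)
        return self

    def __str__(self):
        # Format at the tagged precision instead of a CYA "%.16g".
        return "%.*g" % (self.sig, float(self))
```

So `str(Measured("0.560"))` stays "0.56" instead of sprouting sixteen digits.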
> It would be passed along. It would be useful EVEN if it weren't updated by arithmetic with other such numbers.
It would be useful if you could then pass it to an "about equal" operator, too.
I don't need to know that the alternator is putting out 13.928528V, and sure as hell I know you're not measuring that accurately. It's precise but wrong.
I want an "about equals" thing so I can say "if Valt == 14 alt_ok=true" kind of thing but tag it to be "about 14" not "exactly 14".
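A sketch of that "about equals" using a relative tolerance; the 5% default here is an arbitrary assumption, where in practice it would come from the instrument's stated accuracy:

```python
import math

def about_equal(measured, nominal, rel_tol=0.05):
    """True when measured is within rel_tol of nominal ("about 14",
    not "exactly 14"). Thin wrapper over math.isclose."""
    return math.isclose(measured, nominal, rel_tol=rel_tol)

alt_ok = about_equal(13.928528, 14)   # the alternator reading is "about 14"
```
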
The earliest Cray models (starting with Cray-1 in 1976) had only 64-bit floating-point numbers. 128-bit numbers were a later addition and I do not think that they were implemented in hardware, but only in software. Very few computers, except some from IBM, have implemented FP128 in hardware, while software libraries for quadruple-precision or double-double-precision FP128 are widespread.
The Cray 64-bit format was a slight increase in size over the 60-bit floating-point numbers that had been used in the previous computers designed by Seymour Cray, at CDC.
Before IBM increased the size of a byte to 8 bits, which caused all numeric formats to use sizes that are multiples of 8 bits, the computers with 6-bit bytes typically used floating-point sizes of 60 bits in the high-end models, 48 bits in cheaper models, or 36 bits in the cheapest models.
> In ancient times, floating point numbers were stored in 32 bits.
I thought in ancient times, floating point numbers used to be 80 bit. They lived in a funky mini stack on the coprocessor (x87). Then one day, somebody came along and standardized those 32 and 64 bit floats we still have today.
I was going to reply that just because Intel did something funny doesn't mean that it was the beginning of the story. But it turns out that the release of the 8087 predates the ratification of IEEE floats by 2 years. In addition, the primary numeric designer for the 8087 was apparently Kahan, which means that they were both part of the same design process. Of course, there were other formats predating both of these.
It seems quite wasteful to have two zeros when you only have 4 bits in total.
[0] https://en.wikipedia.org/wiki/Minifloat
I am...very sorry to be the one delivering this news. It was not a pleasant realization for me, either.
> Programmers were grateful for the move from 32-bit floats to 64-bit floats. It doesn’t hurt to have more precision
Someone didn't try it on a GPU...
https://gcc.godbolt.org/z/7155YKTrK
> languages that care about performance to default to fp32
What do you mean by this? In C 1.0 is a double.
> The notation ExMm denotes a format with x exponent bits and y mantissa bits.
Shouldn't that be m mantissa bits (not y) -- i.e. typo here -- or am I misunderstanding something?
It seems that life is imitating art.
https://github.com/sdd/ieee754-rrp
> f16 type for an IC that used that was a pain!

Yes, purely software.
[1] https://tom7.org/nand/
> In ancient times, floating point numbers were stored in 32 bits. Then somewhere along the way 64 bits became standard.
I think Cray doubles were 128 bits, and their singles were 64… which makes it seem like smaller floats are just a continuation of the eternal trend.