4-bit floating point FP4 (johndcook.com)

by chmaynard 79 comments 98 points

[−] teo_zero 26d ago
When you have so few bits, does it really make sense to invent a meaning for the bit positions? Just use an index into a "palette" of pre-determined numbers.

As a bonus, any operation can be replaced with a lookup into a nxn table.
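
A minimal sketch of the palette idea in Python. The 16 values below are made up for illustration, not any standard format; the point is that encoding is nearest-value search and multiplication is one table lookup.

```python
# A made-up 16-entry palette (not any standard format) plus a precomputed
# multiply table, so arithmetic becomes a single lookup.
PALETTE = [0.0, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0,
           -0.0, -0.0625, -0.125, -0.25, -0.5, -1.0, -2.0, -4.0]

def encode(x):
    """Round x to the nearest palette entry, returning its 4-bit index."""
    return min(range(16), key=lambda i: abs(PALETTE[i] - x))

# 16x16 table: entry [a][b] is the encoded product of palette values a and b.
MUL_LUT = [[encode(PALETTE[a] * PALETTE[b]) for b in range(16)]
           for a in range(16)]

def mul(a, b):
    """Multiply two 4-bit codes with one table lookup."""
    return MUL_LUT[a][b]
```

With 4-bit codes the full product table is only 256 entries, which is the n×n lookup the comment describes.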

[−] petters 25d ago
That's a good idea and it exists: https://www.johndcook.com/blog/2026/04/18/qlora/

It seems quite wasteful to have two zeros when you only have 4 bits in total

[−] saulpw 25d ago
OTOH, it seems quite plausible that the most important numbers to represent are:

   +0
   -0
   +1
   -1
   +inf
   -inf
[−] parsimo2010 25d ago
In standard FP32, the infs are represented as a sign bit, all exponent bits = 1, and all mantissa bits = 0. The NaNs are represented as a sign bit, all exponent bits = 1, and a non-zero mantissa. If you used that interpretation with FP4, you'd get the table below, which restricts the representable range to +/- 3, and it feels less useful to me. If you're using FP4 you're probably optimizing for space and don't want to waste a quarter of your possible bit patterns on things that aren't actually numbers; you'd likely focus your efforts on writing code that doesn't need to represent inf and NaN.

  Bits s exp m  Value
  -------------------
  0000 0  00 0     +0
  0001 0  00 1   +0.5
  0010 0  01 0     +1
  0011 0  01 1   +1.5
  0100 0  10 0     +2
  0101 0  10 1     +3
  0110 0  11 0     +inf
  0111 0  11 1     NaN
  1000 1  00 0     -0
  1001 1  00 1   -0.5
  1010 1  01 0     -1
  1011 1  01 1   -1.5
  1100 1  10 0     -2
  1101 1  10 1     -3
  1110 1  11 0     -inf
  1111 1  11 1     NaN
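
The table can be reproduced with a small decoder, assuming IEEE-style rules for a 1-bit sign, 2-bit exponent, 1-bit mantissa format with exponent bias 1 (a Python sketch, not any shipping format):

```python
def fp4_decode(bits):
    """Decode a 4-bit pattern with IEEE-style rules: 1 sign bit,
    2 exponent bits, 1 mantissa bit, exponent bias 1."""
    sign = -1.0 if bits & 0b1000 else 1.0
    exp = (bits >> 1) & 0b11
    mant = bits & 0b1
    if exp == 0b11:                      # all-ones exponent: inf or NaN
        return sign * float("inf") if mant == 0 else float("nan")
    if exp == 0:                         # subnormal: 0.m * 2^(1 - bias)
        return sign * (mant * 0.5)
    return sign * (1 + mant * 0.5) * 2.0 ** (exp - 1)  # normal: 1.m * 2^(exp - bias)
```

Running it over all 16 bit patterns yields exactly the table above, signed zeros included.
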
[−] saulpw 24d ago
I can see the most important values being:

   ± 0 (infinitesimal)
   ± 10^-2n
   ± 10^-n
   ± 1 (unity)
   ± 10^n
   ± 10^2n
   ± infinity
For fp4, this leaves 2 values. Maybe one of them should be NaN. What should the other one be?
[−] Dwedit 25d ago
Why waste a slot on -0?
[−] adampunk 25d ago
You need it if you want the idea of total ordering over the extended Reals. There's +/- infinity--an affine closure, not projective (point at infinity)--so to make that math work you need to give 0 a sign.
[−] saulpw 25d ago
Because it means "infinitesimal negative" which is distinct from "infinitesimal positive".
[−] Dylan16807 25d ago
That sounds pretty niche. What's a use case where you have less than 8 bits and that distinction is more important than having an extra finite value? I don't think AI is one.
[−] jlokier 25d ago
For neural net gradient descent, automatic differentiation etc., the widely used ReLU function has information-carrying derivatives at +0 and −0 if those are infinitesimals.
[−] Dylan16807 25d ago
Barely any information. After surviving ReLU, that signed zero is probably getting added to another value, and then oops, the information is gone. It sounds a lot worse than properly spaced values.
[−] saulpw 25d ago
sign = most important bit of information
[−] Dylan16807 25d ago
If you were looking at the entire number line, sign would roughly be the most important part.

But you still have all the other numbers carrying sign info. This is only the sign of denormals and that's way less valuable. Outside of particular equations it ends up added to something else and disappearing entirely. It would be way better to cut it and have either half the smallest existing positive value or double the largest existing value as a replacement. Or many other options.

[−] 0-_-0 26d ago
You want to make multiplication cheap, it's not just about compression
[−] mysterydip 26d ago
Wouldn’t multiplication just be an 8-bit lookup table? a*b is just lut[(a << 4) | b]
[−] 0-_-0 24d ago
A 256 element lookup table is much bigger than a simple multiplier
[−] kevmo314 25d ago
Multiplication at this resolution is already implemented via lookup tables.
[−] ineedasername 25d ago
For FP4, yes... sometimes... it depends. But newer Nvidia architectures, e.g. Blackwell with NVFP4, do not; they perform micro-block scaling in the core. On older architectures, low quants like FP4 are also often not done natively, and are instead inflated back to BF16, e.g. with BnB (bitsandbytes).
[−] londons_explore 23d ago
Specifically, you want to choose 16 values, all of which you can multiply an activation value by using circuitry which is as small as possible.
[−] childintime 26d ago
Exactly. And pick them on the e^x curve.
[−] adrian_b 25d ago
As explained in an article linked at the bottom of TFA, the weights of a LLM have a normal (Gaussian) distribution.

Because of that, the best compromise when the weights are quantized to few levels is to place the points encoded by the numeric format used for the weights using a Gaussian function, instead of placing them uniformly on a logarithmic scale, like the usual floating-point formats attempt.
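
The idea can be sketched with the standard library alone: place the levels at equally spaced quantiles of a standard normal distribution instead of equal spacing in value or in exponent. (A sketch only; the real NF4 construction differs in detail, e.g. it forces an exact zero level.)

```python
from statistics import NormalDist

def gaussian_levels(n=16, eps=1/30):
    """Place n levels at equally spaced quantiles of a standard normal,
    then rescale so the extreme levels land at -1 and +1.
    Sketch of the idea only; NF4 proper is constructed differently."""
    nd = NormalDist()
    # inv_cdf(0) and inv_cdf(1) are infinite, so keep the probabilities
    # inside (eps, 1 - eps).
    qs = [eps + i * (1 - 2 * eps) / (n - 1) for i in range(n)]
    levels = [nd.inv_cdf(q) for q in qs]
    scale = max(abs(v) for v in levels)
    return [v / scale for v in levels]
```

The resulting levels cluster near zero, where normally distributed weights concentrate, and thin out toward the tails.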

[−] conaclos 26d ago
There is a relevant Wikipedia page about minifloats [0]

> The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa.

[0] https://en.wikipedia.org/wiki/Minifloat

[−] adrian_b 25d ago

> In ancient times, floating point numbers were stored in 32 bits.

This was true only for cheap computers, typically after the mid sixties.

Most of the earliest computers with vacuum tubes used longer floating-point number formats, e.g. 48-bit, 60-bit or even weird sizes like 57-bit.

The 32-bit size has never been acceptable in scientific computing with complex computations where rounding errors accumulate. The early computers with floating-point hardware were oriented to scientific/technical computing, so bigger number sizes were preferred. The computers oriented to business applications usually preferred fixed-point numbers.

The IBM System/360 family has definitively imposed the 32-bit single-precision and 64-bit double-precision sizes, where 32-bit is adequate for input data and output data and it can be sufficient for intermediate values when the input data passes through few computations, while otherwise double-precision must be used.

[−] adampunk 25d ago
You are totally correct but I need you to recognize that "in ancient times" includes the 1990s.

I am...very sorry to be the one delivering this news. It was not a pleasant realization for me, either.

[−] adrian_b 25d ago
A few years after 1980, especially after 1985, the computers with coprocessors like Intel 8087 or Motorola 68881 became the most numerous computers with floating-point hardware, and for them the default FP size was 80-bit.

So the 1990s were long after the time when 32-bit FP numbers were normal. FP32 was revived only by GPUs, for graphic applications where precision matters much less.

Already after 1974, the C programming language made double-precision the default FP size, not the 32-bit single-precision size, for the same reason why Intel 8087 introduced extended precision. Single-precision computations for traditional applications are suitable only for experts, not for ordinary computer users.

While before C the programming languages used single-precision 32-bit numbers as the default size, the recommendations were already to use only double-precision wherever complicated expressions were computed.

I have started using computers by punching cards for a mainframe, but that was already at a time when 32-bit FP numbers were not normally used, but only 64-bit FP numbers.

The best chances of seeing 32-bit single-precision numbers in use was in the decade from 1965 to 1975, at the users of cheap mainframes or of minicomputers without hardware floating-point units, where floating-point emulation was done in software and emulating double-precision was significantly slower.

Before the mid sixties, there were more chances to see 36-bit floating-point numbers as the smallest FP size.

[−] adampunk 24d ago
Yeah. I know. I'm not disagreeing with your diagnosis, I'm just trying to gently rib you that your correction is misaimed. It's a joke, ya know?

>Single-precision computations for traditional applications are suitable only for experts, not for ordinary computer users.

Lots of ordinary computer users did compute in single precision! The reason I picked the 1990s as 'ancient' and not 1980 (when the 8087 was taped out) or 1985 (when IEEE754 was finally approved) was because those microprocessors were now in the hands of users who weren't under the supervision of 'experts'. That, along with the lack of fast 64 bit registers + the desire for high throughput at low fidelity led to a lot of 32 bit code!

And, frankly, if you want to get real technical, the ability of non-experts to program in FP in 64 bit is enforced NOT ONLY by the doubled bits but by the implicit ability (absent now in many implementations) to use the 80 bit extended precision format for intermediate calcs. It's the added bits in that format for scratch that let lots of 64 bit programs just work.

[−] chrisjj 26d ago

> Programmers were grateful for the move from 32-bit floats to 64-bit floats. It doesn’t hurt to have more precision

Someone didn't try it on GPU...

[−] kimixa 26d ago
Even the latest CPUs have a 2:1 fp64:fp32 performance ratio - plus the effects of 2x the data size in cache and bandwidth use mean you can often get greater than a 2x difference.

If you're in a numeric heavy use case that's a massive difference. It's not some outdated "Ancient Lore" that causes languages that care about performance to default to fp32 :P

[−] pixelesque 26d ago

> Even the latest CPUs have a 2:1 fp64:fp32 performance ratio

Not completely. For basic operations (and ignoring byte size effects on cache hit rate and memory bandwidth), if you look at, say, Agner Fog's optimisation PDFs of instruction latency, the basic SSE/AVX latency for add/sub/mul/div (yes, even divides these days) is almost always the same for float and double on the most recent AMD/Intel CPUs (and execution ports can normally handle both now).

Where it differs is gather/scatter and some shuffle instructions (larger size to work on), and maths routines like transcendentals - sqrt(), sin(), etc, where the backing algorithms (whether on the processor in some cases or in libm or equivalent) obviously have to do more work (often more iterations of refinement) to calculate the value to greater precision for f64.

[−] omoikane 26d ago

> the latency between float and double is almost always the same on the most recent AMD/Intel CPUs

If you are developing for ARM, some systems have hardware support for FP32 but use software emulation for FP64, with noticeable performance difference.

https://gcc.godbolt.org/z/7155YKTrK

[−] kimixa 26d ago

> ... if you look at (say Agner Fog's optimisation PDFs of instruction latency) ...

That.... doesn't seem true? At least for most architectures I looked at?

While it's true that ADDPS and ADDPD have the same latency, using the Zen 4 example at least, the double variant only calculates 4 fp64 values compared to the single-precision's 8 fp32. Which was my point: if each double-precision instruction processes a smaller number of inputs, it needs lower latency to keep the same operation rate.

And DIV also has a significantly lower throughput for fp64 vs fp32 on Zen 4, 5 clk/op vs 3, while also processing half the values?

Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has a lower throughput), but then you're already leaving so much peak flops on the table that I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited you don't need to think about performance", which has always been the case.

So yes, they do at least have a 2:1 difference in throughput on zen4 - even higher for DIV.

[−] pixelesque 26d ago
Well, maybe not all admittedly, and I didn't look at AVX2/512, but it looks like _mm_div_ps and _mm_div_pd are identical for divide, at the 4-wide level for the basics.

Obviously, the wider you go, the more constrained you are on infrastructure and how many ports there are.

My point was more it's very often the expensive transcendentals where the performance difference is felt between f32 and f64.

[−] adgjlsfhk1 26d ago
This depends largely on your operations. There is lots of performance critical code that doesn't vectorize smoothly, and for those operations, 64 bit is just as fast.
[−] kimixa 26d ago
Yes, if you're not FP ALU limited (which is likely the case if not vectorized), or data cache/bandwidth/thermally limited from the increased cost of fp64, then it doesn't matter - but as I said that's true for every performance aspect that "doesn't matter".

That doesn't mean that there are no situations where it does matter today - which is what I feel is implied by calling it "Ancient".

[−] adgjlsfhk1 26d ago

> languages that care about performance to default to fp32

What do you mean by this? In C 1.0 is a double.

[−] kimixa 26d ago
But the "float" typename is generally fp32 - if we assume the "most generically named type" is the "default". Though this is a bit of an inconsistency with C - the type name "double" surely implies it's double the expected baseline while, as you mentioned, constants and much of libm default to 'double'.
[−] adrian_b 25d ago
The C keywords "float" and "double" are based on the tradition established a decade earlier by IBM System/360 of calling FP32 as "single-precision" and FP64 as "double-precision".

This IBM convention has been inherited by the IBM programming languages FORTRAN IV and PL/I and from these 2 languages it has spread everywhere.

The C language has taken several keywords and operators from IBM PL/I, which was one of the three main inspiration sources for C (which were CPL/BCPL, PL/I and ALGOL 68).

So "float" and "double" are really inherited by C from PL/I.

A feature that is specific to C is that it has changed the default format for constants and for intermediate values to double-precision, instead of the single-precision that was the default in earlier programming languages.

This was done with the intention of protecting naive users from making mistakes, because if you compute with FP32 it is very easy to obtain erroneous results, unless you analyze very carefully the propagation of errors. Except in applications where errors matter very little, e.g. graphics and ML/AI, the use of FP32 is more suitable for experts, while bigger formats are recommended for normal users.

[−] Sharlin 26d ago
Yeah, and even on CPU using doubles is almost unheard of in many fields.
[−] Figs 26d ago

> The notation ExMm denotes a format with x exponent bits and y mantissa bits.

Shouldn't that be m mantissa bits (not y) -- i.e. typo here -- or am I misunderstanding something?

[−] recursivecaveat 26d ago
You're correct yeah, 'ExMy'.
[−] sc0ttyd 26d ago
9 years ago, I shared this as an April Fools joke here on HN.

It seems that life is imitating art.

https://github.com/sdd/ieee754-rrp

[−] Dylan16807 26d ago

> 9 years ago, I shared this as an April Fools joke here on HN.

That's fun.

> It seems that life is imitating art.

You didn't even beat wikipedia to the punch. They've had a nice page about minifloats using 6-8 bit sizes as examples for about 20 years.

The 4 bit section is newer, but it actually follows IEEE rules. Your joke formats forgot there's an implied 1 bit in the fraction. And how exponents work.

[−] nomel 26d ago
Lowest I've used is 8 bit floats for time delays, in embedded devices.
[−] the__alchemist 26d ago
Interesting! I have been using integers or f32 for that. What was the use case specifically? Did you write a software float for it? I remember writing an f16 type for an IC that used one, and that was a pain!
[−] nomel 25d ago
Tight memory constraint. I was putting configuration somewhere it shouldn't have been, but it meant we didn't need to buy an extra chip.

Yes, purely software.

[−] mysterydip 26d ago
I especially like your HQQ precision
[−] sc0ttyd 26d ago
I think it is only a matter of time before HQQ / 1FP takes over. It's the logical conclusion. I hope to be using my 96-blade razor by then too
[−] lifthrasiir 26d ago
Another attempt is Tom 7's binary3 format [1].

[1] https://tom7.org/nand/

[−] nivertech 26d ago
FP2 spec:

  00 -> 0.0
  01 -> 1.0
  10 -> Inf
  11 -> NaN
or

  00 -> 0.0
  01 -> 1.0
  10 -> Inf
  11 -> -Inf
[−] 0-_-0 26d ago

  00 -> 0.0
  01 ->-0.0
  10 -> Inf
  11 -> -Inf
[−] amavect 24d ago

  00 ->  0.0
  01 -> +1.0
  10 ->  NaN
  11 -> -1.0
Arithmetic:

  0.0 + x = x
  NaN + x = NaN
  +1.0 + -1.0 = 0.0
  +1.0 + +1.0 = NaN
  -1.0 + -1.0 = NaN
  
  -0.0 = 0.0
  -(+1.0) = -1.0
  -(-1.0) = +1.0
  -NaN = NaN
  
  x - y = x + (-y)
  
  NaN * x = NaN
  +1.0 * x = x
  -1.0 * x = -x
  0.0 * 0.0 = 0.0
  
  /0.0 = NaN
  /+1.0 = +1.0
  /-1.0 = -1.0
  /NaN = NaN
  
  x / y = x * (/y)
More interestingly, how to implement in logic gates. Addition with a 2's complement full adder and NaN detector. Negation with a 2's complement negation circuit. Reciprocal with a 0.0 detector.

Multiplication with a unique logic circuit (use a Karnaugh map):

  (ab * cd) = (a&~b | c&~d | ~a&b&c | a&~c&d)(b & d)
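
The formula checks out by brute force. Here is a Python sketch, taking bits (a, b) from the first operand and (c, d) from the second, with the formula giving the high bit of the result and b & d the low bit:

```python
import math

# FP2 encoding per the comment above:
# 00 -> 0.0, 01 -> +1.0, 10 -> NaN, 11 -> -1.0
VALS = {0b00: 0.0, 0b01: 1.0, 0b10: float("nan"), 0b11: -1.0}

def mul_ref(x, y):
    """Reference multiply: compute with Python floats, re-encode."""
    p = VALS[x] * VALS[y]
    if math.isnan(p):
        return 0b10
    return {0.0: 0b00, 1.0: 0b01, -1.0: 0b11}[p]

def mul_gates(x, y):
    """Gate-level multiply: high bit from the Karnaugh-map formula,
    low bit is simply b & d."""
    a, b = (x >> 1) & 1, x & 1
    c, d = (y >> 1) & 1, y & 1
    hi = (a & ~b | c & ~d | ~a & b & c | a & ~c & d) & 1
    lo = b & d
    return (hi << 1) | lo

# The formula matches the reference on all 16 input pairs.
assert all(mul_gates(x, y) == mul_ref(x, y) for x in range(4) for y in range(4))
```
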
[−] nivertech 24d ago
What about comparison operators?
[−] amavect 22d ago
I'll use custom notation =? ≤≥?
  x =? x = True
  Otherwise, a =? b = False
  
  NaN ≤≥? NaN = False
  Otherwise, a ≤≥? b = a =? b
  
  -1.0 ? b = b ? b | a ≤≥? b)
In logic gates: For =?, bitwise equality. For ≤≥?, bitwise equality and a NaN detector. For
  ab 
I separate =? from ≤≥?. =? compares value, while ≤≥? compares order. NaN has no ordering, so it compares false. IEEE float only uses ≤≥? and names it ==.
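
A sketch of the two predicates as described, in Python: =? is plain bitwise equality of the codes, and ≤≥? is that same equality gated by a NaN detector.

```python
# FP2 bit patterns from the comment above:
# 00 -> 0.0, 01 -> +1.0, 10 -> NaN, 11 -> -1.0
NAN = 0b10

def eq_value(x, y):
    """=? : plain bitwise equality of the 2-bit codes (NaN =? NaN is True)."""
    return x == y

def eq_ordered(x, y):
    """The ordered comparison: bitwise equality gated by a NaN detector,
    so NaN compares False against everything, itself included."""
    return x != NAN and y != NAN and x == y
```

This is the IEEE-style split: eq_ordered behaves like IEEE ==, while eq_value compares representations.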
[−] nivertech 22d ago
It's better to first show truth tables, then K-maps, and only then logical formulas.

But the main question is: does this FP2 have any real applications? Maybe it could be useful when only one operand is FP2? Especially for vectorized math.

[−] amavect 22d ago
I'm just having fun. I wrote out the full truth tables and Karnaugh maps on paper, but I trust that you get the idea and can recreate it yourself. (Or, I can write a more detailed blog post, if you'd find that interesting.)

If I had to guess, we could use this for a very compact output of the sign function. [-Inf,0) maps to -1.0, 0 maps to 0.0, (0,Inf] maps to +1.0, and NaN maps to NaN. I don't know what application would need the sign function, though. I haven't needed it yet in my programming experience.

[−] tim333 25d ago
I guess my first car's four speed box was a bit like a FP2 float. Lever forward/back, right/left -> 3.65, 2.15, 1.42, 1.00 ratios.
[−] karmakaze 26d ago
There's an "Update:" note for a next post on the NF4 format. As far as I can tell this is neither NVFP4 nor MXFP4, which are commonly used in LLM model files. The thing with these formats is that shared scaling information is stored per block of values, so it's not a format for individual numbers but for groups of them. I'd like to know more about these (but not enough to go research them myself).
[−] FarmerPotato 25d ago
I too want fewer bits of mantissa in my floating point!

But what I wish is that there had been fp64 encoding with a field for number of significant digits.

strtod() would encode this, fresh out of an instrument reading (serial). It would be passed along. It would be useful EVEN if it weren't updated by arithmetic with other such numbers.

Every day I get a query like "why does the datum have so many decimal digits? You can't possibly be saying that the instrument is that precise!"

Well, it's because of sprintf(buf, "%.16g", x) as the default to CYA.

Also sad is the complaint about "0.56000 ... 01" because someone did sprintf("%.16f").

I can't fix this in one class -- data travels between too many languages and communication buffers.

In short, I wish I had an fp64 double where the last 4 bits were ALWAYS left alone by the CPU.
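
The wish can at least be approximated in userspace. A sketch (SigFloat is a made-up name; nothing here changes CPU or FP64 behavior, it just carries the digit count alongside the value):

```python
from dataclasses import dataclass

# SigFloat is a hypothetical illustration, not an existing type or format.
@dataclass(frozen=True)
class SigFloat:
    value: float       # the reading
    sig_digits: int    # how many digits the instrument actually supports

    def __str__(self):
        # Format with exactly the significant digits we were told about,
        # instead of a CYA "%.16g".
        return f"{self.value:.{self.sig_digits}g}"
```

So str(SigFloat(13.928528, 3)) prints "13.9" rather than the full 16-digit dump, though as the comment notes this breaks down once data crosses language and buffer boundaries.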

[−] slwvx 25d ago
I've seen more packages that do interval arithmetic than those which keep track of significant digits. For example: https://github.com/JuliaIntervals/IntervalArithmetic.jl
[−] ErroneousBosh 25d ago

> It would be passed along. It would be useful EVEN if it weren't updated by arithmetic with other such numbers.

It would be useful if you could then pass it to an "about equal" operator, too.

I don't need to know that the alternator is putting out 13.928528V, and sure as hell I know you're not measuring that accurately. It's precise but wrong.

I want an "about equals" thing so I can say "if Valt == 14 then alt_ok = true", but tag it as "about 14", not "exactly 14".
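
Something like this is already expressible in software with an explicit tolerance; a sketch using the stdlib's math.isclose (the 5% default here is an arbitrary choice for the alternator example, not a standard):

```python
import math

# An "about equals" as a plain function; the 5% default tolerance is an
# arbitrary choice for illustration.
def about(measured, nominal, rel_tol=0.05):
    """True if measured is within rel_tol (relative) of nominal."""
    return math.isclose(measured, nominal, rel_tol=rel_tol)

alt_ok = about(13.928528, 14)  # "about 14", not "exactly 14"
```
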

[−] bee_rider 25d ago

> In ancient times, floating point numbers were stored in 32 bits. Then somewhere along the way 64 bits became standard.

I think Cray doubles were 128 bits, and their singles were 64… which makes it seem like smaller floats are just a continuation of the eternal trend.

[−] adrian_b 25d ago
The earliest Cray models (starting with Cray-1 in 1976) had only 64-bit floating-point numbers. 128-bit numbers were a later addition and I do not think that they were implemented in hardware, but only in software. Very few computers, except some from IBM, have implemented FP128 in hardware, while software libraries for quadruple-precision or double-double-precision FP128 are widespread.

The Cray 64-bit format was a slight increase in size over the 60-bit floating-point numbers that had been used in the previous computers designed by Seymour Cray, at CDC.

Before IBM increased the size of a byte to 8 bits, which caused all numeric formats to use sizes that are multiple of 8-bits, in the computers with 6-bit bytes the typical floating-point number sizes were either 60-bit in the high-end models or 48-bit in cheaper models or 36-bit in the cheapest models.

[−] ant6n 26d ago

> In ancient times, floating point numbers were stored in 32 bits.

I thought in ancient times, floating point numbers used to be 80 bit. They lived in a funky mini stack on the coprocessor (x87). Then one day, somebody came along and standardized those 32 and 64 bit floats we still have today.

[−] convolvatron 26d ago
I was going to reply that just because Intel did something funny doesn't mean it was the beginning of the story. But it turns out that the release of the 8087 predates the ratification of IEEE floats by 2 years. In addition, the primary numeric designer for the 8087 was apparently Kahan, which means they were both part of the same design process. Of course there were other formats predating both of these.