Python: The Optimization Ladder

[−] Ralfp 63d ago

    CPython 3.13 went further with an experimental copy-and-patch JIT compiler -- a lightweight JIT that stitches together pre-compiled machine code templates instead of generating code from scratch. It's not a full optimizing JIT like V8's TurboFan or a tracing JIT like PyPy's;

Good news. Python 3.15 adopts PyPy's tracing approach for its JIT, and there are real performance gains now:

https://github.com/python/cpython/issues/139109

https://doesjitgobrrr.com/?goals=5,10

[−] josalhor 63d ago

While this is great, I expected faster CPython to eventually culminate in what YJIT is for Ruby. I'm not sure the current approaches they are trying will get the ecosystem there.

[−] kenjin4096 63d ago

I implemented most of the tracing JIT frontend in Python 3.15, with help from Mark to clean up and fix my code. I also coordinated some of the community JIT optimizer effort in Python 3.15 (note: NOT the code generator/DSL/infra, that's Mark, Diego, Brandt and Savannah). So I think I'm able to answer this.

I can't speak for everyone on the team, but I did try the lazy basic block versioning in YJIT in a fork of CPython. The main problem is that the copy-and-patch backend we currently have in CPython is not too amenable to self-modifying machine code. This makes inter-block jumps/fallthroughs very inefficient. It can be done, it's just a little strange. Also for security reasons, we tried not to have self-modifying code in the original JIT and we're hoping to stick to that. Everything has its tradeoffs---design is hard! It's not too difficult to go from tracing to lazy basic blocks. Conceptually they're somewhat similar, as the original paper points out. The main thing we lack is the compact per-block type information that something like YJIT/Higgs has.

I guess while I'm here I might as well make the distinction:

- Tracing is the JIT frontend (region selection).

- Copy and Patch is the JIT backend (code generation).

We currently use both. PyPy uses meta-tracing: it traces the interpreter itself, whereas CPython's tracing JIT traces the user's code. I did take a look at PyPy's code, and a lot of ideas in the improved JIT are actually imported from PyPy directly. So I have to thank them for their great ideas. I also talk to some of the PyPy devs.

Ending off: the team is extremely lean right now. Only 2 people were generously employed by ARM to work on this full time (thanks a lot to ARM too!). The rest of us are mostly volunteers, or have some bosses that like open source contributions and allow some free time. As for me, I'm unemployed at the moment and this is basically my passion project. I'm just happy the JIT is finally working now after spending 2-3 years of my life on it :). If you go to Savannah's website [1], the JIT is around 100% faster for toy programs like Richards, and even for big programs like tomli parsing, it's 28% faster on macOS AArch64. The JIT is very much a community effort right now.

[1]: https://doesjitgobrrr.com/?goals=5,10

PS: If you want to see how the work has progressed, click "all time" in that website, it's pretty cool to see (lower is faster). I have a blog explaining how we made the JIT faster here https://fidget-spinner.github.io/posts/faster-jit-plan.html.
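
For anyone who wants to check whether the interpreter they are running has the JIT discussed above, here is a minimal sketch. It assumes the private `sys._jit` namespace that recent CPython releases expose (3.14 and later, as far as I know) and the `PYTHON_JIT` environment variable. On older versions or other implementations the lookups simply come back as `None`. The helper name `jit_status` is made up for this example.

```python
import os
import sys


def jit_status():
    """Report what this interpreter says about its JIT, if anything."""
    # sys._jit is a private CPython namespace; it is absent on older
    # versions and on other implementations, hence the getattr guard.
    jit = getattr(sys, "_jit", None)
    status = {
        "implementation": sys.implementation.name,
        "version": ".".join(map(str, sys.version_info[:3])),
        # PYTHON_JIT=1 / PYTHON_JIT=0 toggles the JIT on builds that include it.
        "PYTHON_JIT": os.environ.get("PYTHON_JIT"),
        "available": None,
        "enabled": None,
    }
    if jit is not None:
        status["available"] = jit.is_available()
        status["enabled"] = jit.is_enabled()
    return status


for key, value in jit_status().items():
    print(f"{key}: {value}")
```

A build has to be compiled with JIT support for `PYTHON_JIT=1` to have any effect, so "available: False" is the expected answer on many stock installs.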

[−] vovavili 63d ago

Thank you for your contributions to the Python ecosystem. It definitely is inspiring to see Python, the language to which I owe my career and interest in tech, grow into a performant language year by year. This would not have been possible without individuals like you.

[−] pjmlp 63d ago

Now this is great to know.

[−] __mharrison__ 63d ago

Great writeup.

I've been in the pandas (and now polars) world for the past 15 years. Staying in the sandbox gets most folks good enough performance. (That's why Python is the language of data science and ML).

I generally teach my clients to reach for numba first. Potentially lots of bang for little buck.
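
A minimal sketch of the "reach for numba first" advice, assuming numba is installed. The fallback decorator is only there so the snippet still runs without it, and `sum_of_squares` is a made-up example function, not something from the article.

```python
try:
    from numba import njit  # third-party; compiles the function on first call
except ImportError:
    # Fallback so the sketch still runs without numba: a no-op decorator
    # that supports both the @njit and @njit(...) spellings.
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit
def sum_of_squares(n):
    # A plain Python loop: slow in the interpreter, and the kind of
    # numeric code numba is designed to compile.
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(1000)
print(result)
```

The appeal is that the function body stays ordinary Python; the decorator is the only change.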

One overlooked area in the article is running on GPUs. Some numpy and pandas (and polars) code can get a big speedup by using GPUs (same code with import change).

[−] bloaf 63d ago

Taichi, benchmarked in the article, claims to be able to outperform CUDA at some GPU tasks, although their benchmarks look to be a few years old:

https://github.com/taichi-dev/taichi_benchmark

[−] pjmlp 63d ago

And doesn't account for cuTile, NVIDIA's new API infrastructure that supports writing CUDA directly in Python via a JIT that is based on MLIR.

[−] redgridtactical 63d ago

In practice the ladder has two rungs for me. Write it in Python with numpy/scipy doing the heavy lifting, and if that's not enough, rewrite the hot path in C. The middle steps always felt like they added complexity without fully solving the problem.

The JIT work kenjin4096 describes is really promising though. If the tracing JIT in 3.15 actually sticks, a lot of this ladder just goes away for common workloads.

[−] bee_rider 63d ago

Jax seems quite interesting even from this point of view… numpy has the same problem as blas basically, right? The limited interface. Eventually this leads to heresies like daxpby, and where does the madness stop once you’ve allowed that sort of thing? Better to create some sort of array language.

[−] redgridtactical 62d ago

Jax basically gives you the array language without leaving Python, and the XLA backend means you're not hand-tuning C for the GPU path. The numpy interface limitation is real though and once you need something that doesn't map cleanly to vectorized ops, you're either fighting the abstraction or dropping down anyway.

The daxpby example is a good one. Every time BLAS adds another special-case routine it's basically admitting the interface wasn't general enough. At some point you're just writing C with extra steps.

[−] mathisfun123 63d ago

this is a pointless (valueless) reductive take

[−] seanwilson 63d ago

> The real story is that Python is designed to be maximally dynamic -- you can monkey-patch methods at runtime, replace builtins, change a class's inheritance chain while instances exist -- and that design makes it fundamentally hard to optimize. ...

> 4 bytes of number, 24 bytes of machinery to support dynamism. a + b means: dereference two heap pointers, look up type slots, dispatch to int.__add__, allocate a new PyObject for the result (unless it hits the small-integer cache), update reference counts.

Would Python be a lot less useful without being maximally dynamic everywhere? Are there domains/frameworks/packages that benefit from this where this is a good trade-off?

I can't think of cases in strong statically typed languages where I've wanted something like monkey patching, and when I see monkey patching elsewhere there's often some reasonable alternative or it only needs to be used very rarely.
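
To make the quoted passage concrete, here is a small sketch of the per-object overhead and the runtime mutation it describes. The class names are invented for the example, and the exact byte count depends on the interpreter build, so the snippet prints it rather than assuming it.

```python
import sys

# Size of a small int object on this interpreter. On typical 64-bit
# CPython builds this is well above the 4 bytes a C int would need.
int_size = sys.getsizeof(1)
print("int object size in bytes:", int_size)


class Base:
    def describe(self):
        return "base"


class Other:
    def describe(self):
        return "other"


class Child(Base):
    pass


obj = Child()
seen = [obj.describe()]  # resolved through Child -> Base

# Monkey-patch a method on a live class: existing instances see it at once.
Base.describe = lambda self: "patched"
seen.append(obj.describe())

# Change the inheritance chain while an instance exists.
Child.__bases__ = (Other,)
seen.append(obj.describe())

print(seen)
```

Because any of these mutations can happen between two calls, an optimizer cannot assume that `obj.describe` still means what it meant a moment ago without guarding for it.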

[−] adamzwasserman 63d ago

The dynamism exists to support the object model. That's the actual dependency. Monkey-patching, runtime class mutation, vtable dispatch. These aren't language features people asked for. They're consequences of building everything on mutable objects with identity.

Strip the object model. Keep Python.

You get most of the speed back without touching a compiler, and your code gets easier to read as a side effect.

I built a demo: Dishonest code mutates state behind your back; Honest code takes data in and returns data out. Classes vs pure functions in 11 languages, same calculation. Honest Python beats compiled C++ and Swift on the same problem. Not because Python is fast, but because the object model's pointer-chasing costs more than the Python VM overhead.

Don't take my word for it. It's dockerized and on GitHub. Run it yourself: honestcode.software, hit the Surprise! button.

[−] adamzwasserman 62d ago

Correction. I copied some incorrect values from my test harness. So Honest Python does NOT beat Dishonest Swift.

But it does beat the pants off of JS/TS on V8 which is quite the surprise.

Also in the surprise category is that Honest Java is more than 2x faster than dishonest c++.

[−] bloaf 63d ago

I've always thought the flexibility should allow python to consume things like gRPC proto files or OpenAPI docs and auto-generate the classes/methods at runtime as opposed to using codegen tools. But as far as I know, there aren't any libraries out there actually doing that.
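
A rough sketch of what that could look like, using `type()` to build a client class from a description at runtime. The `SPEC` format, `build_client`, and `make_operation` are all hypothetical and far simpler than a real OpenAPI document or proto file; no request is actually sent.

```python
# Hypothetical, simplified API description; real OpenAPI or proto files are richer.
SPEC = {
    "name": "PetStoreClient",
    "operations": {
        "get_pet": {"method": "GET", "path": "/pets/{pet_id}"},
        "delete_pet": {"method": "DELETE", "path": "/pets/{pet_id}"},
    },
}


def make_operation(op_name, method, path):
    def operation(self, **params):
        # A real client would send an HTTP request here; this sketch only
        # builds the request description so it stays self-contained.
        return {"method": method, "url": self.base_url + path.format(**params)}

    operation.__name__ = op_name
    return operation


def build_client(spec):
    def __init__(self, base_url):
        self.base_url = base_url

    namespace = {"__init__": __init__}
    for op_name, op in spec["operations"].items():
        namespace[op_name] = make_operation(op_name, op["method"], op["path"])
    # type(name, bases, namespace) creates the class at runtime, no codegen step.
    return type(spec["name"], (object,), namespace)


Client = build_client(SPEC)
client = Client("https://example.invalid")
request = client.get_pet(pet_id=7)
print(request)
```

The trade-off raised in the replies still applies: nothing here exists as source that an editor or type checker can inspect.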

[−] haimez 63d ago

Generating code at runtime is often an anti-goal because you can’t easily introspect it. “Build-time” generation gives you that, but people often choose to go further and check the generated code into source control to be able to see the change history.

[−] bloaf 63d ago

But for things like e.g. DAG systems, it would be great to be able to upload a new API definition and have it immediately available instead of having to recompile anything in the backend.

[−] skeledrew 63d ago

But it's a fairly easy build if you want any of that.

[−] NeutralForest 63d ago

There are some use cases for very dynamic code, like ORMs; with descriptors you can add attributes + behavior at runtime and it's quite useful. Anyways, breaking metaprogramming and more dynamic features would mean python 4 and we know how 2 -> 3 went. I also don't think it's where the core developers are going. Also also, there are other things I'd change before going after monkey patching like some scoping rules, mutable defaults in function attributes, better async ergonomics, etc.
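
A toy illustration of the descriptor point, assuming a made-up `Column` field type in the style of an ORM. It is not taken from any real library.

```python
class Column:
    """Toy ORM-style field that validates the type on assignment."""

    def __init__(self, py_type):
        self.py_type = py_type

    def __set_name__(self, owner, name):
        # Called automatically for attributes defined in the class body.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.name)

    def __set__(self, instance, value):
        if not isinstance(value, self.py_type):
            raise TypeError(f"{self.name} expects {self.py_type.__name__}")
        instance.__dict__[self.name] = value


class User:
    name = Column(str)


# Add a column after the class exists. __set_name__ is not invoked for
# attributes attached later, so it is called by hand here.
age = Column(int)
age.__set_name__(User, "age")
User.age = age

user = User()
user.name = "Ada"
user.age = 36

try:
    user.age = "thirty-six"
    rejected = False
except TypeError:
    rejected = True

print(user.name, user.age, rejected)
```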

[−] LtWorf 63d ago

I've used a library that patches the zipfile module to add support for zstd compression in zipfiles.

In python3.14 the support is there, but 2 years ago you could just import this library and it would just work normally.

[−] repple 63d ago

Significant AI smell in this write up. As a result, my current reflex is to immediately stop reading. Not judgement on the actual analysis and human effort which went in. It’s just that the other context is missing.

[−] huseyinkeles 63d ago

The author is from Turkey (where I’m also originally from).

Believe it or not, when you write a blog post in a different language, it really helps to use an LLM, even just to fix your grammar mistakes etc.

I assume that’s most likely what happened here too.

[−] canjobear 63d ago

Here's what gave it away for me

> The remaining difference is noise, not a fundamental language gap. The real Rust advantage isn't raw speed -- it's pipeline ownership.

[−] jb_hn 63d ago

I didn't notice any signs of AI writing until seeing this comment and re-reading (though I did notice it on the second pass).

That said, I think this article demonstrates that focusing on whether or not an article used AI might be focusing on the wrong “problem.” I appreciate being sensitive to the "smell" (the number of low-effort, AI posts flying around these days has made me sensitive too), but personally, I found this article both (1) easy to read and (2) insightful. I think the amount of AI-written content lacking (2) is the problem.

[−] markisus 63d ago

I also seem to be developing an immune response to several slopisms. But the actual content is useful for outlining tradeoffs if you’re needing to make your Python code go faster.

[−] intoXbox 63d ago

Great write up and recognisable performance. For a pipeline with many (~50) build dependencies unfortunately switching interpreter or experimenting with free threading is not an easy route as long as packages are not available (which is completely understandable).

I’m not one of these rewrite in Rust types, but some isolated jobs are just so well suited to full-control systems programming that the Rust delegation is worth the investment imo.

Another part worth investigating for IO bound pipelines is different multiprocessing techniques. We recently got a boost from using ThreadPoolExecutor over standard multiprocessing, plus careful profiling to identify which tasks are left hanging and are best allocated their own worker. The price you pay, though, is shared memory, so no thread safety, which only works if your pipeline can be staggered.
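
A minimal sketch of the ThreadPoolExecutor pattern for I/O-bound work. `fetch` is a stand-in that sleeps instead of doing real I/O, and the worker count is an arbitrary choice for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(task_id):
    # Stand-in for an I/O-bound call (network, disk). Sleeping releases
    # the GIL much as a blocking read would.
    time.sleep(0.05)
    return task_id * 2


tasks = list(range(8))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() preserves input order even though the calls overlap in time.
    results = list(pool.map(fetch, tasks))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Threads share memory, so any mutable state touched by more than one task needs a lock or has to be confined to a single worker.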

[−] rusakov-field 63d ago

Python is perfect as a "glue" language. "Inner loops" that have to run efficiently are not where it shines, and I would write them in C or C++ and patch them with Python for access to the huge library base.

This is the "two language problem". (I would like to hear from people who have used Julia extensively, by the way: it claims to solve this problem, but does it really?)

[−] kristianp 63d ago

            nbody  spectral-norm
    C      2100ms          400ms
    Graal   211ms          212ms
    PyPy     98ms         1065ms

Seeing Graal and PyPy beat the gcc C versions suggests to me there's something wrong with the C version. Perhaps they need a -march=native or there's something else wrong. The C version would be a different implementation in the benchmark game, but usually they are highly optimised.

Edit: looking at [1] the top C version uses x86 intrinsics, perhaps the article's writer had to find a slower implementation to have it running natively on his M4 Pro? It would be good to know which C version he used, there's a few at [1]. The N-body benchmark is one where they specify that the same algorithm must be used for all implementations.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

[−] blt 63d ago

Surprised Python is only 21x slower than C for tree traversal stuff. In my experience that's one of the most painful places to use Python. But maybe that's because I use numpy automatically when simple arrays are involved, and there's no easy path for trees.

[−] pjmlp 63d ago

Kudos for going through all the existing JIT approaches, instead of reaching for rewrite into X straight away.

However if Rust with PyO3 is part of the alternatives, then Boost.Python, cppyy, and pybind11 should also be accounted for, given their use in HPC and HFT integrations.

[−] gregjm 62d ago

> I don't know JAX well enough to explain exactly why it's 3x faster than NumPy on the same matrix multiplications.

JAX is basically a frontend for the XLA compiler, as you note. The secret sauce is two insights - 1) if you have enough control, you can modify the layout of tensor computations and permute them so they don’t have to match that of the input program but have a more favorable memory access pattern; 2) most things are memory bound, so XLA creates fusion kernels that combine many computations together between memory accesses. I don’t know if the Apple BLAS library has fused kernels with GEMM + some output layer, but XLA is capable of writing GEMM fusions and might pick them if they autotune faster on given input/output shapes.

> But I haven't verified that in detail. Might be time to learn.

If you set the environment variable XLA_FLAGS=--xla_dump_to=$DIRECTORY then you’ll find out! There will be a “custom-call” op if it’s dispatching to BLAS, otherwise it will have a “dot” op in the post-optimization XLA HLO for the module. See the docs:

https://openxla.org/xla/hlo_dumps
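
A rough sketch of how that check could be scripted. The flag is `--xla_dump_to`, as spelled in the XLA docs linked above. The helper names and the `*after_optimizations*.txt` filename pattern are assumptions for illustration, so adjust the pattern to whatever files actually appear in the dump directory.

```python
import os
from pathlib import Path


def enable_hlo_dump(directory):
    """Ask XLA to dump HLO text. Must run before jax is imported."""
    flags = os.environ.get("XLA_FLAGS", "")
    os.environ["XLA_FLAGS"] = f"{flags} --xla_dump_to={directory}".strip()


def count_matmul_ops(directory, pattern="*after_optimizations*.txt"):
    """Count 'dot' and 'custom-call' instructions in post-optimization dumps."""
    counts = {"dot": 0, "custom-call": 0}
    for path in Path(directory).glob(pattern):
        for line in path.read_text().splitlines():
            # HLO text writes instructions as "name = shape opcode(operands)",
            # so the opcode is preceded by a space and followed by "(".
            if " dot(" in line:
                counts["dot"] += 1
            if " custom-call(" in line:
                counts["custom-call"] += 1
    return counts
```

Typical use would be to call `enable_hlo_dump("/tmp/xla_dump")`, then import jax and run the matmul, then call `count_matmul_ops("/tmp/xla_dump")`. A plain text search is crude (a `dot` inside a fusion still counts), so treat the numbers as a pointer to which file to read rather than a verdict.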

[−] Mawr 63d ago

Shockingly good article — correct identification of the root cause of performance issues being excessive dynamism and ranking of the solutions based on the value/effort ratio. Excellent taste. Will keep this in my back pocket as a quick Python optimization reference.

It's just somewhat unfortunate that I have to question every number and fact presented since the writing was clearly at least somewhat AI-assisted with the author seemingly not being upfront about that at all.

[−] superlopuh 63d ago

Missing Muna[0][1], I'm curious how it would compare on these benchmarks.

[0]: https://www.muna.ai/ [1]: https://docs.muna.ai/predictors/create

[−] gcanyon 63d ago

People here on HN have in the past suggested that TypeScript is the superior-in-all-ways, just-as-easy/fun-to-code-in language and should replace Python in pretty much all use cases.

Anyone have an opinion on how TS would fare in this comparison?

[−] mwkaufma 63d ago

All the approaches beyond PyPy are to either use a different lang that's superficially similar to python or to write a native extension for python in a different language, which is at odds with the stated premise.

[−] igouy 61d ago

> "The Benchmarks Game problems are pure compute: tight loops, no I/O, no data structures beyond arrays."

iirc reverse-complement reads and writes a GB, fasta and mandelbrot write, regex-redux reads, k-nucleotide reads and uses a hash table.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

[−] Trickery5837 62d ago

It's missing the easiest of the choices: core performance-sensitive code in C, interface it to python with pybind11, build app in python. Small stack, huge gains, best of both worlds.

[−] markisus 63d ago

I wish there were more details on this part.

> Missing @cython.cdivision(True) inserts a zero-division check before every floating-point divide in the inner loop. Millions of branches that are never taken.

I thought never taken branches were essentially free. Does this mean something in the loop is messing with the branch predictor?
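
Some background on what the check is for, as a plain-Python sketch rather than Cython code. By default Cython keeps Python's division semantics even on C types, which means raising on a zero divisor and flooring integer division; `cdivision(True)` opts into C behaviour instead. The snippet only contrasts the two sets of semantics and says nothing about branch prediction cost.

```python
import math

# Python semantics: division by zero raises instead of producing inf/nan.
try:
    1.0 / 0.0
    raised = False
except ZeroDivisionError:
    raised = True

# Python floors integer division and gives the remainder the divisor's sign...
py_quotient, py_remainder = -7 // 2, -7 % 2

# ...while C truncates toward zero; math.trunc and math.fmod mimic that here.
c_quotient = math.trunc(-7 / 2)
c_remainder = math.fmod(-7, 2)

print(raised, (py_quotient, py_remainder), (c_quotient, c_remainder))
```

Preserving the first behaviour in compiled code requires a comparison against zero before each divide, which is the check the quoted sentence refers to.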

[−] adsharma 63d ago

Missing: write static python and transpile to rust pyO3 which is at the top of the ladder.

Some nuance: try transpiling to a garbage collected rust like language with fast compilation until you have millions of users.

Also use a combination of neural and deterministic methods to transpile depending on the complexity.

[−] LarsDu88 63d ago

I love how in an article about making python faster, the fastest option is to simply write Rust, lol

[−] alihawili 63d ago

When dealing with JSON in CPython, I always use msgspec; the performance gains are huge.

[−] superbatfish 61d ago

When I read an article about Python optimizations, I typically expect to have significant objections. But this one was great, actually.

[−] IshKebab 63d ago

Instead of just using a language that isn't dog slow, why not jump through these 5 different hoops? It's much easier!

[−] threethirtytwo 63d ago

>The usual suspects are the GIL, interpretation, and dynamic typing. All three matter, but none of them is the real story. The real story is that Python is designed to be maximally dynamic -- you can monkey-patch methods at runtime, replace builtins, change a class's inheritance chain while instances exist -- and that design makes it fundamentally hard to optimize.

OK, I guess the harder question is: why isn't Python as fast as JavaScript?

[−] retsibsi 63d ago

A personal opinion: I would much prefer to read the rough, human version of this article than this AI-polished version. I'm interested in the content and the author clearly put thought and effort into it, but I'm constantly thrown out of it by the LLM smell. (I'm also a bit mad that -- is now on the em dash treadmill and will soon be unusable.)

I'm not just saying this to vent. I honestly wonder if we could eventually move to a norm where people publish two versions of their writing and allow the reader to choose between them. Even when the original is just a set of notes, I would personally choose to make my own way through them.

[−] jaharios 63d ago

json.loads is something you don't want to use in a loop if you care about performance at all. Simply using orjson can give you a 3x speedup without the need to change anything.
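
A minimal sketch of the swap, assuming orjson is installed. The fallback to the standard library is only there to keep the snippet runnable without it, and the input lines are made up. No timing is claimed here; measure on your own data.

```python
import json

try:
    import orjson  # third-party; its loads() accepts str or bytes
    fast_loads = orjson.loads
except ImportError:
    # Fallback keeps the sketch runnable when orjson is not installed.
    fast_loads = json.loads

lines = ['{"id": %d, "tags": ["a", "b"]}' % i for i in range(1000)]

# Same call shape either way, so swapping parsers is a one-line change.
records = [fast_loads(line) for line in lines]
print(records[3])
```

One difference to watch for on the writing side: `orjson.dumps` returns `bytes`, not `str`, so code that serializes may need a small adjustment.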

[−] kelvinjps10 63d ago

Great post, saved it for when I need to optimize my Python code.

[−] viktorcode 63d ago

I was hoping for Mojo to appear as an optimisation strategy.

[−] zahlman 63d ago

The replacement of emdashes with double hyphens here is almost insulting. A look through the blog history suggests that the author has no issue writing in English normally, and nothing seems really off about the actual findings here (or even the speculation about causes etc.), so I really can't understand the motivation for LLM-generated prose. (The author's usual writing style appears to have some arguable LLM-isms, but they make a lot more sense in context and of course those patterns had to come from somewhere. The overall effect is quite different.)

Edit: it's strange to get downvoted while also getting replies that agree with me and don't seem to object.

(Also, I thought it wasn't supposed to be possible to edit after getting a reply?)

[−] arlattimore 63d ago

What a great article!

[−] skeledrew 63d ago

I must admit that I'm amused by the people who find the writeup useful but are turned off by the AI "smell". And look forward to the day when all valued content reeks of said "smell"; let's see what detractors-for-no-good-reason do then (yes I'm a bit ticked by the attitude).

[−] perching_aix 63d ago

> language slow

> looks inside

> the reference implementation of language is slow

Despite its content, this blogpost also pushes this exact "language slow" thinking in its preamble. I don't think nearly enough people read past introductions for that to be a responsible choice or a good idea.

The only thing worse than this is when Python specifically is outright taught (!) as an "interpreted language", as if an implementation detail like that was somehow a language property. So grating.
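
A small sketch of the distinction: the same source can be asked which implementation is running it, and on CPython the function has already been compiled to bytecode before it executes. The `add` function is just a placeholder.

```python
import dis
import sys


def add(a, b):
    return a + b


# Which implementation is running this code: "cpython", "pypy", and so on.
implementation = sys.implementation.name

# On CPython, functions are compiled to bytecode before being executed;
# dis lists those instructions.
opnames = [instr.opname for instr in dis.get_instructions(add)]

print(implementation)
print(opnames)
```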

Python: The Optimization Ladder (cemrehancavdar.com)

146 comments