Video Encoding and Decoding with Vulkan Compute Shaders in FFmpeg (khronos.org)

by y1n0 60 comments 166 points

[−] null-phnix 57d ago
A lot of the confusion in this thread feels like it comes from thinking in terms of web streaming rather than the workloads this post is targeting.

The article is pretty explicit that this is not about "make Twitch more efficient" or squeezing a bit more perf out of H.264. It is about mezzanine and archival formats that are already way beyond what a single CPU, even a decade old workstation CPU, handles comfortably in real time: 4K/6K/8K+ 16‑bit, FFv1-style lossless, ProRes RAW, huge DPX sequences, etc. People cutting multi‑camera timelines of that kind of material are already on the wrong side of the perf cliff and are often forced into very specific hardware or vendors.

What Vulkan compute buys you here is not "GPUs good, CPUs bad", it is the ability to keep the entire codec pipeline resident on the GPU once the bitstream is there, using the same device that is already doing color, compositing and FX, and to do it in a portable way. FFmpeg’s model is also important: all the hairy parts stay in software (parsing, threading, error handling), and only the hot pixel crunching is offloaded. That makes this much more maintainable than the usual fragile vendor API route and keeps a clean fallback path when hardware is not available.
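For anyone who hasn't poked at that side of libavcodec, the plumbing is roughly this (untested sketch; `attach_vulkan_device` is just a name I made up, and exactly how the new compute-shader codecs get selected may differ, but AV_HWDEVICE_TYPE_VULKAN and AV_PIX_FMT_VULKAN are the real identifiers):

    extern "C" {
    #include <libavcodec/avcodec.h>
    #include <libavutil/hwcontext.h>
    }
    
    // Attach a Vulkan hardware device to an already-configured decoder context.
    // Demuxing, parsing, threading and error handling stay in the normal C paths;
    // decoded frames can then come back as AV_PIX_FMT_VULKAN, i.e. already
    // resident on the same GPU that does color, compositing and FX.
    static int attach_vulkan_device(AVCodecContext *dec_ctx)
    {
        AVBufferRef *hw_dev = nullptr;
        int ret = av_hwdevice_ctx_create(&hw_dev, AV_HWDEVICE_TYPE_VULKAN,
                                         nullptr, nullptr, 0);
        if (ret < 0)
            return ret; // no usable Vulkan device: clean fallback to software
        dec_ctx->hw_device_ctx = av_buffer_ref(hw_dev);
        av_buffer_unref(&hw_dev);
        return 0;
    }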

From a practical angle, this is less about winning a benchmark over a good CPU encoder for 4K H.264, and more about changing what is feasible on commodity hardware: e.g., scrubbing multiple streams of 6K/8K ProRes or FFv1 on a consumer GPU instead of needing a fat workstation or dailies transcoded to lighter proxies. For people doing archival work or high end finishing on a budget, that is a real qualitative change, not just an incremental efficiency tweak.

[−] Teknoman117 57d ago
To add to what you said, it’s also nice to be able to keep it within one API that’s platform agnostic when possible.

Sure we’ve had the ability to keep the pipeline on GPU for a while, but it usually required platform-specific API bindings to convert to a platform-specific descriptor (HANDLEs on Windows, IOSurface on macOS, dmabuf on Linux), which you then had to pull into a platform-specific decoder/encoder API (DXGI, WMF, AVFoundation, VAAPI, etc.), and then all of that again in reverse to get the surface back into your 3D API.
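For a sense of scale, just the Linux leg of that round trip (pulling a decoder's dmabuf into Vulkan memory) already looks something like this. Rough sketch: the struct and enum names are the real VK_KHR_external_memory_fd / VK_EXT_external_memory_dma_buf ones, but `device`, `dmabuf_fd`, `size_from_decoder` and `memory_type` are placeholders:

    #include <vulkan/vulkan.h>
    
    // Import a dmabuf exported by e.g. VAAPI into Vulkan device memory.
    VkImportMemoryFdInfoKHR import_info{};
    import_info.sType      = VK_STRUCTURE_TYPE_IMPORT_MEMORY_FD_INFO_KHR;
    import_info.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT;
    import_info.fd         = dmabuf_fd; // fd handed out by the decoder API
    
    VkMemoryAllocateInfo alloc_info{};
    alloc_info.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc_info.pNext           = &import_info;
    alloc_info.allocationSize  = size_from_decoder;
    alloc_info.memoryTypeIndex = memory_type;
    
    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &alloc_info, nullptr, &memory);
    // ...then bind it to a VkImage, and do the mirror image of all this with
    // HANDLEs on Windows or IOSurfaces on macOS.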

This whole thing just makes life easier for everyone.

[−] null-phnix 48d ago
Exactly. The cross-platform descriptor dance is one of those things that's invisible to people outside the pipeline but eats an absurd amount of dev time. You end up writing the same conversion logic three times for three platforms, each with its own failure modes and version quirks, and none of it has anything to do with the actual codec work.

Having Vulkan as the single surface for both the compute and the rendering side means one memory model, one synchronization story, one set of bugs to chase. That alone is worth the effort even before you get to the performance wins.

[−] pandaforce 57d ago
The main targets for this are NLEs like Blender. Performance is a large part of the issue. Most users still just create TIFF files per frame before importing them into a "real editor" like Resolve. Apple may have ASICs for ProRes decoding, and Resolve may be the standard editor that everyone uses.

But this goes beyond what even Apple has, by making it possible to work directly with compressed lossless video on consumer GPUs. You can get hundreds of FPS encoding or decoding 4K 16-bit FFv1 on a 4080, while only reading a few gigabits of video per second, rather than the tens and even hundreds of gigabits that SSDs can't keep up with. No image degradation when passing intermediate copies between CG programs and editing, either.
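Rough napkin math on the bandwidth gap (illustrative numbers, not a benchmark):

    #include <cstdio>
    
    int main() {
        // Raw 4K, 16 bits per channel RGB, at a "hundreds of FPS" scrubbing rate.
        constexpr double width = 3840, height = 2160;
        constexpr double bytes_per_pixel = 3 * 2; // 3 channels x 2 bytes
        constexpr double fps = 200;
        double gbit_per_s = width * height * bytes_per_pixel * fps * 8.0 / 1e9;
        std::printf("uncompressed: %.0f Gbit/s\n", gbit_per_s); // ~80 Gbit/s
        // vs. a few Gbit/s of FFv1 bitstream actually read off the SSD.
        return 0;
    }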

[−] jcelerier 57d ago
Yep! Almost finished implementing support in https://ossia.io which is going to become the first open-source cross-platform real-time visuals software to support live scrubbing of 4K+ ProRes files for VJ use cases, on not that big of a GPU (tested on my laptop's 3060) :)
[−] westurner 57d ago
How to feed MilkDrop music visualizations?

(MilkDrop3, projectm-visualizer/presets-cream-of-the-crop, westurner/vizscan for photosensitive epilepsy)

mapmapteam/mapmap does open source multi-projector mapping. How to integrate e.g. mapmap?

BespokeSynth is a C++ and JUCE based patch bay software modular synth with a "node-based UI" and VST3, LV2, AudioUnit audio plugin support. How to feed BespokeSynth audio and possibly someday video? Pipewire and e.g. Helvum?

[−] jcelerier 57d ago
- MilkDrop: I'd love a PR that adds support for ProjectM :D it would be fairly easy to make a custom plug-in that just blits the texture.

Basic code for this would look like this:

    struct MilkdropIntegration
    {
      halp_meta(name, "ProjectM")
      halp_meta(c_name, "projectm")
      halp_meta(category, "Visuals")
      halp_meta(author, "ProjectM authors")
      halp_meta(description, " :) ")
      halp_meta(uuid, "417534da-3625-404a-b74f-91d003cb64b9")
    
      // By now you know the drill: define inputs, outputs...
      struct
      {    
        struct : halp::lineedit<"Program", "">
        {
          halp_meta(language, "eel2")
        } program;
      } inputs;
    
      struct
      {
        struct
        {
          halp_meta(name, "Out");
          halp::rgba_texture texture;
        } image;
      } outputs;
    
      halp::rgba_texture::uninitialized_bytes bytes;
    
      void operator()()
      {
        if(bytes.empty())    
          bytes = halp::rgba_texture::allocate(800, 600); // or whatever resolution you wanna set
          
        // Fill in bytes with your custom pixel data here
        
        outputs.image.texture.update(bytes.data(), 800, 600);
      }
    };
inside such a template: https://github.com/ossia-templates/score-avnd-simple-templat...

- multi-projector mapping: ossia actually does it directly! It's in git master and will be released in the next version. It also supports a fair number of features that MapMap does not have, such as:

* soft-edge blending

* blend modes

* custom polygons

* a proper HDR passthrough as well as tonemapping, etc.

* Metal, Vulkan, D3D11/12 support (mapmap is opengl-only)

* Spout, Syphon, NDI, soon pipewire video. Mapmap only supports camera input.

* HAP and DXV, both decoded on GPU.

* Smooth grid distortion. Here's mapmap grid distortion: https://streamable.com/1nhwxg vs ossia with sufficiently high subdivisions: https://streamable.com/hmb1jm

* And of course, as mentioned here, hw decoding (for some years already); the new feature adds zero-copy when, for instance, using Vulkan Video together with the Vulkan GPU backend.

* In addition pretty much every YUV pixel format in existence is GPU-decoded (https://github.com/ossia/score/tree/master/src/plugins/score...).

In contrast, MapMap does GStreamer -> Qt; everything, including the YUV -> RGBA conversion, goes through the CPU.

- How to feed BespokeSynth audio and possibly someday video? Pipewire and e.g. Helvum?

yes, pipewire (or JACK, or BlackHole on Windows and macOS). Although ossia also supports VST, VST3, LV2, CLAP, JSFX, and Faust, and comes with many audio effects built in already.

[−] dagmx 57d ago
I don’t understand the spread of thoughts in your post.

The reason to create image sequences is not that you need to send them to other apps, it’s that you preserve quality and safeguard against crashes.

A crash mid video write out can corrupt a lengthy render. With image sequences you only lose the current frame.

People aren’t going to stop using image sequences even if they stayed in the same app.

And I’m not sure why “this goes beyond what Apple has” applies, because they do have hardware support for decoding several compressed codecs (also, I’ll note that ProRes is itself compressed). Other than streaming, when are you going to need that kind of encode performance? Or what other codecs are you expecting will suddenly pop up once ASICs aren’t required?

Also how does this remove degradation when going between apps? Are you envisioning this enables Blender to stream to an NLE without first writing a file to disk?

[−] pandaforce 57d ago

> A crash mid video write out can corrupt a lengthy render. With image sequences you only lose the current frame.

You wouldn't put FFv1 in MP4, the only container incompetent enough for such corruption.

Apple has an interest in people not using codecs they get no fees from. And Apple doesn't have a lossless codec, so they don't offer acceleration for lossless compressed video.

The idea is that when you're working as part of a team and get handed a CG render, you can avoid sending a huge .tar or .zip full of TIFFs which then has to be decompressed, or ProRes, which loses quality, particularly in a linear colorspace like ACEScg.

[−] dagmx 57d ago
I’m curious what kind of teams you’re working in where you’re handing around compressed archives of image sequences? And using TIFF vs EXR (unless you mean purely after compositing)?

Another reason to use image sequences is that it’s easier to re-render just a portion of the sequence. Granted, this can be done with video too, but with higher overhead.

But even then, why does GPU encoding change the fact that you’d send it to another NLE? I just feel like there are a lot of jumps in the thought process here.

[−] jobigoud 57d ago
I thought an industry standard was to use proxy files. The open source editor Shotcut uses them, for example. Create a low-resolution, intra-frame-only version of the file for very fast scrubbing, make your edits on that, and when done the edit list is applied to the full-resolution rushes to produce the output.
[−] hilsdev 57d ago
Often, but not always. Sometimes you’re just working with proxies directly, for audio mixing and the like; VFX workflows and finishing will often be online at full res.

But even so, everybody is making their own proxies all the time. There’s a lot of passing around of ProRes Proxy or another intermediate-quality format, and you still make even lighter proxies locally, so NLEs and workstation apps will still benefit from this.

[−] pandaforce 57d ago
Proxy files have issues when doing coloring, greenscreens, and effects shots. The bit depth, chroma resolution, and primaries/transfer/colorspace get changed. They're basically only really usable for editing. With this, you don't need proxy files at all.
[−] sylware 57d ago
Well, the problem with hardware decoding is that it cannot handle all the variations of data corruption, which results in hardware crashes, sometimes not recoverable with a soft reset of the hardware block.

It is usually more reasonable to work with software decoders for really complex formats, or only to accelerate some heavy parts of the decoding where data corruption is really easy to deal with or benign, or aim for the middle ground: _SIMPLE_ and _VERY CONSERVATIVE_ compute shaders.

Sometimes the software cannot even tell that the hardware has actually 'crashed' and is spitting out nonsense data. It gets even worse: some hardware block hot resets actually do not work and require a power cycle... So a media player able to use hardware decoding must always provide a clear and visible 'user button' to let the user switch to full software decoding.

Then there is the next level of "corruption": some streams out there are "wrong", but that "wrong" decodes fine on some specific decoders and not on others, even though they all follow the same spec.

What a mess.

I hope those compute shaders are not using that abomination of GLSL (or the DX one), i.e. that they are SPIR-V shaders generated from plain and simple C code.

[−] pandaforce 57d ago
These are all gripes you might have with Vulkan Video. Unlike with Vulkan Video, in compute, bounds checking is the norm. Overreading a regular buffer will not result in a GPU hang or crash. If you use pointers it will, but if you use pointers, it's up to you to check whether overreads can happen.

The bitstream reader in FFmpeg for Vulkan Compute codecs is copied from the C code, along with bounds checking. The code which validates whether a block is corrupt or decodable is also taken from the C version. To date, I've never got a GPU hang while using the Compute codecs.
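The shape of that kind of bounds-checked reader is roughly this (illustrative sketch, not the actual FFmpeg C or shader code):

    #include <cstdint>
    #include <cstddef>
    
    struct BitReader {
        const uint8_t *buf;
        size_t size_bits;
        size_t pos_bits = 0;
    
        // A malformed stream yields zeros / garbage pixels instead of an
        // out-of-bounds read, so the worst case is a corrupt frame, not a hang.
        uint32_t get_bits(unsigned n) {
            if (pos_bits + n > size_bits)
                return 0;
            uint32_t v = 0;
            for (unsigned i = 0; i < n; ++i, ++pos_bits)
                v = (v << 1) | ((buf[pos_bits >> 3] >> (7 - (pos_bits & 7))) & 1u);
            return v;
        }
    };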

[−] averne_ 57d ago
I wrote the Vulkan ProRes backend. The bitstream decoder was implemented from scratch, for a number of reasons.

First, the original code was reverse-engineered, before Apple published an SMPTE document describing the bitstream syntax. Second, I tried my best at optimizing the code for GPU hardware. And finally, I wanted to take the learning opportunity :)

And to answer the parent's question, the shaders are written in pure GLSL. For instance, this is the ProRes bitstream decoder in question: https://code.ffmpeg.org/FFmpeg/FFmpeg/src/branch/master/liba...

[−] sylware 57d ago
GLSL: this is the really bad part, as this is a definite no-no.

Should have been a plain and simple C-coded generator of SPIR-V bytecode.

[−] positron26 57d ago

> Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities.

One only needs to look at GPU driven rendering and ray tracing in shaders to deduce that shader cores and memory subsystems these days have become flexible enough to do work besides lock-step uniform parallelism where the only difference was the thread ID.

Nobody strives for random access memory read patterns, but the universal popularity of buffer device address and descriptor arrays can be taken somewhat as proof that these indirections are no longer the friction for GPU architectures that they were ten years ago.
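For readers who haven't used it: "buffer device address" really is just a raw 64-bit GPU pointer you can push straight to a shader. The host side below uses the real Vulkan 1.2 entry points; `device`, `cmd`, `layout` and `bitstream_buffer` are assumed to exist:

    #include <vulkan/vulkan.h>
    
    // Get a raw device pointer for a buffer created with
    // VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT...
    VkBufferDeviceAddressInfo addr_info{};
    addr_info.sType  = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
    addr_info.buffer = bitstream_buffer;
    VkDeviceAddress addr = vkGetBufferDeviceAddress(device, &addr_info);
    
    // ...and hand it to a compute shader via push constants. The shader then
    // dereferences it through GL_EXT_buffer_reference, i.e. exactly the kind of
    // pointer-chasing that used to be considered hostile to GPUs.
    vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT,
                       0, sizeof(addr), &addr);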

At the same time, the languages are no longer as restrictive as they once were. People are recording commands on the GPU. This kind of fiddly serial work is an indication that the ergonomics of CPU programming have less of a relative advantage, and that cuts deeply into the tradeoff costs.

[−] pandaforce 57d ago
Yeah, Vulkan is shedding most of its abstractions. Buffers are no longer needed - just device addresses. Shaders don't need to be baked into a pipeline - you can use shader objects. Even images rarely provide any speedup over buffers, since the texel cache is no longer separate from the memory cache.

GPUs these days have massive caches, often hundreds of megabytes large, on top of an already absurd number of registers. A random read will often load a full cacheline into a register and keep it there, reusing it as needed between invocations.

[−] mort96 57d ago
These GPUs are still big SIMD devices at their core though, no?
[−] positron26 57d ago
SIMT is a distinct model. The ergonomics are wildly different: instead of contracting a long iteration by packing its steps together to make them "wider", you rotate the iteration across cores.

The critical difference is that SIMD and parallel programming are totally different in terms of ergonomics while SIMT is almost exactly the same as parallel programming. You have to design for SIMD and parallelism separately while SIMT and parallelism are essentially the same skill set.

The fan-in / fan-out and iteration rotation are the key skills for SIMT.
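A toy contrast of the two mindsets (hand-written sketch: AVX intrinsics for the SIMD half, plain C++ standing in for one shader invocation on the SIMT half):

    #include <immintrin.h>
    #include <cstddef>
    
    // SIMD: the loop itself is rewritten so each instruction covers 8 lanes.
    void scale_simd(float *x, float s, size_t n) {
        const __m256 vs = _mm256_set1_ps(s);
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vs));
        for (; i < n; ++i) // scalar tail
            x[i] *= s;
    }
    
    // SIMT: the body stays scalar; the iteration is "rotated" across invocations,
    // with the invocation ID playing the role of the loop counter.
    void scale_simt_invocation(float *x, float s, size_t n, size_t invocation_id) {
        if (invocation_id < n)
            x[invocation_id] *= s;
    }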

[−] jokoon 57d ago
I once asked on #ffmpeg@libera if the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.

I don't know much about video compression, does that mean that a codec like h264 is not parallelizable?

[−] returnorthrow 57d ago
I was just reading about this topic last night on Maister’s Graphics Adventures: https://themaister.net/blog/2025/06. They created PyroWave, a GPU-accelerated DWT codec in Vulkan compute shaders for performing very low latency game streaming. Very fascinating read, and builds on similar work from their master’s thesis too.
[−] hirako2000 57d ago
Vulkan compute shaders make GPU acceleration practical for intensive codecs like FFv1, ProRes RAW, and DPX. Previous hybrid GPU + CPU approaches suffered from the round-trip latency; these are fully GPU-resident, hands-off pipelines. A big deal for editing workflows.
[−] kvbev 57d ago
could this provide an AV1 decoder for low-power hardware that doesn't have AV1 GPU-accelerated decoding? For my N4020 laptop.

maybe a raspberry pi 4 too.

[−] reactordev 57d ago
This makes vision models go “Oooooo”. /s

Honestly, everything now is just math. It’s all about how much you can do at once. Vulkan being Vulkan, that ceiling is your hardware. Go ham.

[−] fhn 57d ago
This article assumes all GPUs are on a PCIe bus, but some are part of the CPU, so the distance problem is minimal and offloading to the GPU might still be a net +. Might, because I haven't tested this.
[−] doctorpangloss 57d ago
What is the use case? Okay, ultra-low-latency streaming. That is good. But. If you are sending the frames via some protocol over the network, like WebRTC, it will be touching the CPU anyway. Software encoding of 4K H.264 is real time on a single thread on 65 W, decade-old CPUs, with low latency. The CPU encoders are much better quality and more flexible. So it's very difficult to justify the level of complexity needed for hardware video encoding. Absolutely no need for it for TV streaming, for example. But people who have no need for it keep being obsessed with it.

IMO vendors should stop reinventing hardware video encoding and instead assign the programmer time to making libwebrtc and libvpx better suit their particular use case.