FFmpeg devs boast of another 100x leap thanks to handwritten assembly code

(tomshardware.com)

389 points | by harambae 1 day ago

15 comments

AaronAPU 1 day ago
When I spent a decade doing SIMD optimizations for HEVC (among other things), it was sort of a joke to compare the assembly versions to plain c. Because you’d get some ridiculous multipliers like 100x. It is pretty misleading, what it really means is it was extremely inefficient to begin with.
The devil is in the details, microbenchmarks are typically calling the same function a million times in a loop and everything gets cached reducing the overhead to sheer cpu cycles.
But that’s not how it’s actually used in the wild. It might be called once in a sea of many many other things.
You can at least go out of your way to create a massive test region of memory to prevent the cache from being so hot, but I doubt they do that.
[-]
- torginus 23 hours ago
  Sorry for the derail, but it sounds like you have a ton of experience with SIMD.
  Have you used ISPC, and what are your thoughts on it?
  I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand, as regular compilers suck at auto-vectorizing, especially as this has never been the case with GPU kernels.
  [-]
  - jandrewrogers 18 hours ago
    The reason you have to optimize SIMD by hand is that compilers can't redesign your data structures and algorithms. This is the level of abstraction at which you often need to be working with SIMD. Compilers are limited to codegen things like auto-vectorizing simple loops, but that isn't where most of the interesting possibilities are with SIMD.
    If you look at heavily-optimized SIMD code side-by-side with the equivalent heavily-optimized scalar code, they are often almost entirely unrelated implementations. That's the part compilers can't do.
    Note that I use SIMD heavily in a lot of domains that aren't just brute-forcing a lot straightforward numerics. If you are just brute-forcing numerics, that auto-vectorization works pretty well. For example, I have a high-performance I/O scheduler that is almost pure AVX-512 and the compiler can't vectorize any of that.
    [-]
    - lenkite 18 hours ago
      Hope you one day can get around to writing some blog posts or maybe even publish some books/courses on this - it sounds interesting!
      [-]
      - dwood_dev 17 hours ago
        You'll find the video in this submission a fun watch if you want to learn a bit about optimization and AVX-512.
        https://news.ycombinator.com/item?id=44176729
  - capyba 21 hours ago
    Personally I’ve never been able to beat gcc or icx autovectorization by using intrinsics; often I’m slower by a factor of 1.5-2x.
    Do you have any wisdom you can share about techniques or references you can point to?
    [-]
    - jesse__ 19 hours ago
      I recently finished a 4 part series about vectorizing perlin noise.. from the very basics up to beating the state-of-the-art by 1.8x
      https://scallywag.software/vim/blog/simd-perlin-noise-i
      [-]
      - capyba 8 hours ago
        Very cool, thank you for sharing!
    - mort96 11 hours ago
      You may not be able to beat the auto-vectorizer for problems which can be auto-vectorized, but you have almost certainly encountered situations which the compiler can't auto-vectorize but where you could've manually written a SIMD implementation which beats the compiler's scalar code.
      [-]
      - capyba 5 hours ago
        Excellent point, that makes a lot of sense. You’re absolutely right that I’ve only tried problems that could be auto vectorized…
  - brandmeyer 17 hours ago
    ISPC suffers from poor scatter and gather support in hardware. The direct result is that it is hard to make programs that scale in complexity without resorting to shenanigans.
    An ideal scatter-read or gather-store instruction should take time proportional to the number of cache lines that it touches. If all of the lane accesses are sequential and cache line aligned it should take the same amount of time as an aligned vector load or store. If the accesses have high cache locality such that only two cache lines are touched, it should cost exactly the same as loading those two cache lines and shuffling the results into place. That isn't what we have on x86-AVX512. They are microcoded with inefficient lane-at-a-time implementations. If you know that there is good locality of reference in the access, then it can be faster to hand-code your own cache line-at-a-time load/shuffle/masked-merge loop than to rely on the hardware. This makes me sad.
    ISPC's varying variables have no way to declare that they are sequential among all lanes. Therefore, without extensive inlining to expose the caller's access pattern, it issues scatters and gathers at the drop of a hat. You might like to write your program with a naive x[y] (x a uniform pointer, y a varying index) in a subroutine, but ISPC's language cannot infer that y is sequential along lanes. So, you have to carefully re-code it to say that y is actually a uniform offset into the array, and write x[y + programIndex]. Error-prone, yet utterly essential for decent performance. I resorted to munging my naming conventions for such indexes, not unlike the Hungarian notation of yesteryear.
    Rewriting critical data structures in SoA format instead of AoS format is non-trivial, and a prerequisite to get decent performance from ISPC. You cannot "just" replace some subroutines with ISPC routines, you need to make major refactorings that touch the rest of the program as well. This is neutral in an ISPC-versus-intrinsics (or even ISPC-versus-GPU) shootout, but it is worth mentioning only to point out that ISPC is also not a silver bullet in this regards, either.
    Non-minor nit: The ISPC math library gives up far too much precision by default in the name of speed. Fortunately, Sleef is not terribly difficult to integrate and use for the 1-ulp max rounding error that I've come to expect from a competent libm.
    Another: The ISPC calling convention adheres rather strictly to the C calling convention... which doesn't provide any callee-saved vector registers, not even for the execution mask. So if you like to decompose your program across multiple compilation units, you will also notice much more register save and restore traffic than you would like or expect.
    I want to like it, I can get some work done in it, and I did get significant performance improvements over scalar code when using it. But the resulting source code and object code are not great. They are merely acceptable.
  - rerdavies 18 hours ago
    Regular compilers are actually extraordinarily good at auto-vectorizing. There are a few oddities that one has to be aware of; but in my experience, if you offer a GCC or Clang compiler an opportunity to auto-vectorize, it will leap on it pretty ruthlessly. I have done a lot of work recently with auto-vectorizing code for high-performance realtime audio processing. And it's pretty extraordinary how good GCC and Clang are. (MSVC is allegedly equally good, but I don't have recent experience with it).
    And they will generally produce better assembler than I can for all but the weird bit-twiddly Intel special-purpose SIMD instructions. I'm actually professionally good at it. But the GCC instruction scheduler seems to be consistently better than I am at instruction scheduling. GCC and Clang actually have detailed models of the execution pipelines of all x64 and aarch64 processors that are used to perform instruction scheduling . Truly an amazing piece of work. So their instruction scheduling is -- as far as I can tell without very expensive Intel and ARM profiling tools -- immaculate across a wide variety of processors and architectures.
    And if it vectorizes on x64, it will probably vectorize equally well on aarch64 ARM Neon as well. And if you want optimal instruction scheduling for an Arm Cortex A72 processor (a pi 4), well, there's a switch for that. And for an A76 (a Pi 5). And for an Intel N100.
    The basics:
    - You need --fastmath -O3 (or the MSVC eqivalent).
    - You need to use and understand the "restrict" keyword (or closest compiler equivalent). If the compiler has to guard against aliasing of input and output arrays, you'll get performance-impaired code.
    - You can reasonably expect any math in a for loop to be vectorized provided there aren't dependencies between iterations of the loop.
    - Operations on arrays and matrices whose size is fixed at compile time are significantly better than operations that operate on matrices and arrays whose size is determined at runtime. (Dealing with the raggedy non-x4 tail end of arrays is easier if the shape is known at compile time).
    - trust the compiler. It will do some pretty extraordinary things. (e.g. generating 7 bit-twiddly SIMD instructions to vectorize atanf(float4). But verify. Make sure it is doing what you expect.
    - the cycle is: compile the C/C++ code; disassemble to make sure it did vectorize properly (and that it doesn't have checks for aliased pointers); profile; repeat until done, or forever, whichever comes first.
    - Even for fairly significant compute, execution is likely to be dominated by memory reads and writes (hopefully cached memory reads and writes). My Neural net code spends about 80% of its time waiting for reads and writes to various cache levels. (And 15% of its time doing properly vectorized atanf operations, some significant portion of which involves memory reads and writes that spill L1 cache). The FFT code spends pretty close to 100% of its time waiting for memory reads and writes (even with pivots that improve cache behavior). The actual optimization effort was determing (at compile time) when to do pivot rounds to improve use of L1 and L2 cache. I would expect this to be generally true. You can do a lot of compute in the time it takes for an L1 cache miss to execute.
    I can't think of why ISPC would do better than GCC,given what GCC actually does. My suspicion is that ISPC was a technology demonstration rather than a real product produced at a time when major compilers had no support at all for SIMD.
    [-]
    - the8472 13 hours ago
      Good at autovectorizing? Eh, if you massage your code in godbolt until it kicks in. But just naively writing code often falls off the happy path. And even when it autovectorizes it can sometimes produce such gems as
      vmovdqu xmm1, xmmword ptr [rdx] vmovdqu xmm0, xmmword ptr [rdx + 16] mov rax, rdi vmovd ecx, xmm1 or byte ptr [rsi], cl vpextrb ecx, xmm1, 1 or byte ptr [rsi + 1], cl vpextrb ecx, xmm1, 2 or byte ptr [rsi + 2], cl vpextrb ecx, xmm1, 3 or byte ptr [rsi + 3], cl vpextrb ecx, xmm1, 4 or byte ptr [rsi + 4], cl vpextrb ecx, xmm1, 5 or byte ptr [rsi + 5], cl ... vpextrb ecx, xmm0, 15 or byte ptr [rsi + 31], cl vmovups ymm0, ymmword ptr [rsi] vmovups ymmword ptr [rdi], ymm0
      instead of a vorps
  - almostgotcaught 21 hours ago
    > Have you used ISPC
    No professional kernel writer uses Auto-vectorization.
    > I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand
    You feel it's ridiculous because you've been sold a myth/lie (abstraction). In reality the details have always mattered.
    [-]
    - CyberDildonics 19 hours ago
      ISPC is a lot different from C++ compiler auto vectorization and it works extremely well. Have you tried it or not? If so where does it actually fall down? It warns you when doing slow stuff like gathers and scatters.
      [-]
      - camel-cdr 7 hours ago
        Is it a lot different from autovec with #pragma omp simd? I played around with ISPC a bit and it didn't seem much different.
        [-]
        CyberDildonics 5 hours ago
        I don't have any experience with that aspect of openmp. When you use ISPC are you using the varying and uniform keywords? You can write something that is almost C but it basically forced to vectorize.
        If you both are vectorizing the same thing there might not be much difference.
        If both are not vectorizing there might not be much difference, but with ISPC you can easily make sure that it does use vectorization and the best instruction set for your CPU.
    - rerdavies 18 hours ago
      I'm hard pressed to think of a kernel function that would benefit from auto-vectorization.
      [-]
      - aa-jv 14 hours ago
        A good example is a vector addition kernel, which is simple, embarrassingly parallel, and well-suited for SIMD:
        void vector_add(float *a, float *b, float *c, int n) { for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } }
        [-]
        Const-me 8 hours ago
        Indeed, automatic vectorizers do such simple things pretty reliably these days.
        However, if you build your software from kernels like that, you leave a lot of performance on the table. For example, each core of my Zen 4 CPU at base frequency can add FP32 numbers with AVX1 or AVX-512 at 268 GB/sec, which results in 806 GB/sec total bandwidth for your kernel with two inputs and 1 output.
        However, dual-channel DDR5 memory in my computer can only deliver 83 GB/sec bandwidth shared across all CPU cores. That’s an order of magnitude difference for a single threaded program, and almost 2 orders of magnitude difference when computing something on the complete CPU.
        Even worse, the difference between compute and memory widens over time. The next generation Zen 5 CPUs can add numbers twice as fast per cycle if using AVX-512.
        For this reason, ideally you want your kernels to do much more work with the numbers loaded from memory. That’s why efficient compute kernels are often way more complicated than the for loop in your example. Sadly, seems modern compilers can only reliably autovectorize very simple loops.
- Cthulhu_ 12 hours ago
  Does FFmpeg and co have "macrobenchmarks" as well? I would imagine software like that would have a diverse set of videos and a bajillion different encoding / decoding / transformation sets that are used to measure performance (time, cpu, file size, quality) over time. But it would need dedicated and consistent hardware to test all of that.
  [-]
  - eru 11 hours ago
    You don't actually need neither dedicated nor consistent hardware, if you are willing to do some statistics.
    Basically, you'd do a block design (https://en.wikipedia.org/wiki/Blocking_(statistics)): on any random hardware you have, you run both versions back to back (or even better, interleaved), and note down the performance.
    The idea that the differences in machines themselves and anything else running on them are noise, and you are trying to design your experiments in such a way that the noise should affect arms of the experiment in the same way---at least statistically.
    Downsides: you have to do more runs and do more statistics to deal with the noise.
    Upside: you can use any old hardware you have access to, even if it's not dedicated. And the numbers are arguably going to be more representative of real conditions, and not just a pristine lab environment.
- izabera 22 hours ago
  ffmpeg is not too different from a microbenchmark, the whole program is basically just: while (read(buf)) write(transform(buf))
  [-]
  - vlovich123 18 hours ago
    > However, the developers were soon to clarify that the 100x claim applies to just a single function, “not the whole of FFmpeg.”
    So OP is correct. The 100x speed up is according to some misleading micro benchmark. The reason is that that transform is a huge amount of code and as OP said this will blow out the code cache while the amount of data you’re processing results in a blowout of the data cache. Net overall improvement might be 1% if even that.
    [-]
    - majewsky 12 hours ago
      > The 100x speed up is according to some misleading micro benchmark.
      Honestly though, nobody who has any idea how anything works would have expected ffmpeg to suddenly unearth a 100x speedup for everything. That's why the devs did not clarify this right away. It's too laughable of an assumption.
  - sgarland 9 hours ago
    /r/restofthefuckingowl
  - fuzztester 21 hours ago
    the devil is in the details (of the holy assembly).
    thus sayeth the lord.
    praise the lord!
- bee_rider 17 hours ago
  Sadly, even beyond the hot cache issue,
  > They would later go on to elaborate that the functionality, which might enjoy a 100% speed boost, depending upon your system, was “an obscure filter.”
  However, to be fair, they communicate this stuff very clearly.
- yieldcrv 22 hours ago
  > what it really means is it was extremely inefficient to begin with
  I care more about the outcome than the underlying semantics, to me thats kind of a given
  [-]
  - dylan604 19 hours ago
    Any average user would be more concerned with the end results. But this is a forum of not average users and more the people specifically that get off on the underlying semantics. Just look at how many people here are so infatuated with AI/LLMs and are so concerned about training data, models, number of tokens, yet completely gloss over the fact that every single one of these products will just make shit up. Those concerned with the underlying semantics seem to not give a damn about the outcome.
Aardwolf 1 day ago
The article somtimes says 100x, other times it says 100% speed boost. E.g. it says "boosts the app’s ‘rangedetect8_avx512’ performance by 100.73%." but the screenshot shows 100.73x.
100x would be a 9900% speed boost, while a 100% speed boost would mean it's 2x as fast.
Which one is it?
[-]
- ethan_smith 23 hours ago
  It's definitely 100x (or 100.73x) as shown in the screenshot, which represents a 9973% speedup - the article text incorrectly uses percentage notation in some places.
  [-]
  - andersmurphy 14 hours ago
    The article probably contains traces of blue.
    [-]
    - VladVladikoff 9 hours ago
      What does this mean?
      [-]
      - kaitak 8 hours ago
        https://news.ycombinator.com/item?id=44627072
- MadnessASAP 1 day ago
  100x to the single function 100% (2x) to the whole filter
- pizlonator 1 day ago
  The ffmpeg folks are claiming 100x not 100%. Article probably has a typo
  [-]
  - k_roy 22 hours ago
    That would be quite the percentage difference with 100x
- torginus 1 day ago
  I'd guess the function operates of 8 bit values judging from the name. If the previous implementation was scalar, a double-pumped AVX512 implementation can process 128 elements at a time, making the 100x speedup plausible.
- ErrorNoBrain 23 hours ago
  [flagged]
blueflow 11 hours ago
Headline: "FFmpeg devs boast of another 100x leap thanks to handwritten assembly code"
Text: "... this boost is only seen in an obscure filter", "... up to ... %"
[expletives omitted]
ivanjermakov 22 hours ago
Related: ffmpeg's guide to writing assembly: https://news.ycombinator.com/item?id=43140614
cpncrunch 22 hours ago
Article is unclear what will actually be affected. It mentions "rangedetect8_avx512" and calls it an obscure function. So, what situations is it actually used for, and what is the real-time improvement in performance for the entire conversion process?
[-]
- nwallin 18 hours ago
  Back in ye olden tymes, video was an analog signal. It was wibbly wobbly waves. You could look at them with an oscilloscope. One of the things that it used to do to make stuff work was to encode control stuff in band. This is sorta like putting editor notes in a text file. There might be something like <someone should look up whether or not this is actually how editor's notes work>. In particular, it used blacks which were blacker than black to signal when it was time to go to the next line, or to go to the next frame.
  In later times, we developed DVDs. DVDs were digital. But they still had to encode data that would ultimately be sent across an analog cable to an analog television that would display the analog signal. So DVDs used colors darker than 16 (out of 255) to denote blacker than black. This digital signal would be decoded to an analog signal and were sent directly onto the wire. So while DVDs are ostensibly 8 bit per channel color, it's more like 7.9 bits per channel. This is also true for BluRay and HDMI.
  In more recent times, we've decided we want that extra 0.1 bits back. Some codecs will encode video that uses the full range of 0-255 as in-band signal.
  The problem is that ... sometimes people do a really bad job of telling the codec whether the signal range is 0-255 or 16-255. And it really does make a different. Sometimes you'll be watching a show or movie or whatever and the dark parts will be all fucked up. There are several reasons this can happen, one of which is because the black level is wrong.
  It looks like this function determines scans frames for whether all the pixels are in the 16-255 or 0-255 range. If a codec can be sure that the pixel values are 16-255, it can saves some bits will encoding. But I could be wrong.
  I do video stuff at my day job, and much to my own personal shame, I do not handle black levels correctly.
  [-]
  - sgarland 9 hours ago
    I’m positive you know this already, but for anyone else, this section [0] of the lddecode project has a wonderful example of all the visible and non-visible portions of an analog video signal.
    The project as a whole is also utterly fascinating, if you find the idea of pulling an analog RF signal from a laser and then doing software ADC interesting.
    [0]: https://github.com/happycube/ld-decode/wiki/ld-analyse#under...
  - zerocrates 15 hours ago
    Maybe you could get some savings from the codec by knowing if the range is full or limited, but probably the more useful thing is to just be able to flag the video correctly so it will play right, or to know that you need to convert it if you want, say, only limited-range output.
    Also as an aside, "limited" is even more limited than 16-255, it's limited on the top end also: max white is 235, and the color components top out at 240.
    [-]
    - nwallin 14 hours ago
      I fully agree with everything you said.
- brigade 20 hours ago
  It's not conversion. Rather, this filter is used for video where you don't know whether the pixels are video or full range, or whether the alpha is premultiplied, and determining that information. Usually so you can tag it correctly in metadata.
  And the function in question is specifically for the color range part.
  [-]
  - cpncrunch 20 hours ago
    It's still unclear from your explanation how it's actually used in practice. I run thousands of ffmpeg conversions every day, so it would be useful to know how/if this is likely to help me.
    Are you saying that it's run once during a conversion as part of the process? Or that it's a specific flag that you give, it then runs this function, and returns output on the console?
    (Either of those would be a one-time affair, so would likely result in close to zero speed improvement in the real world).
    [-]
    - brigade 20 hours ago
      This is a new filter that hasn’t even been committed yet, it only runs if explicitly specified, and would only ever be specified by someone that already knows that they don’t know the characteristics of their video.
      So you wouldn’t ever run this.
      [-]
      - cpncrunch 17 hours ago
        Thank you, exactly what I was looking for.
pavlov 1 day ago
Only for x86 / x86-64 architectures (AVX2 and AVX512).
It’s a bit ironic that for over a decade everybody was on x86 so SIMD optimizations could have a very wide reach in theory, but the extension architectures were pretty terrible (or you couldn’t count on the newer ones being available). And now that you finally can use the new and better x86 SIMD, you can’t depend on x86 ubiquity anymore.
[-]
- Aurornis 1 day ago
  AVX512 is a set of extensions. You can’t even count on an AVX512 CPU implementing all of the AVX512 instructions you want to use, unless you stick to the foundation instructions.
  Modern encoders also have better scaling across threads, though not infinite. I was in an embedded project a few years ago where we spent a lot of time trying to get the SoC’s video encoder working reliably until someone ran ffmpeg and we realized we could just use several of the CPU cores for a better result anyway
tombert 22 hours ago
Actually a bit surprised to hear that assembly is faster than optimized C. I figured that compilers are so good nowadays that any gains from hand-written assembly would be infinitesimal.
Clearly I'm wrong on this; I should probably properly learn assembly at some point...
[-]
- mananaysiempre 22 hours ago
  Looking at the linked patches, you’ll note that the baseline (ff_detect_range_c) [1] is bog-standard scalar C code while the speedup is achieved in the AVX-512 version (ff_detect_rangeb_avx512) [2] of the same computation. FFmpeg devs prefer to write straight assembly using a library of vector-width-agnostic macros they maintain, but at a glance the equivalent code looks to be straightforwardly expressible in C with Intel intrinsics if that’s more your jam. (Granted, that’s essentially assembly except with a register allocator, so the practical difference is limited.) The vectorization is most of the speedup, not the assembly.
  To a first approximation, modern compilers can’t vectorize loops beyond the most trivial (say a dot product), and even that you’ll have to ask for (e.g. gcc -O3, which in other cases is often slower than -O2). So for mathy code like this they can easily be a couple dozen times behind in performance compared to wide vectors (AVX/AVX2 or AVX-512), especially when individual elements are small (like the 8-bit ones here).
  Very tight scalar code, on modern superscalar CPUs... You can outcode a compiler by a meaningful margin, sometimes (my current example is a 40% speedup). But you have to be extremely careful (think dependency chains and execution port loads), and the opportunity does not come often (why are you writing scalar code anyway?..).
  [1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346725.h...
  [2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346726.h...
  [-]
  - kasper93 21 hours ago
    Moreover the baseline _c function is compiled with -march=generic and -fno-tree-vectorize on GCC. Hence it's the best case comparison for handcrafted AVX512 code. And while it's is obviously faster and that's very cool, boasting the 100x may be misinterpreted by outsider readers.
    I was commenting there with some suggested change and you can find more performance comparison [0].
    For example with small adjustment to C and compiling it for AVX512:
```
  after (gcc -ftree-vectorize --march=znver4)
  detect_range_8_c:                                      285.6 ( 1.00x)
  detect_range_8_avx2:                                   256.0 ( 1.12x)
  detect_range_8_avx512:                                 107.6 ( 2.65x)
```
    Also I argued that it may be a little bit misleading to post comparison without stating the compiler and flags used for said comparison [1].
    P.S. There is related work to enable -ftree-vectorize by default [2]
    [0] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346813.h...
    [1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346794.h...
    [2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346439.h...
    [-]
    - etra0 6 hours ago
      I think this comment should be on the top lol.
      I mean, I love ffmpeg, I use it a lot and it's fantastic for my needs, but I've found their public persona often misleading and well, this just confirms my bias.
      > We made a 100x improvement over incredibly unoptimized C by writing heavily specific cpu instructions that the compiler cannot use because we don't allow it!
      2x is still an improvement, but way less outstanding as they want it to publicize it because they used assembly.
  - nwallin 18 hours ago
    > the equivalent code looks to be straightforwardly expressible in C with Intel intrinsics if that’s more your jam. (Granted, that’s essentially assembly except with a register allocator, so the practical difference is limited.) The vectorization is most of the speedup, not the assembly.
    At my day job I have a small pile of code I'm responsible for which is a giant pile of intrinsics. We compile to GCC and MSVC. We have one function that is just a straight function. There are no loops, there is one branch. There is nothing that isn't either a vector intrinsic or an address calculation which I'm pretty sure is simple enough that it can be included in x86's fancy inline memory address calculation thingie. There's basically nothing for the compiler to do except translate vector intrinsics and address calculation from C into assembly.
    The code when run in GCC is approximately twice as fast as MSVC.
    The compiler is also responsible for register allocation, and MSVC is criminally insane at it.
    One of these days I'll have to rewrite this function in assembly, just to get MSVC up to GCC speeds, but frankly I don't want to.
    [-]
    - Const-me 7 hours ago
      I’m not sure it’s register allocation. VC++ is indeed less than ideal, but 2x performance difference is IMO too much to explain by register allocation alone.
      First check that you’re passing correct flags to VC++ compiler and linker: optimized release configuration, correct /arch switch, and ideally LTCG. BTW if you’re using cmake build system it’s rather hard to do, cmake support of VC++ compiler is not great.
      Another thing, VC++ doesn’t like emitting giant functions. A possible reason for 2x difference VC++ failed to inline stuff, and your function has calls to other functions instead of a single large one. Note the compiler supports __forceinline keyword to work around that.
- mafuy 21 hours ago
  If you ever dabble more closely in low level optimization, you will find the first instance of the C compile having a brain fart within less than an hour.
  Random example: https://stackoverflow.com/questions/71343461/how-does-gcc-no...
  The code in question was called quadrillions of times, so this actually mattered.
- jesse__ 19 hours ago
  It's extremely easy to beat the compiler by dropping down to SIMD intrinsics. I recently wrote a 4 part .. guide? ..
  https://scallywag.software/vim/blog/simd-perlin-noise-i
- brigade 21 hours ago
  It's AVX512 that makes the gains, not assembly. This kernel is simple enough that it wouldn't be measurably faster than C with AVX512 intrinsics.
  And it's 100x because a) min/max have single instructions in SIMD vs cmp+cmov in scalar and b) it's operating in u8 precision so each AVX512 instruction does 64x min/max. So unlike the unoptimized scalar that has a throughput under 1 byte per cycle, the AVX512 version can saturate L1 and L2 bandwidth. (128B and 64B per cycle on Zen 5.)
  But, this kernel is operating on an entire frame; if you have to go to L3 because it's more than a megapixel then the gain should halve (depending on CPU, but assuming Zen 5), and the gain decreases even more if the frame isn't resident in L3.
  [-]
  - saati 20 hours ago
    The AVX2 version was still 64x faster than the C one, so AVX-512 is just 50% improvement over that. Hand vectorized assembly is very much the key to the gains.
    [-]
    - brigade 19 hours ago
      The only material difference AVX2 makes is that it can't saturate L1 bandwidth. Which would imply that 100x for AVX-512 is only for frames that fit within L1.
      And... yep, the benchmark on 256x16 frames. [1]
      [1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346729.h...
- MobiusHorizons 20 hours ago
  Almost all performance critical pieces of c/c++ libraries (including things as seemingly mundane as strlen) use specialized hand written assembly. Compilers are good enough for most people most of the time, but that’s only because most people aren’t writing software that is worth optimizing to this level from a financial perspective.
- mhh__ 22 hours ago
  Compilers are extremely good considering the amount of crap they have to churn through but they have zero information (by default) about how the program is going to be used so it's not hard to beat them.
  [-]
  - haiku2077 22 hours ago
    If anyone is curious to learn more, look up "profile-guided optimization" which observes the running program and feeds that information back into the compiler
- asveikau 19 hours ago
  It's SIMD specifically. SIMD is crazy fast and compilers are still not good at generating it.
  [-]
  - globular-toast 15 hours ago
    There are other things too like using the carry bit directly with ADC instead of using "tricks" to check for overflow before/after it happens for example.
jabjoe 16 hours ago
Interesting. The one time I ended up writing assembly was because of SIMD. I was speaking about it recently and had forgotten it was because SIMD instructions, so nice to be reminded. However, that one time, I also ended up finding the right syntactic sugar, to get the compiler to do it right. If I remember right it was all down to aliasing. I had to convince the compiler the data wasn't going to be accessed any other place. It wasn't working that out itself, so it didn't know it could use the SIMD instruction I was. Once I found this and the right non-standard extra key words, the compiler would do it right. So I ended removing the assembly I had written.
jauntywundrkind 22 hours ago
Kind of reminds me of Sound Open Firmware (SOF), which can compile with e8ther unoptimized gcc, or using the proprietary Cadence XCC compiler that can can use the Xtensa HiFi SIMD intrinsics.
https://thesofproject.github.io/latest/introduction/index.ht...
Ygg2 11 hours ago
To be fair to the dev he said he sped up one function 100x.
That doesn't imply he speed up the program. Ironically speeding up parts of code may decrease overall performance due to resource contention.
As the saying goes. There are lies, damned lies, statistics, and then benchmarking.
bravesoul2 13 hours ago
100x *
* not 100x
shmerl 1 day ago
Still waiting for Pipewire + xdg desktop portal screen / window capture support in ffmpeg CLI. It's been dragging feet forever with it.
486sx33 20 hours ago
[dead]
askvictor 1 day ago
[flagged]
[-]
- Arubis 23 hours ago
  This intrinsically feels like the opposite of a good use case for an LLM for code gen. This isn’t boilerplate code by any means, nor would established common patterns be helpful. A lot of what ffmpeg devs are doing at the assembly level is downright novel.
- pizlonator 1 day ago
  The hardest part of optimizations like this is verifying that they are correct.
  We don’t have a reliable general purpose was of verifying if any code transformation is correct.
  LLMs definitely can’t do this (they will lie and say that something is correct even if it isn’t).
  [-]
  - viraptor 1 day ago
    But we do! For llvm there's https://github.com/AliveToolkit/alive2 There are papers like https://people.cs.rutgers.edu/~sn349/papers/cgo19-casmverify... There's https://github.com/google/souper There's https://cr.yp.to/papers/symexemu-20250505.pdf And probably other things I'm not aware of. If you're limiting the scope to a few blocks at a time, symbolic execution will do fine.
    [-]
    - pizlonator 23 hours ago
      > limiting the scope to a few blocks
      Yeah so like that doesn’t scale.
      The interesting optimizations involve reasoning across thousands of blocks
      And my point is there is no reliable general purpose solution here. „Only works for a few blocks at a time” is not reliable. It’s not general purpose
      [-]
      - viraptor 15 hours ago
        No optimisation currently done in production compilers (that I know of) work on more than a few blocks. I haven't seen anyone claim they're not general purpose or don't scale yet, or are not reliable.
        Meanwhile the patch we're discussing is 2 functions, 3 blocks each. So in this context we're not talking about anything crazy.
        [-]
        pizlonator 8 hours ago
        Examples of optimization passes in llvm that reason across arbitrary numbers of blocks:
        - SROA
        - GVN
        - regalloc
        And those are just the heaviest hitters. In fact most optimization passes are capable of performing changes that impact many blocks at once.
        [-]
        viraptor 4 hours ago
        Eh, kinda? I mean regalloc is not an optimisation pass. And SROA/GVM go through whole functions, but they don't really optimise across blocks. They do local changes only with information from previous blocks available. I meant that I don't know of anything that makes modifications across multiple blocks as one complex change. Even licm pokes one thing at a time from one block to a different one. There's no complex "let's replace this kind of cross-block pattern with a different equivalent code that changes multiple blocks at the same time".
        [-]
        pizlonator 3 hours ago
        SROA absolutely does a coordinated modification across an arbitrary number of blocks whenever it kills an alloca.
        GVN has cases where it’ll also make a coordinated change across multiple blocks
        And those are just two passes. I’m not mentioning the many other passes that have this property just for the sake of brevity.
- smj-edison 1 day ago
  I'd be worried about compile times, lol. Final binaries are quite often tens to hundreds of megabytes, pretty sure an LLM processes tokens much slower than a compiler completes passes.
  EDIT: another thought: non-deterministic compilation would also be an issue unless you were tracking the input seed, and it would still cause spooky action at a distance unless you had some sort of recursive seed. Compilers are supposed to be reliable and deterministic, though to be fair advanced optimizations can be surprising because of the various heuristics.
  [-]
  - viraptor 23 hours ago
    There's no reason to run the optimisation discovery at compile time. Anything that changes the structure can be run to change the source ahead of time. Anything that doesn't can be generalised into a typical optimisation step in the existing compiler pipeline. Same applies to Souper for example - you really don't want everyone to run it.
    [-]
    - smj-edison 22 hours ago
      I'm not quite understanding your comment, are you saying that ANNs are only useful for tuning compiler heuristics?
      [-]
      - viraptor 21 hours ago
        That too. But mainly for transforming the source ahead of time to be more optional. If there's some low level, local optimisation, that can be found once and implemented as a stable, repeatable idea in the compiler code instead.
- hashishen 23 hours ago
  It doesn't matter. This is inherently better because the dev knows exactly what is being done. Llms could cripple entire systems with assembly access
  [-]
  - gronglo 23 hours ago
    You could run it in a loop, asking it to improve the code each time. I know what the ffmpeg devs have done is impressive, but I would be curious to know if something like Claude 4 Opus could make any improvements.
    [-]
    - eukara 22 hours ago
      I think if it was easy for them to improve critical projects like ffmpeg, we'd have seen some patches that mattered already. The only activity I've seen is LLMs being used to farm sec-ops bounties which get rejected because of poor quality.
      https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...
    - minimaxir 22 hours ago
      That can work with inefficient languages like Python, but not raw Assembly.
      [-]
      - saagarjha 21 hours ago
        Sure they can, it’s just that verifying the outputs is difficult.
- viraptor 1 day ago
  https://arxiv.org/html/2505.11480v1 we're getting there. This is for general purpose code, which is going to be easier than heavy SIMD where you have to watch out for very specific timing, pipelines and architecture details. But it's a step in that direction.
- LtWorf 1 day ago
  > I wonder how many optimisations like this could be created by LLMs
  Zero. There's no huge corpus of stackoverflow questions on highly specific assembly optimisations so…
  [-]
  - astrange 1 day ago
    You can run an agent in a loop, but for something this small you can already use a SAT solver or superoptimizer if you want to get out of the business of thinking about things yourself.
    I've never seen anyone actually do it, mostly because modeling the problem is more work than just doing it.
    [-]
    - ksclk 1 day ago
      > you can already use a SAT solver
      Could you elaborate please? How would you approach this problem, using a SAT solver? All I know is that a SAT solver tells you whether a certain formula of ANDs and ORs is true. I don't know how it could be useful in this case.
      [-]
      - dragontamer 1 day ago
        Pretty much all instructions at the assembly level are sequences of AND/OR/XOR operations.
        SAT solvers can prove that some (shorter) sequences are equivalent to other (longer) sequences. But it takes a brute force search.
        IIRC, these super optimizing SAT solvers can see patterns and pick 'Multiply' instructions as part of their search. So it's more than traditional SAT. But it's still... At the end of the day.... A SAT equivalence problem.
        [-]
        agent327 23 hours ago
        A short look at any compiled code on godbolt will very quickly inform you that pretty much all instructions at the assembly level are, in fact, NOT sequences of AND/OR/XOR operations.
        [-]
        dragontamer 23 hours ago
        All instructions are implemented with logic gates. In fact. All instructions today are likely NAND gates.
        Have you ever seen a WallaceTree multiplier? A good sequence that shows how XOR and AND gates can implement multiply.
        Now, if multiply + XOR gets the new function you want, it's likely better than whatever the original compiler output.
        [-]
        agent327 15 hours ago
        That was not the claim. The claim was that assembler was made up out of sequences of OR/AND/XOR, and that claim is demonstrably false.
        [-]
        astrange 13 hours ago
        "Made up of" isn't helpful. All assembly languages are equivalent to a sequence of bit operations. (…if you ignore side effects and memory operations.)
        In fact there's several different single operations you can build them all out of: https://en.wikipedia.org/wiki/One-instruction_set_computer#I...
        So you take your assembly instructions, write a sufficiently good model of assembly instructions<>bit operations, write a cost model (byte size of the assembly works as a cheap one), and then search for assembly instructions that perform the equivalent operation and minimize the cost model.
        Like here: https://theory.stanford.edu/~aiken/publications/papers/asplo...
        viraptor 9 hours ago
        You're missing forest for the trees and lacking information to discuss this really. Check out this paper and similar ones if you want to learn about this area: https://arxiv.org/pdf/1711.04422
        dragontamer 6 hours ago
        What do you think implements assembly language?
        Answer: Verilog or VHDL. And these all synthesize down to AND/OR/XOR gates and eventually converted into NAND gates only.
        Every assembly language statement is either data movement, or logic, or some combination of the two.
        -------
        We are talking about SAT solvers and superoptimizers. Are you at all familiar with this domain? Or have you even done a basic search on what the subject matter is?
        saagarjha 21 hours ago
        Computers are not programmed on the level of logic gates. If you want to do that, design an FPGA.
        [-]
        dragontamer 20 hours ago
        Superoptimizers take this stuff into consideration.
        Which is what the parent post was talking about, the rare superoptimizer.
        LtWorf 7 hours ago
        If you want something optimised… "equivalent" isn't going to do it.
        viraptor 23 hours ago
        You're missing the point. All instructions can be simplified to short integer operations, then all integer operations are just networks of gates, then all gates can be replaced with AND/OR/NOT, or even just NAND. That's why you can SAT solve program equivalence. See SMT2 programs using BV theory for example.
        Also of course all instructions are MOV anyway. https://github.com/xoreaxeaxeax/movfuscator
        [-]
        agent327 15 hours ago
        I seem to remember that program equivalence is an NP-hard problem. I very much doubt that you can solve it by reducing it to logic gates.
        [-]
        viraptor 13 hours ago
        NP-hard is just a complexity class, not a minimal threshold. SAT solving is kind of a guided optimistic bruteforcing. If the example is small enough, you can still solve it very quickly. Same as you can solve travelling salesman for a small number of cities. We solve small cases of NP-hard things all over the place.
        [-]
        viraptor 9 hours ago
        (not to confirm it's an np-hard problem - it may not even be decidable generally, but in practice yes, you can check it that way and SMT2 theories provide the right ops)
        dragontamer 9 hours ago
        That's why superoptimizers work on short sequences of code and take a long time doing so.
        valkon 22 hours ago
        [dead]
        peaboff 3 hours ago
        What you're saying is true. Yes you can grind away at generating sequences of instructions, SAT solve equivalence and benchmark, but of course, you would sooner see all black holes in the observable universe evaporate before you find an instruction sequence that is both correct AND 100x faster.
        dataangel 22 hours ago
        You're a bit naive about the complexity. Commonly longer sequences are actually faster, not just because instructions vary in their speed, but also because the presence of earlier instructions that don't feed results into later instructions still affect their performance. Different instructions consume different CPU resources and can contend for them (e.g. the CPU can stall even though all the inputs needed for a calculation are ready just because you've done too many of that operation recently). And then keep in mind when I say "earlier instructions" I don't mean earlier in the textual list, I mean in the history of instructions actually executed; you can reach the same instruction arriving from many different paths!
        LtWorf 7 hours ago
        I see a guy who has never seen assembly explain assembly to people who have written assembly and written compilers and optimisations…
  - gametorch 23 hours ago
    There are literally textbooks on optimization. With tons of examples. I'm sure there are models out there trained on them.
    [-]
    - wk_end 23 hours ago
      There are literally textbooks on computational theory, with tons of example proofs. I'm sure there are models trained on them. Why hasn't ChatGPT produced a valid P vs. NP proof yet?
      [-]
      - gametorch 22 hours ago
        I'm just countering their claim that such code is not in the training data.
        You're employing a logical fallacy to argue about something else.
    - asadotzler 19 hours ago
      It seems unlikely that a handful of text books has the volume of examples needed to make LLMs any good at this.
- hansvm 22 hours ago
  If we're considering current-gen LLMs, approximately zero. They're bad at this sort of thing even with a human in the loop.