14× faster embeddings: how we rebuilt the ONNX path in Manticore

(manticoresearch.com)

47 points | by snikolaev 5 hours ago

3 comments

ducviet00 2 hours ago
Unlike GPUs, CPUs aren't designed for massive parallelism. Because of this, batching inference won't necessarily give you a speed boost here. In fact, it can actually slow the process down.
Instead, I'd recommend exploring CPU-specific AI optimizations. For instance, leveraging AVX512_BF16 instructions could reduce the inference time by 2x or 3x compared to the results in the article. OpenVINO supports this really well on Intel CPUs, and converting an ONNX model to OpenVINO is straightforward.
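Whether batching helps or hurts on a given CPU is easy to measure rather than guess. A minimal sketch of a timing harness follows; `throughput_by_batch_size` and `fake_embed` are hypothetical names for illustration, and `fake_embed` is only a stand-in for a real embedding call (for example an ONNX or OpenVINO session wrapped in a function that takes a list of strings).

```python
import time


def throughput_by_batch_size(embed_fn, texts, batch_sizes, repeats=3):
    """Return {batch_size: best observed texts/second} for embed_fn.

    embed_fn is any callable that takes a list of strings and returns
    one embedding per string.
    """
    results = {}
    for bs in batch_sizes:
        best = 0.0
        for _ in range(repeats):
            start = time.perf_counter()
            # Feed the whole corpus through in chunks of size bs.
            for i in range(0, len(texts), bs):
                embed_fn(texts[i:i + bs])
            elapsed = time.perf_counter() - start
            rate = len(texts) / elapsed if elapsed > 0 else float("inf")
            best = max(best, rate)
        results[bs] = best
    return results


def fake_embed(batch):
    # Hypothetical stand-in for a real model call: one tiny vector per text.
    return [[float(len(text))] for text in batch]


# Compare a few batch sizes on the same corpus.
corpus = ["some text to embed"] * 256
rates = throughput_by_batch_size(fake_embed, corpus, [1, 8, 32])
for batch_size, rate in rates.items():
    print(f"batch={batch_size}: {rate:.0f} texts/s")
```

With a real model plugged in, the batch size with the highest texts/s on the target machine is the one to use; on CPU that may well be 1.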
- properbrew 1 hour ago
  +1 for OpenVINO, we utilise it for our model. It's quite amazing what inference speed you can get from CPUs; most people would assume the model is running on a GPU.
- electroglyph 1 hour ago
  ONNX Runtime has AVX512 CPU kernels too, and openvino uses ONNX internally (and ONNX Runtime supports an OpenVINO backend)
  - ducviet00 29 minutes ago
    > openvino uses ONNX internally
    OpenVINO only uses ONNX to parse the model, not to execute it. It runs computations through its own highly optimized inference engine specifically designed for Intel hardware. It doesn't rely on the ONNX engine at all, and it will even automatically convert eligible model weights to BF16 for you
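To make the BF16 point concrete: bfloat16 keeps the sign, the full 8-bit exponent, and only the top 7 mantissa bits of a float32. The sketch below rounds a value to that precision in pure Python. It illustrates the number format only and is not OpenVINO's code; `to_bf16` is a hypothetical helper name.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to bfloat16 precision and return the result as a Python float."""
    if math.isnan(x) or math.isinf(x):
        return x  # NaN and infinities pass through unchanged
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that will be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep only the upper 16 bits (sign, exponent, 7 mantissa bits).
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Each weight keeps roughly 2 to 3 significant decimal digits.
for w in (1.0, 3.14159, -0.0123):
    print(w, "->", to_bf16(w))
```

The storage is half that of float32 while the exponent range is unchanged, which is why weights can usually be converted without rescaling.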
minimaxir 2 hours ago
We really need a replacement for all-MiniLM-L12-v2 that can create more robust embeddings with the same compute.
You can technically do Q4 quantization for larger embedding models but I am not sure if that plays nice with ONNX.
- electroglyph 1 hour ago
  it's a pain in the ass to do properly.
  what we really need is something like auto-round for ONNX
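For reference, the naive form of Q4 quantization is plain round-to-nearest with one scale per block of weights. The sketch below shows that baseline in pure Python, assuming a symmetric signed 4-bit range of -8..7; `quantize_q4`, `dequantize_q4`, and the block size are illustrative choices, and this is neither auto-round nor any ONNX API. Doing it "properly" means going beyond this, for example by tuning the rounding against calibration data.

```python
def quantize_q4(weights, block_size=32):
    """Quantize a flat list of floats to signed 4-bit codes, one scale per block.

    Returns a list of (scale, codes) tuples, where each code is in [-8, 7].
    """
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to code 7; avoid a zero scale.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_q4(blocks):
    """Reconstruct approximate float weights from (scale, codes) blocks."""
    return [scale * code for scale, codes in blocks for code in codes]


# Round trip: each restored weight is within half a quantization step.
original = [0.7, -0.35, 0.1, 0.0, 0.02, -0.5, 0.33, 0.9]
restored = dequantize_q4(quantize_q4(original, block_size=4))
for w, r in zip(original, restored):
    print(f"{w:+.3f} -> {r:+.3f}")
```

A single outlier in a block inflates that block's scale and costs every other weight precision, which is one reason plain rounding tends to hurt embedding quality.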
electroglyph 3 hours ago
ONNX is my first suggestion to people looking for speed gains on CPU