Available on winget

Install llama.cpp

LLM inference in C/C++

Install with winget
winget install --id ggml.llamacpp
Upgrade
winget upgrade --id ggml.llamacpp
Uninstall
winget uninstall --id ggml.llamacpp

What's new in b9873

llama : add guard for K/V rotation input when buffer is unallocated (#25215)

llm_graph_input_attn_kv::set_input and llm_graph_input_attn_kv_iswa::set_input call set_input_k_rot / set_input_v_rot whenever the rotation tensor pointer is non-null, but the tensor's buffer can be unallocated (NULL) when a graph only stores K/V without attending -- e.g. DFlash speculative decoding's KV-injection pass. set_input_k_rot then calls ggml_backend_buffer_is_host() on a NULL buffer and aborts with GGML_ASSERT(buffer).

Guard the four k_rot/v_rot inputs with the same "&& ->buffer" check that the adjacent kq_mask inputs already use in these two functions. When the buffer is unallocated there is no data to upload, so skipping is correct.

Fixes #25191

Signed-off-by: liminfei-amd 91481003+liminfei-amd@users.noreply.github.com

macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
- macOS Intel (x64)
- iOS XCFramework

Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32)
- Ubuntu x64 (SYCL FP16)

Android:
- Android arm64 (CPU)

Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows arm64 (OpenCL Adreno)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.3 DLLs
- Windows x64 (Vulkan)
- Windows x64 (OpenVINO)
- Windows x64 (SYCL)
- Windows x64 (HIP)

openEuler:
- DISABLED
- openEuler x86 (310p)
- openEuler x86 (910b, ACL Graph)
- openEuler aarch64 (310p)
- openEuler aarch64 (910b, ACL Graph)

UI: -...
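The guard described in the b9873 note can be illustrated with a minimal, self-contained sketch. It is not the llama.cpp source: Buffer, Tensor, upload_rotation and set_rotation_input are hypothetical stand-ins, chosen only to show why checking both the tensor pointer and its buffer avoids the assertion on an unallocated buffer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the ggml types.
struct Buffer {
    bool is_host;
};

struct Tensor {
    Buffer * buffer = nullptr;  // nullptr models an unallocated buffer
    std::vector<float> data;    // models the uploaded rotation data
};

// Models an upload helper that requires an allocated buffer, similar in
// spirit to the GGML_ASSERT(buffer) failure described in the note.
inline void upload_rotation(Tensor & t, const std::vector<float> & src) {
    assert(t.buffer != nullptr && "buffer must be allocated");
    t.data = src;
}

// Guarded caller: upload only when the tensor exists AND its buffer is
// allocated. With no buffer there is nothing to upload, so skipping is safe.
// Returns true if an upload was performed.
inline bool set_rotation_input(Tensor * t, const std::vector<float> & src) {
    if (t && t->buffer) {
        upload_rotation(*t, src);
        return true;
    }
    return false;
}
```

Calling set_rotation_input on a tensor whose buffer is still nullptr returns false instead of tripping the assertion, which mirrors the "skip when unallocated" behaviour the note describes.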

Version history

| Version | Updated | Notes |
| b9873 | | llama : add guard for K/V rotation input when buffer is unallocated (#25215) llm_graph_input_attn_kv::set_input and llm_graph_input_attn_kv_iswa::set_input call set_input_k_rot / set_input_v_rot whenever the rotation ten... |
| b9870 | | chat: trim messages sent to StepFun parser (fixes long reasoning loops) (#25238) - chat: trim messages sent to StepFun parser (fixes long reasoning loops) - add regression test; remove duplicate template - chat: trim Ste... |
| b9860 | Unknown | llama : add llama_model_ftype_name() (#25134) - llama : add llama_model_ftype_name() Expose the model file type (quantization) name, e.g. "Q8_0" or "Q4_K - Medium", through a new public C API. The returned pointer is val... |
| b9859 | Unknown | opencl: allow loading precompiled binary kernels from library (#23042) - opencl: allow loading binary kernel - opencl: add libdl.h - ggml-backend-dl is in ggml, which depends backend libs, thus ggml-opencl cannot depend... |
| b9852 | Unknown | opencl: initial q1_0 support (#25160) - opencl: general q1_0 support - opencl: add Adreno GEMM/GEMV for q1_0 |
| b9843 | Unknown | Revert "sched : reintroduce less synchronizations during split compute (#20793)" (#25138) |
| b9837 | Unknown | jinja, chat: add --reasoning-preserve flag (#25105) - jinja, chat: add --reasoning-preserve flag - correct help message |
| b9828 | Unknown | opencl: flash attention improvement (#25069) - opencl: rework FA kernel for f16 and f32 - opencl: flash-attention prefill prepass kernels - flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple - flash_attn_m... |
| b9803 | Unknown | opencl: flush profiling batch at shutdown for incomplete batches (#25016) |
| b9787 | Unknown | sycl : fix the failed UT cases of conv_3d (#24900) |
| b9776 | Unknown | vulkan: Apply bias before softmax in FA, to avoid overflow (#24909) |
| b9763 | Unknown | server : Add id to tool call responses api (#24882) |
| b9754 | Unknown | common/peg : implement ac parser for stricter grammar generation (#24869) - common/peg : implement ac parser - cont : extract functions - cont : tidy up - cont : remove a test - cont : move ac() def |
| b9744 | Unknown | common/peg : refactor until gbnf grammar generation (#24839) - common/peg : refactor until gbnf grammar into an ac automaton - cont : add a test with multiple strings - cont : pad state with 0s so rules line up - cont :... |
| b9733 | Unknown | ggml-webgpu: add adapter toggles for F16 on Vulkan + NVIDIA |
| b9717 | Unknown | ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul (#24753) - ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path... |
| b9693 | Unknown | metal : check for BF16 support in concat kernel (#24747) |
| b9673 | Unknown | sycl: Add optional USM system allocations (#22526) This introduces an optional feature to allocate large GPU buffers (≥ 1GB) using USM system allocations if supported by the device. It allows using buffers from the syste... |
| b9637 | Unknown | chat: add dedicated Cohere2MoE (North Code) parser (#24615) - chat: add dedicated Cohere2MoE (North Code) parser - Some renames to make @CISC happy :> |
| b9628 | Unknown | add sycl to check-release (#24583) |