winget install --id ggml.llamacpp
About llama.cpp
LLM inference in C/C++
What's new in b9873
llama : add guard for K/V rotation input when buffer is unallocated (#25215)

llm_graph_input_attn_kv::set_input and llm_graph_input_attn_kv_iswa::set_input call set_input_k_rot / set_input_v_rot whenever the rotation tensor pointer is non-null, but the tensor's buffer can be unallocated (NULL) when a graph only stores K/V without attending -- e.g. DFlash speculative decoding's KV-injection pass. set_input_k_rot then calls ggml_backend_buffer_is_host() on a NULL buffer and aborts with GGML_ASSERT(buffer).

Guard the four k_rot/v_rot inputs with the same "&& ->buffer" check that the adjacent kq_mask inputs already use in these two functions. When the buffer is unallocated there is no data to upload, so skipping is correct.

Fixes #25191

Signed-off-by: liminfei-amd 91481003+liminfei-amd@users.noreply.github.com

macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
- macOS Intel (x64)
- iOS XCFramework

Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32)
- Ubuntu x64 (SYCL FP16)

Android:
- Android arm64 (CPU)

Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows arm64 (OpenCL Adreno)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.3 DLLs
- Windows x64 (Vulkan)
- Windows x64 (OpenVINO)
- Windows x64 (SYCL)
- Windows x64 (HIP)

openEuler:
- DISABLED
- openEuler x86 (310p)
- openEuler x86 (910b, ACL Graph)
- openEuler aarch64 (310p)
- openEuler aarch64 (910b, ACL Graph)

UI: -...
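The K/V rotation guard described in the b9873 notes can be illustrated with a small, self-contained sketch. This is not the llama.cpp source: `mock_buffer`, `mock_tensor`, `mock_buffer_is_host`, `mock_set_input_rot`, and `set_rot_guarded` are hypothetical stand-ins introduced only to show the pattern of checking both the tensor pointer and its buffer before uploading.

```cpp
#include <cstdlib>

// Hypothetical stand-ins for the ggml buffer and tensor types (assumption:
// the real structures carry far more state than shown here).
struct mock_buffer {
    bool is_host;
};

struct mock_tensor {
    mock_buffer * buffer; // nullptr when the graph never allocated it
    int           uploads; // counts how many times input data was set
};

// Mimics the failure mode from the notes: querying a NULL buffer aborts.
static bool mock_buffer_is_host(const mock_buffer * buffer) {
    if (buffer == nullptr) {
        std::abort(); // stands in for GGML_ASSERT(buffer)
    }
    return buffer->is_host;
}

// Stand-in for set_input_k_rot / set_input_v_rot: touches the buffer.
static void mock_set_input_rot(mock_tensor * t) {
    if (mock_buffer_is_host(t->buffer)) {
        t->uploads++;
    }
}

// Guarded call: skip when the tensor is absent or its buffer is
// unallocated, since there is no data to upload in that case.
static bool set_rot_guarded(mock_tensor * t) {
    if (t && t->buffer) {
        mock_set_input_rot(t);
        return true;
    }
    return false;
}
```

Calling `set_rot_guarded` on a tensor whose `buffer` is `nullptr` returns `false` without reaching the aborting check, whereas calling `mock_set_input_rot` directly on the same tensor would abort.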
Version history
| Version | Updated | Notes |
|---|---|---|
| b9873 | Unknown | llama : add guard for K/V rotation input when buffer is unallocated (#25215) llm_graph_input_attn_kv::set_input and llm_graph_input_attn_kv_iswa::set_input call set_input_k_rot / set_input_v_rot whenever the rotation ten... |
| b9870 | Unknown | chat: trim messages sent to StepFun parser (fixes long reasoning loops) (#25238) - chat: trim messages sent to StepFun parser (fixes long reasoning loops) - add regression test; remove duplicate template - chat: trim Ste... |
| b9860 | Unknown | llama : add llama_model_ftype_name() (#25134) - llama : add llama_model_ftype_name() Expose the model file type (quantization) name, e.g. "Q8_0" or "Q4_K - Medium", through a new public C API. The returned pointer is val... |
| b9859 | Unknown | opencl: allow loading precompiled binary kernels from library (#23042) - opencl: allow loading binary kernel - opencl: add libdl.h - ggml-backend-dl is in ggml, which depends backend libs, thus ggml-opencl cannot depend... |
| b9852 | Unknown | opencl: initial q1_0 support (#25160) - opencl: general q1_0 support - opencl: add Adreno GEMM/GEMV for q1_0 |
| b9843 | Unknown | Revert "sched : reintroduce less synchronizations during split compute (#20793)" (#25138) |
| b9837 | Unknown | jinja, chat: add --reasoning-preserve flag (#25105) - jinja, chat: add --reasoning-preserve flag - correct help message |
| b9828 | Unknown | opencl: flash attention improvement (#25069) - opencl: rework FA kernel for f16 and f32 - opencl: flash-attention prefill prepass kernels - flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple - flash_attn_m... |
| b9803 | Unknown | opencl: flush profiling batch at shutdown for incomplete batches (#25016) |
| b9787 | Unknown | sycl : fix the failed UT cases of conv_3d (#24900) |
| b9776 | Unknown | vulkan: Apply bias before softmax in FA, to avoid overflow (#24909) |
| b9763 | Unknown | server : Add id to tool call responses api (#24882) |
| b9754 | Unknown | common/peg : implement ac parser for stricter grammar generation (#24869) - common/peg : implement ac parser - cont : extract functions - cont : tidy up - cont : remove a test - cont : move ac() def |
| b9744 | Unknown | common/peg : refactor until gbnf grammar generation (#24839) - common/peg : refactor until gbnf grammar into an ac automaton - cont : add a test with multiple strings - cont : pad state with 0s so rules line up - cont :... |
| b9733 | Unknown | ggml-webgpu: add adapter toggles for F16 on Vulkan + NVIDIA |
| b9717 | Unknown | ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul (#24753) - ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path... |
| b9693 | Unknown | metal : check for BF16 support in concat kernel (#24747) |
| b9673 | Unknown | sycl: Add optional USM system allocations (#22526) This introduces an optional feature to allocate large GPU buffers (≥ 1GB) using USM system allocations if supported by the device. It allows using buffers from the syste... |
| b9637 | Unknown | chat: add dedicated Cohere2MoE (North Code) parser (#24615) - chat: add dedicated Cohere2MoE (North Code) parser - Some renames to make @CISC happy :> |
| b9628 | Unknown | add sycl to check-release (#24583) |