* ci : remove tag from build-self-hosted.yml
* ci : slim -> self-hosted
* ci : prevent heavy CPU jobs from running on fast runners
* ci : prevent cmake pkg to run on dedicated fast runners
* ci : try to bump 3.11 -> 3.13
* ci : move lint back to 3.11
* ci : back to 3.11
* ci : add comment about UI jobs
* ci : move python requirements check to CPU runners
this job is a bit slow for a dedicated "fast" runner
* ci : add self-hosted ui workflow
* ci : fix UI naming
* tmp to check if arm64 fast is compatible with all jobs
* revert last commit
* requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf
The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on
PyPI for platform/CPython combinations where 2.6.x is not present.
The accompanying comment already says 'PyTorch 2.6.0 or later', so
the looser >=2.6.0 matches the documented intent and unblocks
pip install -r requirements/requirements-convert_hf_to_gguf.txt.
Fixes#23408
* requirements: bump torch floor to 2.11.0 per maintainer
* requirements: pin torch to ==2.11.0 per project policy
* requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy
* requirements: suppress check_requirements pin warning on mtmd
The check_requirements script flags '==' on lines in files matched by
*/**/requirements*.txt. Append the documented suppression comment to the
pinned torch and torchvision lines (and to the s390x platform marker lines)
so the check passes while keeping the pins required by project policy.
* ty: silence Tensor/Module union check on model[0].auto_model
With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns
Tensor | Module rather than Module, so model[0].auto_model fails ty
on the SentenceTransformer code path. The runtime behavior is
unchanged because SentenceTransformer always wraps a Module at
index 0. Adding a targeted unresolved-attribute ignore keeps the
type-check green without altering behavior. A follow-up issue
tracks typing the variable explicitly.
- change `k_copy_src1_to_contiguous` so that uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a know starts and ends
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup
BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.
Fixes#20478
* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0
The qk=1 kernel (used for F16 and BF16) iterates with stride
2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When
ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last
warp iteration accesses elements at col >= ncols, producing NaN for the
final row and wrong values for interior rows.
Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] %
(2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16
launcher to match. Quantized types use block-structured kernels with
different access patterns and keep the existing DMMV_X check.
Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70.
Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* pi : update
* ci : fix ios build
* ci : fix andoroid
* ci : fix apple builds
* cmake : add install() for impl libraries
Add install(TARGETS <target> LIBRARY) for all -impl libraries that were
changed from STATIC to shared (controlled by BUILD_SHARED_LIBS) in
commit bb28c1fe2. Without this, cmake --install fails to copy the shared
libraries, causing runtime errors like:
llama-server: error while loading shared libraries: libllama-server-impl.so
Ref: https://github.com/ggml-org/llama.cpp/issues/23494#issuecomment-4512912515
Assisted-by: llama.cpp:local pi
* ci : fix xcframework build
* cmake : remove STATIC from impl libraries, allow BUILD_SHARED_LIBS control
Remove explicit STATIC from all -impl libraries (server, cli, completion, bench,
batched-bench, fit-params, quantize, perplexity) so BUILD_SHARED_LIBS controls
shared vs static linkage.
Add WINDOWS_EXPORT_ALL_SYMBOLS ON for proper DLL export on Windows.
Assisted-by: llama.cpp:local pi
* cmake : enable LLAMA_BUILD_APP by default
Assisted-by: llama.cpp:local pi
* ci : disable app in build-cmake-pkg.yml
* vulkan: fuse snake activation (mul, sin, sqr, mul, add)
Add snake.comp shader with F32 / F16 / BF16 pipelines and
ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op
decomposition emitted by audio decoders (BigVGAN, Vocos) for snake
activation y = x + sin(a*x)^2 * inv_b and rewrites it to a single
elementwise kernel.
test_snake_fuse from the CUDA PR now also compares CPU naive vs
Vulkan fused across F32 / F16 / BF16.
* vulkan: address jeffbolznv review for fused snake activation
Rename T / C to ne0 / ne1 in the shader and push constants to match
the standard naming convention used across the Vulkan backend.
Tighten ggml_vk_can_fuse_snake: require x and dst to be contiguous
(the shader uses idx = i0 + i1 * ne0) and require a / inv_b to be
tightly packed on the broadcast dim (the shader reads data_a[i1]).
* vulkan: tighten snake fusion type checks for all operands (address jeffbolznv review)
* vulkan: reject snake fusion when ne[2] or ne[3] > 1 (address jeffbolznv review)
* vulkan: address 0cc4m review for fused snake activation
snake.comp is renamed to follow the ggml DATA_A_* / A_TYPE convention.
A_TYPE now applies to the activation tensor data_a instead of the
broadcast multiplier, and the bindings become data_a (A_TYPE), data_b
(float), data_c (float) and data_d (D_TYPE). A header at the top of
the shader maps each buffer to its role in y = x + sin(b * x)^2 * c.
On the C++ side, ggml_vk_can_fuse_snake reuses the existing snake_pattern
constant instead of duplicating the op list, sin_node is extracted as a
named local alongside the other chain nodes, and the broadcast operands
a and inv_b are now required to be GGML_TYPE_F32 to match the hardcoded
float bindings on data_b and data_c (the previous a->type == x->type
would silently reject any future BF16 or F16 chain once the supports_op
gate for SIN / SQR is lifted). ggml_vk_snake_dispatch_fused gets an
explicit GGML_TYPE_F32 case and GGML_ABORT on default in place of the
silent f32 fallback, and a stale comment about data_a[i1] / data_inv_b[i1]
is refreshed to match the new binding names.
* tests : move save-load-state from examples to tests
- Move examples/save-load-state/ to tests/test-save-load-state.cpp
- Remove subdirectory reference from examples/CMakeLists.txt
- Add test to tests/CMakeLists.txt as a model test
- Remove CODEOWNERS entry for removed example directory
Assisted-by: llama.cpp:local pi
* cont : update ci
Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache
to the /slots JSON response. These fields are already tracked internally
but were not exposed, making it impossible for clients to monitor prompt
evaluation progress during processing.
* metal : fix GGML_OP_SET kernel threads
* tests : extend test_cpy to support different src/dst shapes
Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.
- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility
Assisted-by: llama.cpp:local pi
* metal : optimize concat kernel with row batching for small widths
When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.
- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1
Assisted-by: llama.cpp:local pi
* tests : clean-up
* tests : refactor CPY shape tests to use dimension permutations
Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).
Assisted-by: llama.cpp:local pi
The destroy() function in server_context_impl only cleaned up the main
model and context (via llama_init.reset()) but did not free the speculative
decoder (spec), draft context (ctx_dft), or draft model (model_dft).
For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated
resources (KV cache, compute buffers) that are not freed when entering
the sleeping state. On each sleep/resume cycle, new resources are
allocated without the old ones being freed, leading to a VRAM leak
that eventually crashes the server with out-of-memory errors.
Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy()
before resetting llama_init, ensuring proper cleanup order to avoid
use-after-free.
ref: https://github.com/ggml-org/llama.cpp/issues/23395
Assisted-by: llama.cpp:local pi
* vocab : add Carbon-3B (HybridDNATokenizer) support
Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the
HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}.
The base BPE is Qwen3-4B-Base's; what differs is that text inside
<dna>...</dna> regions is chunked into fixed 6-mers (right-padded
with 'A' on the trailing partial), and any base outside ACGT maps
to <oov>.
* src/llama-vocab.{h,cpp}: new pre-type, dispatched from
llm_tokenizer_bpe_session::tokenize.
* src/llama-vocab-carbon.h: pure helpers (tokenize_carbon,
emit_dna_kmers) factored out for unit testing — no llama_vocab
dependency, vocab access goes through a std::function.
* conversion/base.py: detect HybridDNATokenizer by class name in
get_vocab_base_pre (chktxt collides with Qwen3 base since it
has no <dna>), and pass trust_remote_code=True in get_vocab_base
so the custom tokenizer class can load.
* tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer,
multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer
right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
two regions, vocab miss.
* vocab : align Carbon-3B changes with llama.cpp conventions
* Fold tokenize_carbon + emit_dna_kmers inline into
llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h),
matching how every other tokenizer keeps its helpers inside
llama-vocab.cpp.
* Replace the standalone unit test with the conventional
test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf
(vocab-only conversion) + .inp/.out fixtures covering single
6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial
right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
two regions.
* Register "carbon" in convert_hf_to_gguf_update.py's model list
(pointing at HuggingFaceBio/Carbon-3B) and teach both
AutoTokenizer call sites in the updater to pass
trust_remote_code=True for it, matching how t5 is special-cased.
* vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch
Refactor the conversion-side changes to follow the per-tokenizer-family
convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm,
etc. instead of conditionalising the shared get_vocab_base /
get_vocab_base_pre paths.
* conversion/base.py: add _set_vocab_carbon — self-contained, loads
with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA
vocab is visible, writes tokenizer.ggml.pre = "carbon" directly.
* conversion/llama.py: branch in LlamaModel.set_vocab on
tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and
dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py
(tokenizer_class branch between BertTokenizer / RobertaTokenizer) and
conversion/phi.py.
* conversion/base.py: revert the conditional in get_vocab_base and the
class-name short-circuit in the auto-generated get_vocab_base_pre.
* tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples
Add 6 cases from the Carbon-3B model card on top of the existing edge
coverage: the unterminated basic-completion prompt, the closed 33-bp
example, the metadata-conditioned prompt (with <vertebrate_mammalian>
and <protein_coding_region> which BPE-decompose since they are not in
the vocab), the documented anti-pattern of raw DNA without <dna> tags,
and the two likelihood-scoring examples. Brings the suite to 19 cases.
* vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE
Refactor per upstream review:
> This should be its own tokenizer model, ie. carbonhybriddna instead
> of gpt2 and not carbon pre-tokenizer. That way you can keep the
> correct pre-tokenizer, in case that ever changes.
Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a
new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific
branch inside llm_tokenizer_bpe_session::tokenize (only existing
pre-types differ in regex, not dispatch logic), and (b) conflated
"hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer".
This change moves it to its own vocab type, peer to PLAMO2, with the
GGUF model name matching the HF tokenizer class (HybridDNATokenizer):
* include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7.
* src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that
owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and
routes raw text through a DNA-aware splitter; wired into
init_tokenizer, tokenize, type_name, byte_to_token, and the
BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov>
are pure ASCII, so byte-level BPE decoding handles them).
LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type
config block alongside SPM/WPM/UGM/RWKV, where pre_type is set
to QWEN2 and the matching add_space_prefix / escape_whitespaces /
clean_spaces flags are applied — mirroring qwen2's BPE path so
byte-level BPE merging stays bit-identical to the Python
reference for non-DNA text.
* src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON.
* conversion/base.py: _set_vocab_hybriddna writes
tokenizer.ggml.model = "hybriddna" (no separate pre).
* conversion/llama.py: dispatch on tokenizer_class ==
"HybridDNATokenizer" same as bert.py / phi.py do.
* models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture +
regenerated metadata.
* convert_hf_to_gguf_update.py: drop the stale chkhsh entry and
trust_remote_code special-case (no longer needed since dispatch
is now class-name driven, not chkhsh).
Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}:
tokenization is bit-identical to the Python HybridDNATokenizer for
all 19 test fixtures plus the model-card metadata-conditioned
prompt; greedy completion produces the same DNA continuation as
the Python reference; spec-dec with 500M as draft for 8B still
works.
* vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA
* vocab : drop llm_tokenizer_bpe vocab-type assert
* vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch
* vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe
* vocab : annotate #endif with PRETOKENIZERDEBUG
* vocab : drop local hybriddna fixture (moves to ggml-org/vocabs)
* deduplicate
* simplify
* simplify
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4),
the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs,
self_kq_mask) are created as graph input nodes but never consumed by any compute node,
so the backend scheduler never allocates a buffer for them. Calling
mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits
GGML_ASSERT(buffer) at ggml-backend.cpp:194.
The same scenario applies symmetrically: if a model had zero SWA layers, the SWA
tensors would be unallocated.
Fix: guard both the base and SWA set_input calls with null/buffer checks, matching
the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674)
which has the comment: 'base tensors may not be allocated if there are no non-SWA
attention layers'.
Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for
unallocated tensors, preventing a null-dereference on the reuse path.
* hexagon: remove gathers and better handling of vtcm in ssm-conv
* hexagon: relax ssm-conv gating requirements
* hexagon: add new prefill ssm-conv backend test
* hexagon: remove trailing white space
* hex-rope: uninline rope_cache_init, otherwise it breaks after rebaseing with SSM_CONV changes
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
- HunyuanOCR shares the same HF arch and vision layout as HunyuanVL butwas split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference.
- Collapse OCR into the HUNYUANVL projector + HUNYUAN_VL text arch
* webui: Add max image size option
* remove magic numbers
* support all image formats
* use const
* Move regex to match b64 images to constants
* use SETTINGS_KEYS to get max image resolution setting
* Do not touch the image if already under the size threshold
* Move to backend sampling for MTP draft path
Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits
Make backend sampling more robust and fallback to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.
* Allow sampler chains to be partially offloaded to backend
* Add --spec-draft-backend-sampling argument. Enabled by default.
* opencl: refactor initialization
* opencl: refactor GPU identification
* opencl: rename for consistency
* opencl: cache global mem size in dev_ctx
* opencl: adjust log level
* opencl: load argsort and flash_attn kernels in supports_op
* argsort kernel must be built for supports_op for querying the max
workgroups
* flash_attn kernel has many variants, only load them when needed
ggml_backend_dev_by_name always appends a nullptr sentinel to the devices
vector. Skipping nullptr entries prevents assertion failure in
ggml_backend_dev_name.
Assisted-by: llama.cpp:local pi
* mtmd : deepseek-ocr fixes, improvements and refactoring
- image processing changes to achieve full parity with Pillow (reference impl)
- SAM mask casting only when flash-attn is on
- SAM refactor (build_sam() extracted so deepseek-ocr-2 can reuse it)
- llama-chat changes to fix server/WebUI issue (new media_markers_first())
- adapted test-chat-template and added test cases for deepseek-ocr
- changed regression test for deepseek-ocr to use CER+chrF scores for ground-truth comparison; removed embedding-model
- ty.toml ignore unresolved-import for tools/mtmd/tests/**
* image-text reordering fix removed
* refactor bool add_padding + pad_rounding enum into a single pad_style enum
* hmx-mm: update debug logging in hmx-mm
* hmx-mm: update dequant logic to use HVX_vector_x2/4
* hmx-mm: remove non-pipelined version of the quantize matmul
It seems that we don't reall need non-pipelined version
* hmx-mm: use activation depth mode and update naming
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
* hex-mm: minor hmx matmul naming updates
* hmx-mm: remove unused vars
* snapdragon: scripts bump default ubatch-size to 1K
* hexagon: combine HMX and power and clock settings into a single set_power call
* hmx-mm: remove leftover of the scale repl helper
* hexagon: fix editconf error
---------
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
* Adds initial PDL setup.
* Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.
* Further optimization pass of the first half of kernels
* Optimized PDL barriers for the second batch of kernels
* Further refinements after rebase.
* Moves pdl logic to separate function, removes some whitespace
* Strips post-hoc PDL logic
* Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to
overlap execution with previous kernels
* Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL
* Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL
* Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx,
to enable hip/musa compatibility
* Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32
* Enrolls flash_attn_combine_results
* Fix: Drops needless and broken check of CUDA arch for PDL. PDL either
works or is without effect.
* Enrolls flash-attention kernels to pdl
* Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for
kernels args. This fixes PDL.
* Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via
and template alias and template expansion
* Enrolls all remaining kernels for qwen3-coder-next into PDL
* Remove all PDL LC calls to create a baseline
* Added LC according to internal guidance and tested kernel performance.
* Enrols missing qwen3-5 kernels passively into PDL.
* Kernel optimizations (LC signals) for qwen3.5
* Enrolls ssm-scan kernels into PDL
* Adds GGML_CUDA_PDL command line option to toggle PDL.
* Fix: Ada and lower compilation by guarding PDL calls correctly
* Cleanup: Removes commented out GGML_CUDA_PDL_LC
* Cleanup: Removes experimental comments
* Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.
* Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.
* Fix: Correct PDL en/disablement based on device-side arch check. Host
side check is UB. Required moving from macros to inlined functions
* Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1
* Enable PDL by default for Hopper+ devices
* Enrolls softcap_f32 and two flash_attn kernels into PDL.
* Improves flash attn PDL barrier placement
* Fix: Perf regression on ada; excludes ada and below from PDL launches
* Improves some sync barrier placements
* Drops superfluous constructor
* Adds #endif guard comments
* Reverts experimental change to top-k-moe.cu, which moved expensive allocations
in front of the PDL barrier. It did not have a meaningful impact.
* Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0
PDL is disabled
* Revert "Drops superfluous constructor". Adds const to remaining
arguments
This reverts commit 12b1d250da.
* Cleanup: Removes and fixes some comments and whitespace
* Clarifies comment of sync-barrier position
* Relocates and refactors PDL launch functions and accessories
* Adds error checking to the regular kernel launch path
* Drops "auto" in favor of "ggml_cuda_kernel_params"
* Adds "const" to ggml_cuda_kernel_launch_params
* [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy
* mtmd: fit_params now take into account mmproj
* rename alloc_compute_meta to reserve_compute_meta
* rm unused functions
* add ggml_backend_dev_t support
* add debug log
* snapdragon: update compiler flags to enable all CPU features
* snapdragon: update readme to point to toolchain v0.6
* snapdragon: bump toolchain docker to v0.6
* opencl: add q4_k moe support
* opencl: add q5_k moe support
* opencl: add q6_k moe support
* opencl: adjust format
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
This commit attempts to clarify a code comment in graph_mtp regarding
where the MTP layer is stored.
The motivation for this is that it was not obvious to me what the
original comment meant and hopefully this makes it clearer.
Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().
Assisted-by: llama.cpp:local pi
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* save-load-state : refactor into separate phase functions
- Split monolithic main() into 4 self-contained phase functions, each
managing its own context/sampler/batch lifecycle
- Each function tokenizes internally using its local ctx instance
- main() is now a clean orchestrator: init -> run phases -> assert results
- Proper resource cleanup on every exit path (return {} on error)
Assisted-by: llama.cpp:local pi
* save-load-state : use params.out_file instead of separate state_file
- Remove state_file parameter from all phase functions
- Each function accesses params.out_file directly
- Initialize params.out_file in main alongside params.prompt
Assisted-by: llama.cpp:local pi
* save-load-state : use smart pointers for ctx and smpl
- Replace raw llama_context* with llama_context_ptr
- Replace raw llama_sampler* with llama_sampler_ptr
- Remove all manual llama_free() and llama_sampler_free() calls
- Keep llama_batch as raw (managed manually with llama_batch_free)
Assisted-by: llama.cpp:local pi
* save-load-state : add local llama_batch_ptr RAII wrapper
- Add llama_batch_ptr struct holding llama_batch by value
- Calls llama_batch_free() in destructor
- Eliminates all manual llama_batch_free() calls
Assisted-by: llama.cpp:local pi
* save-load-state : replace printf/fprintf with logging macros
- Add log.h include
- Replace fprintf(stderr, ...) errors with LOG_ERR
- Replace fprintf(stderr, ...) info with LOG_TRC
- Replace printf output with LOG
Assisted-by: llama.cpp:local pi
* save-load-state : refactor tests to check results inline
Each follow-up phase now accepts an expected result and performs
the comparison internally instead of collecting results in main().
Assisted-by: llama.cpp:local pi
* save-load-state : improve test output readability
Add phase labels, remove redundant run prefixes, and show
PASS after each test.
Assisted-by: llama.cpp:local pi
* pi : add rule about git signing
* save-load-state : simplify llama_batch_ptr
Change get() to return a reference and remove operator*().
Use batch.get() throughout for consistency.
Assisted-by: llama.cpp:local pi
* save-load-state : extract generate_tokens helper
Factor out the repeated token generation loop into a shared
helper function used by all phases.
Assisted-by: llama.cpp:local pi
* save-load-state : update comments to use test terminology
Replace "Phase" with "Test" and list each test's steps
as bullet points.
Assisted-by: llama.cpp:local pi
* save-load-state : rename test functions
Rename to test_baseline, test_state_load, test_seq_cp_host,
test_seq_cp_device. Update comments and logs accordingly.
Assisted-by: llama.cpp:local pi
* pi : add rule to never git push without confirmation
Assisted-by: llama.cpp:local pi
* common : add model_only option to common_init_from_params
Add bool model_only parameter to skip context creation,
sampler init, and context-dependent setup.
Use in save-load-state to initialize only the model,
with each test creating its own context.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* llama-eval : add per-problem summary table to HTML reports
- Add chunk_idx and problem_idx to TaskState and saved case dicts
- Group completed cases by problem_idx in dump_html()
- Render per-problem summary table before individual task table
- Columns: Problem (zero-padded), Runs, Correct (n/r),
Tokens (min/avg/max), T/s (min/avg/max), Gen s (min/avg/max)
- Sorted by problem index, monospace font, right-aligned numbers
- Colspan headers for grouped stats, auto width
- Simulator: add /v1/models endpoint, timings in response,
template-aware question matching, --dataset arg (aime/aime2025)
Assisted-by: llama.cpp:local pi
* llama-eval : add tabs for Detailed and Summary tables, apply monospace font globally
- Wrap Detailed and Summary tables in switchable tabs (Detailed active by default)
- Remove summary-section wrapper, use tab labels instead
- Apply monospace font to all tables and the top bar
Assisted-by: llama.cpp:local pi
* llama-eval : redesign top bar as CSS grid label/value pairs
- Replace flat span list with 4-column grid layout (2 pairs per row)
- Labels in muted color (#888), values in dark (#222)
- Bold dataset name and model name
- Removed media query, always uses 4 columns
Assisted-by: llama.cpp:local pi
* llama-eval : use realistic token counts and throughput in simulator
- comp_tokens: [30, 80] → [10000, 60000]
- tps_gen: derived → uniform [90.0, 110.0]
- t_gen_ms: now computed from tokens/tps
Assisted-by: llama.cpp:local pi
* llama-eval : color Answer column green/red based on correctness
Use the same .correct/.incorrect CSS classes on the Answer column
to make correct answers green and incorrect answers red.
Assisted-by: llama.cpp:local pi
* llama-eval : fix pyright errors from max(..., key=len) type inference
Use key=lambda x: len(x) instead of key=len so the type checker
infers the return type as str instead of Sized, fixing:
- unresolved-attribute: Object of type Sized has no attribute lower
- not-subscriptable: Cannot subscript object of type Sized
Assisted-by: llama.cpp:local pi
With the introduction of MTP we can have multiple compute contexts for
the same RPC device. In this case last_graph_uid is not updated properly
when contexts are being switched. This patch fixes this by moving
last_graph_uid to the device context, making sure it is always updated.
closes: #23242
* ggml-hexagon: add PAD op HVX kernel
Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.
* hex-ggml: remove duplicate op cases (merge conflict)
* hex-pad: fix editorconfig checks and macro alignment
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* docker: add OCI image labels to all published images
* docker: propagate OCI labels as manifest and index annotations
* docker: drop hardcoded org URL and revert accidental intel version bump
The OCI image url and source are now driven by build args with a sensible default. The workflow passes the actual repository url so fork builds get labels pointing at the fork instead of upstream. Also restores the IGC, compute runtime, and IGDGMM versions in the intel Dockerfile labeled stage which I accidentally bumped in the first commit.
* docker: add skip_s390x workflow_dispatch input for fast test runs
Lets maintainers and PR authors trigger the docker workflow without the s390x build target, which depends on the IBM Z runner and is by far the slowest job in the matrix. The flag filters the s390x row out of the build matrix before merge_matrix is derived, so the merge job sees a consistent shape too.
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
---------
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars
* refactor: skip MCP proxy probe when no server requires it
* refactor: suppress expected disconnect errors during MCP client shutdown
* refactor: Deduplicate requests
* refactor: deduplicate model fetching across ROUTER and MODEL modes
* refactor: Clean up models logic
* chore: Add `.env.example` file
* refactor: replace client-side CORS proxy probe with server status flag
* refactor: Post-review fixes
* test: add vitest client setup with API fetch mocks
* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI
For whatever reason, the files are under additional sub-path
`vulkan/` under the cmake directory, which does not match either
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`),
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`).
This allows for SPIRV-Headers to be found, as currently the CI
runner's setup does not seem to include the relevant path in
list of search locations.
* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers
This is installed by the project if it is built and installed.
Receiving an error during the configuration step is generally
preferred to receiving an error in the middle of a build.
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).
The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.
Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).
Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`.
See 59778f0196.
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The MUL_MAT test loop iterates over base_types[] to generate non-contig
permutation cases (3 standard permutations across n in {1, 8, 16}).
BF16 is absent from base_types[], so these 9 cases were never generated
for BF16 even though every other type covered by base_types[] tests them.
Add the missing 9 cases explicitly: BF16 x F32, m=16, k=256, bs=[2,3],
permutations {0,2,1,3}, {0,1,3,2}, {0,3,2,1}, with n in {1, 8, 16}.
Suggested-by: @jeffbolznv
* move conversion code to a dedicated conversion directory and split the files akin to the src/models architecture
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Add Aime2026Dataset class loading from MathArena/aime_2026 on
HuggingFace. 30 problems (two sets of 15), single config/split.
Usage: --dataset aime2026
Assisted-by: llama.cpp:local pi
* Support for Codex CLI by skipping unsupported Responses tools
* Warn on skipped Responses tools and preserve gpt-oss apply_patch rejection
* Revert gpt-oss apply_patch special handling
Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ the tiles they need to be 32 logical units long in direction of the attention head; for head sizes 80 and 112 that are not exactly divided by 32 the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32 which is why it's also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout.
I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance that the tile kernel for head sizes > 128.
* update test scripts
* align CI behavior between linux and android
* remove automatically cancel in 15min
* enable cancel-in-progress
* fix ty check issue
* update and fix pylint issue
* update runner such that we are not restricted by the 15min limit rule
* fix flake8 lint issue
* update runner according to review feedback
* code update according to review feedback
* switch from llama-cli to llama-completion binary with -no-cnv flag
* ggml-webgpu: makes the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming 32 subgroup size.
* ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove hardcoded 32 when determine the value of reduce_wg_size and vec_nwg_cap
* fix: Propagate version tag to WebUI asset download in self-hosted CI
* refactor: Apply suggestions from @CISC
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix: Skip npm build when Node.js is not installed
Avoid 'no such file or directory' errors on CI runners that lack
Node.js. Check if npm is available via find_program before attempting
npm install + npm run build. Falls back to HF Bucket download.
* fix: Use + separator for ASSETS list to fix Windows build
Replace fragile \; escaping with a + separator when passing the
WebUI asset list via -DASSETS to the download script. On Windows,
the \; escaping was not reliably preserved through the CMake build
system, causing all asset filenames to be concatenated into one
(e.g., 'index.html;bundle.js;bundle.css;loading.html' as a single
file), which broke the HF Bucket download and subsequent xxd.cmake
step.
+ is safe because it is not special in cmd.exe (unlike | which is a
pipe operator), not special in CMake's -D argument parser, and not
a valid Windows filename character. CMakeLists.txt joins assets
with + and webui-download.cmake splits them back via regex.
* fix: Validate HF_WEBUI_VERSION environment variable with regex
Add input validation for the HF_WEBUI_VERSION env var to prevent
CMake list separator or path-traversal issues in stamp filenames
and download URLs. Rejects non-conforming characters early.
* fix: Remove 'latest' fallback for HF_WEBUI_VERSION
When needs.determine-tag.outputs.tag_name is empty, let CMake's
default resolution handle it (empty -> git-based version lookup)
instead of falling back to 'latest'. This ensures the sentinel
stamp file is consistent with CMake's resolution logic.
* fix: Demote checksum verification failure to warning instead of hard gate
* fix: End line character
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests
- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes#21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.
This mirrors the Qwen2 fix (commit 0d049d6), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows.
Closes#21919.
* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks
* cont : remove trailing whitespace
---------
Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
* SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations
Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.
On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.
All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.
* SYCL: address review feedback - remove try/catch, check device types, deduplicate
- Remove try/catch from malloc/free/memcpy helpers, check backend and
device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
- Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
and declare in common.hpp to eliminate code duplication
- Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
- Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
host-staged path for iGPU-to-dGPU transfers
- Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
in CMakeLists.txt (co-authored with @arthw)
* SYCL: add build/runtime flags for Level Zero, address review feedback
Implements the architecture suggested by @arthw: compile-time and runtime
flags to cleanly separate Level Zero and SYCL memory API paths.
- Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
Zero code is wrapped in #ifdef so the build works on systems without
the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
loader library and headers are checked before enabling.
- Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
whether Level Zero or SYCL memory APIs are used. Only one API style is
used per session, no mixing. If Level Zero is enabled but the devices
don't support the Level Zero backend, it auto-disables with a warning.
- Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
is not called anywhere in the backend) and used try/catch for flow control.
- Update SYCL.md with documentation for both new parameters.
Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
(Claude). Code reviewed and tested on my hardware.
* SYCL: unify Level Zero malloc/free call sites, address review feedback
Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
Both functions are now unconditionally available — Level Zero code is
#ifdef'd inside the functions, not at call sites. All call sites use
uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.
Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
sites (-29 lines net).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths
Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
so the Level Zero code path is compiled and tested in CI.
Fix two bugs found during extended dual-GPU testing (no
ONEAPI_DEVICE_SELECTOR set):
- The Level Zero backend check was iterating all SYCL devices
including CPU. The OpenCL CPU device caused Level Zero to be
disabled for the GPUs, defeating the fix on multi-GPU systems.
Added is_gpu() filter so only GPU devices are checked.
- sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
were still calling sycl::malloc/sycl::free directly, bypassing the
Level Zero path. Routed through ggml_sycl_malloc_device/free_device
for consistency with the other device memory call sites.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* SYCL: address arthw review feedback on Level Zero memory API structure
- Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
- Switch both helpers to use g_ggml_sycl_enable_level_zero global
instead of per-call queue backend checks
- Remove #ifdef wrapper from global definition; always declare at 0,
add #else branch in init block so it stays 0 when L0 not compiled in
- Update init loop comment to explain GPU-only device check
- CMakeLists: message(STATUS) before the if block; align option wording
AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
<5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* SYCL: remove unused cstdio/cstdlib includes from common.cpp
Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Apply suggestions from code review
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
* SYCL: preserve Level Zero allocation path during early malloc
* ci: fix Level Zero package conflict in Intel Docker build
* ci: find Level Zero loader in oneAPI package step
* ci: allow Windows SYCL package without Level Zero DLL
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
* opencl: add q5_0 moe support
* opencl: add q5_1 moe support
* opencl: avoid potential leak
* opencl: suppress unused var warning when building for non-Adreno
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* server, webui: accept continue_final_message flag for vLLM API compat
Add the continue_final_message body flag from the vLLM and transformers
API. When set together with add_generation_prompt false, it triggers the
existing prefill_assistant code path, regardless of the server side
opt.prefill_assistant option. Mutual exclusion with add_generation_prompt
true is enforced, matching vLLM behavior.
WebUI sends continue_final_message and add_generation_prompt false on
the Continue button, with the matching opt in option on the chat service.
Pure API alignment, no change to the prefill logic itself. Paves the way
for the upcoming per-template prefill plumbing in common/chat.
* test: add coverage for continue_final_message vLLM compat flag
Two cases on top of the existing assistant prefill coverage. First,
continue_final_message true with add_generation_prompt false produces
the same rendered prompt as the prefill_assistant heuristic, proving
the new flag is a correct alias of the existing path. Second, both
flags set to true is rejected with HTTP 400, matching the
vLLM/transformers mutual exclusion contract.
* chore: update webui build output
* server, webui : support continue generation on reasoning models (#22727)
Remove the throw blocking assistant prefill on reasoning models and
orchestrate thinking tags around the prefilled message so the parser
routes the next stream chunks correctly. WebUI drops the reasoning
guard on the Continue button, sends reasoning_content with the
prefilled message and persists partial reasoning on stop so the CoT
survives reload and resume.
Scope : templates with a simple thinking_start_tag / thinking_end_tag
pair. Channel-based templates like GPT-OSS are out of scope, pending
a per-template prefill API in common/chat.
First step toward #21754.
* chore: update webui build output
* server: reject reasoning prefill on channel based templates
* ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled)
* ggml-zendnn : restore original fallback logic when adaptive fallback is disabled
* hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase
* hmx-mm: optimize per-group scale handling
* hmx-fa: optimize slope load from vtcm
* hmx-fa: use aligned access where possible in hmx-utils
* hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32
* fix(unary): correct the gelu, gelu quick and gelu erf functions
* fix(flash-attn-tile): fix the hardcode v type
* fix(flash_attn): fix tile path
* fix: pass editorconfig and address the type conflicts
* fix: remove reduant pipeline keys
* fix: remove inline min/max group size functions and revert the flash attn path order
* fix: use clamp to avoid NaN for GELU
* fix: use the right range for exp, 80 is safer for f32 exp
* model-conversion : add causal-convert-mmproj target [no ci]
This commit adds a new Make target that only converts the mmproj model.
The motivation for this that the causal-convert-mm-model target will
convert both the test model and the mmproj model which is nice when the
model model conversion is finalized. But during development it is nice
to be able to just convert the mmproj model and not have to wait for
the often more time consuming text model conversion.
* add path model path validation check
* working llama-eval mc and math suite
* multi source llama-eval
* Add readme
* add checkpointing
* examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:
- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting
Also includes test scripts and documentation for testing and understanding
the simulator functionality.
* examples: refactor test-simulator.sh for better readability
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
* docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
* examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
* docs: remove README.md from llama-eval
* examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
* examples: use HF_HUB_OFFLINE to avoid HF Hub warnings
* examples: remove HF_HUB_OFFLINE to allow dataset download
* examples: use cached dataset path to avoid HF Hub requests
* examples: use cached dataset path in simulator to avoid HF Hub requests
* docs: update llama-eval-discussion.md with session work summary
* examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
* docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
* examples: add task summary table to llama-eval-new.py
* eval : print progress
* eval : add prompts
* test : fix path
* sim : fix answer matching
* eval : support multiple dataset runs
* minor
* improve grader
* docs
* remove old files
* datasets : add gsm8k
* add gpqa + sampling + docs
* rename
* grader : improve example answers
* cont
* datasets : add aime2025
* grader : update prompt
* grade : improve regex + logs
* datasets : fix aime2025
* cleanup
* add AGENTS.md
* ignore errors
* resume eval
* cleanup
* fix counts
* simplify
* fix prompts
* add html
* store full response
* add tokens
* resoning and error handling
* refactor
* track total time
* remove junk
* eval : unify "judge" terminology to "grader"
Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).
Assisted-by: llama.cpp:local pi
* eval : add Wilson score confidence interval to results
Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.
* llama-eval : add per-task generation speed from server timings
Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.
Assisted-by: llama.cpp:local pi
* llama-eval : add per-task generation time from server timings
Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.
Assisted-by: llama.cpp:local pi
* llama-eval : rename display, escaped, and count variables to use prefix convention
- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)
Assisted-by: llama.cpp:local pi
* llama-eval : support multiple evaluation endpoints with dynamic task distribution
- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before
Assisted-by: llama.cpp:local pi
* llama-server-simulator : replace Flask with stdlib http.server
- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports
Assisted-by: llama.cpp:local pi
* llama-eval : update README with PR link and quick-start examples
Assisted-by: llama.cpp:local pi
* llama-eval : track model name in eval state and verify on resume
- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming
Assisted-by: llama.cpp:local pi
* llama-server-simulator : fix comment - Dice coefficient, not Levenshtein
Assisted-by: llama.cpp:local pi
* llama-eval : require --grader-model or --model when using --grader-type llm
Assisted-by: llama.cpp:local pi
* llama-eval : protect dump() with lock for thread safety
Assisted-by: llama.cpp:local pi
* llama-eval : compact HTML report output
- Replace verbose summary table with single inline bar
- Shorten status text: '✓'/'✗'/'–'/'!' instead of full words
- Flatten CSS: remove box-shadows, border-radius, reduce padding
- Use system-ui font, 13px table, 12px details
- Conditional reasoning section (only shown when present)
- Single toggle JS function instead of two
- Shorter column headers
Assisted-by: llama.cpp:local pi
* llama-eval : check server connectivity on startup
- Hit /v1/models for each server before evaluation
- Exit with error if any server is unreachable
- Print comma-separated model IDs per server in startup output
- Sequential checks, no retries, no timeout override
Assisted-by: llama.cpp:local pi
* llama-eval : use server1/server2 instead of gpu1/gpu2 in README
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: gatbontonpc <gatbontonpc@gmail.com>
* convert : add split() method to LoraTorchTensor
* Fix python type-check
* Fix flake8 Lint
* fix: handle positional dim arg in torch.split dispatch
* Fix type-check again
* Fix type-checks
* Remove unit test per reviewers feedback
* work around ty deficiency
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Q4_1 MoE CLC pass sanity check
* remove unnecessary code
* opencl: remove unnecessary asserts and reformat
* opencl: fix supports_op for q4_1 moe
* q4_1 moe is supported by Adreno with certain shapes
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
`im2col_cuda` and `im2col_3d_cuda` both dispatch with
`block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on
raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet
at 11 s lands at OW = 176000 -- and the launch returns
`invalid configuration argument`.
Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the
kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern
already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel`
and 3D `im2col_3d_kernel` need the same fix. Bit-identical for
OW <= 65535 (single iteration of the new outer loop).
Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s /
16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns
`invalid configuration argument`, post-fix runs to completion.
Existing test-backend-ops im2col cases unchanged.
* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)
* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)
* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)
* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16
bin_bcast only dispatches F32/F16 type triplets, mirror the
vulkan filter so unsupported types fall back through cpy
instead of aborting.
* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images throught the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
* replace old type field of type common_speculative_type in the
common_params_speculative struct with a vector to allow multiple
types to be specified
* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
to figure out which implementations the user has enabled
* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
to parse the already user provided spec types
* all speculators run sequentially, best one wins (we verify its drafted tokens)
* maximize expected accepted tokens for current round by calculating the
product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
and the draft's length
---------
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
This commit updates the command line arguments to use the correct names
and values which are now required.
The motivation for this change is that currently running the example
command as is will generate the following errors:
```console
error while handling argument "--color": error: unknown value for --color: '--sampling-seq'
usage:
-co, --color [on|off|auto] Colorize output to distinguish prompt and user input from generations
('on', 'off', or 'auto', default: 'auto')
'auto' enables colors when output is to a terminal
error while handling argument "-fa": error: unknown value for --flash-attn: '--temp'
usage:
-fa, --flash-attn [on|off|auto] set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
(env: LLAMA_ARG_FLASH_ATTN)
error while handling argument "--draft-max": the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max
usage:
--draft, --draft-n, --draft-max N the argument has been removed. use --spec-draft-n-max or
--spec-ngram-mod-n-max
(env: LLAMA_ARG_DRAFT_MAX)
error while handling argument "--draft-min": the argument has been removed. use --spec-draft-n-min or --spec-ngram-mod-n-min
usage:
--draft-min, --draft-n-min N the argument has been removed. use --spec-draft-n-min or
--spec-ngram-mod-n-min
(env: LLAMA_ARG_DRAFT_MIN)
```
* convert : add image break token fallback
This commit adds a image_break_token_id fallback for mistral where the
config contains a image_break_token_id of -1:
```console
"vision_encoder": {
"image_token_id": 10,
"image_break_token_id": -1,
...
```
But the tokenizer.json has this token:
```console
115 "id": 12,
116 "content": "[IMG_BREAK]",
117 "single_word": false,
118 "lstrip": false,
119 "rstrip": false,
120 "normalized": false,
121 "special": true
122 },
```
If we look in convert_hf_to_gguf.py we have:
```python
elif self.is_mistral_format:
# hparams is already vision config here so norm_eps is only defined in global_config.
self.hparams["norm_eps"] = self.global_config.get("norm_eps", None)
assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json"
if self.use_break_tok:
self.img_break_tok_id = self.find_vparam(["image_break_token_id"])
```
The motivation for this is that currently converting this models
results in the following error:
```console
load_hparams: model size: 5131.60 MiB
load_hparams: metadata size: 0.15 MiB
clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break
mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf
Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf
```
With this fallback the model loads successfully.
Resolves: https://github.com/ggml-org/llama.cpp/issues/22901
* Revert "convert : add image break token fallback"
This reverts commit 292e40cfdf.
* convert : add image break token fallback
This commit adds a image_break_token_id fallback for mistral where the
config contains a image_break_token_id of -1:
```console
"vision_encoder": {
"image_token_id": 10,
"image_break_token_id": -1,
...
```
But the tokenizer.json has this token:
```console
115 "id": 12,
116 "content": "[IMG_BREAK]",
117 "single_word": false,
118 "lstrip": false,
119 "rstrip": false,
120 "normalized": false,
121 "special": true
122 },
```
If we look in convert_hf_to_gguf.py we have:
```python
elif self.is_mistral_format:
# hparams is already vision config here so norm_eps is only defined in global_config.
self.hparams["norm_eps"] = self.global_config.get("norm_eps", None)
assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json"
if self.use_break_tok:
self.img_break_tok_id = self.find_vparam(["image_break_token_id"])
```
The motivation for this is that currently converting this models
results in the following error:
```console
load_hparams: model size: 5131.60 MiB
load_hparams: metadata size: 0.15 MiB
clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break
mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf
Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf
```
With this fallback the model loads successfully.
Co-authored-by: Pascal <admin@serveurperso.com>
Resolves: https://github.com/ggml-org/llama.cpp/issues/22901
* convert : allow zero value for img_break_tok_id
* ggml-cuda: add internal AllReduce provider for tensor parallelism
Introduces a NCCL-free AllReduce implementation for LLAMA_SPLIT_MODE_TENSOR
using a single-phase CUDA kernel that pipelines D2H copy, cross-GPU
handshake via pinned-memory volatile flags, and the reduction in one
kernel launch per GPU.
New files:
- ggml/src/ggml-cuda/comm.cuh — ggml_cuda_allreduce_provider enum
- ggml/src/ggml-cuda/allreduce.cuh — pipeline API declarations
- ggml/src/ggml-cuda/allreduce.cu — kernel + pipeline init/dispatch
ggml-cuda.cu changes:
- ggml_backend_cuda_comm_context gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* llama-bench: add --allreduce flag to select AllReduce provider
Adds --allreduce <auto|nccl|internal> to llama-bench (and via the shared
field pattern, consistent with other multi-value flags). Useful for
isolating hangs or regressions in tensor-parallel mode: pass --allreduce nccl
to force NCCL and bypass the internal provider.
Also fixes ggml_cuda_select_allreduce_provider() to treat an empty
GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when
llama-bench sets it to "" for the "auto" case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xt gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* llama-bench: rename --allreduce to --reduction-provider / -rp
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
via the shared
field pattern, consistent with other multi-value flags). Useful for
isolating hangs or regressions in tensor-parallel mode: pass --allreduce nccl
to force NCCL and bypass the internal provider.
Also fixes ggml_cuda_select_allreduce_provider() to treat an empty
GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when
llama-bench sets it to "" for the "auto" case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xt gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* llama-bench: pass WARN/ERROR log messages through in non-verbose mode
The null log callback was silently dropping all messages. WARN and ERROR
should always be visible since they indicate legitimate issues (e.g. a
requested reduction provider not being available).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vider.
Also fixes ggml_cuda_select_allreduce_provider() to treat an empty
GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when
llama-bench sets it to "" for the "auto" case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xt gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* cmake: improve NCCL detection for source-tree builds, add static/dynamic switch
FindNCCL.cmake now searches the cmake source-build layout used by the Windows
NCCL port (cmake/lib/Release for static, cmake/src/Release for dynamic import
lib) and also checks src/include for the generated nccl.h header.
New option GGML_CUDA_NCCL_STATIC (default OFF) selects static vs dynamic
linking and controls which paths and library names are searched.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
for the "auto" case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xt gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: add AllReduce hang watchdog (GGML_CUDA_AR_WATCHDOG)
When compiled with -DGGML_CUDA_AR_WATCHDOG=ON, uses a debug kernel
variant that writes per-GPU spin diagnostics to pinned host memory.
A host-side blocking poll (cudaEventQuery + volatile reads) detects
hangs and logs WARN with the last observed arrival counters and spin
counts, controlled by GGML_CUDA_AR_WATCHDOG (ms timeout) and
GGML_CUDA_AR_MAX_SPIN (kernel bailout) env vars at runtime.
Zero overhead on the production path — all debug code is behind #ifdef.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: fix intermittent AllReduce hang on Blackwell PCIe
Add __threadfence_system() before the arrival signal write in
signal_set to ensure D2H data is globally visible before the peer
observes the arrival flag. Without this fence, the peer could enter
Phase 3 host reads before the data had fully landed, causing an
intermittent deadlock on RTX 5090 (Blackwell, PCIe-only).
Also redesign the watchdog from a blocking dispatch-thread poll to a
non-blocking background thread, eliminating the ~20ms per-slot
latency the old design added.
Verified: 30/30 soak test runs clean at ~50 t/s (previously ~1-in-15
hang rate).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: fix watchdog shutdown ordering and pipeline_free drain
- Stop watchdog thread BEFORE destroying GPU resources (events, streams)
to prevent polling destroyed handles → spurious "busy" readings
- Add cudaStreamSynchronize in pipeline_free to drain in-flight kernels
before freeing pinned host buffers they may still be reading
- Sleep-first watchdog polling: no +0ms noise, only logs when a kernel
is genuinely stuck past the poll interval
- Check wdog_stop in both outer and inner loops so join() returns
promptly instead of draining the entire queue
- Add Phase 3 breadcrumbs to debug[3] for hang localization
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: replace event-based watchdog with per-GPU ring buffer
Completely rework the GGML_CUDA_AR_WATCHDOG system:
- Replace the shared debug_buf + event-polling + queue design with
per-GPU ring buffers in pinned host memory
- Kernel writes a debug record only on spin-limit bailout: claims a
ring slot via atomicAdd (single-GPU host atomics work on RTX 5090),
writes fields, fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly
Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: normalize line endings to LF (undo Windows CRLF conversion)
Five files were inadvertently converted to CRLF by the Windows
development environment, causing every line to show as changed in
diffs against master.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
imit bailout: claims a
ring slot via atomicAdd (single-GPU host atomics work on RTX 5090),
writes fields, fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly
Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* .gitattributes: force LF line endings to prevent Windows CRLF conversion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
elopment environment, causing every line to show as changed in
diffs against master.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
imit bailout: claims a
ring slot via atomicAdd (single-GPU host atomics work on RTX 5090),
writes fields, fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly
Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: move GGML_CUDA_AR_WATCHDOG from CMake option to local define
The watchdog is development-only; a global CMake option is overkill.
Move the toggle to a #define at the top of allreduce.cu (set to 0 by
default) and remove the option from ggml/CMakeLists.txt and the CUDA
CMakeLists.txt add_compile_definitions block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly
Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* unify kernel debug paths
* use __threadfence_system explicitly (not in ggml_cuda_ar_signal_set)
* preferentially use internal reduction for <=2 GPUs
* templatize the main kernel to support fp16/bf16
* restore llama-bench.cpp changes
* revert CMakeLists changes
* remove notes from repo
* remove dead warmup code
* fix comments
* improve reduction provider fallback code
* add messages for allreduce fallback
* rework reduction provider init to not call ncclCommInitAll if using the internal provider
* fix case where a given tensor has not been computed
* add chunked mode to the kernel for unlimited vector size
* rework a few checks/fallbacks
* various small cleanups
* allow disabling CUDA reductions completely (falling back to the non-CUDA butterfly mode)
* simplify reduction provider selection
* minor simplifications
* more cleanups/fixes
* prototype alternate path for large reductions
* chunked version of large reduction path
* use bf16 for large reductions
* experimental reduction using cudaMemcpyPeerAsync (slightly slower)
* revert experimental change
* add combined conversion/reduction kernel
* add bf16 wire format for single kernel mode
* experimental on-stream small reduction kernel
* double buffer arrival slots, use token (incrementing) method
* double buffer host_buf for small reductions
* put in waits for use of host_mem in large reduction case (prevents stomping on in-use memory
* remove watchdog code
* various cleanups / dead code removal
* fix fp16 mode
* fix some comments/logging statements
* use increasing token scheme for arrival signals
* add top-level comment to allreduce.cu
* improve top-level comment in allreduce.cu
* fix comments in ggml_cuda_ar_kernel
* improve event handling for hostmem buffer usage tracking
* change ev_pool to fixed 2D array
* add chunked memcpy fallback for extra-large reductions (>32 MB)
* change thresholds for copy-engine path and bf16 demotion
* multi-block kernel test
* more fine-tuning for chukn-size, etc.
* various fixes for PR review
* more PR fixes
* fix semantics of all host mappings
* require ampere+
* small cleanups
* properly use host pointer for src/dst in cudaMemcpy calls
* allreduce: lazy-init the internal pipeline on first use
A config that lives entirely on NCCL never needs the chunked-kernel
pipeline (host_buf, host_large, dev_tmp, streams, events, arrival ring).
Defer pipeline creation to the first try_allreduce_internal call using the
same std::call_once pattern as ensure_nccl, so those resources stay
unallocated when only NCCL is in use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: assert n_backends == 2 instead of soft-fallback
ar_pipeline_init already requires n_devices == 2 and bails before any AR can
get here, so by the time we reach try_allreduce_internal we know we have
exactly two backends. Replace the runtime-debug-log fallback with a hard
assert.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NCCL is in use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* rework reduction provider selection. internal/nccl is OS dependent; most fallbacks are removed
* remove unneeded Turing arch check (llama.cpp doesn't even compile pre-Turing anyway)
* allreduce: ASCII-only comments and ggml_cuda_cast for value conversions
Replace non-ASCII characters in comments (em dashes, right arrows) with
ASCII equivalents (--, ->) so the source stays in the ggml/upstream norm.
In the kernel-side code, replace static_cast<Twire>/static_cast<Tdst>
with ggml_cuda_cast<...> so the BF16 conversions go through the fast
__float2bfloat16 / __bfloat162float intrinsics from convert.cuh. Pure
pointer and integer casts stay as static_cast.
Also drops two stray garbage tokens that snuck in from earlier merges
(a duplicated 'return ok; }' tail in allreduce.cu and a leftover '_reg)'
fragment in ggml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: use ggml_cuda_memcpy_1 for the chunked-kernel vector copies
The chunked kernel's two 16-byte register<->host transfers (Phase 1 store
and Phase 3 load) used reinterpret_cast<float4 *> on both sides. Replace
with ggml_cuda_memcpy_1<sizeof(wire)>, which is the canonical helper for
this pattern and emits the same int4 LD/ST under the hood.
Conformance passes; 5x reruns of 70b internal pp512 show 1832-1836 t/s,
matching the prior matrix value of 1831 t/s -- no perf change as expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ok; }' tail in allreduce.cu and a leftover '_reg)'
fragment in ggml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: assert cuda_ctx->device matches the pipeline's device
Both ggml_cuda_ar_pipeline and ggml_backend_cuda_context carry the device
they were created for; if they ever disagree, every cuda call that follows
runs on the wrong device. Add GGML_ASSERT at each cuda_ctx retrieval site
in the AR path so the misuse fails fast rather than silently corrupting.
Also: rename __nv_bfloat16 -> nv_bfloat16 (typedef alias) for consistency
with the rest of the file, and tighten one cudaGetLastError check to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: expand one-liner for loops to braced bodies
Code-style preference -- match the rest of the file by writing every for
loop with the body on its own braced line. Three sites in the copy-engine
typed dispatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
in the AR path so the misuse fails fast rather than silently corrupting.
Also: rename __nv_bfloat16 -> nv_bfloat16 (typedef alias) for consistency
with the rest of the file, and tighten one cudaGetLastError check to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: rename template parameters Tdst/Twire/Tsrc -> T_dst/T_wire/T_src
Code-style preference per PR review -- T_dst/T_wire/T_src is more
consistent with surrounding code. Whole-word rename across all 58 sites
in allreduce.cu (kernel definitions, internal uses, and comment text).
Realigned the parameter columns in three function signatures whose
T_src/T_dst lines shifted by 1 char relative to their non-templated
neighbors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: drop hyphen in 'chunked-kernel' across comments
Per PR review feedback -- 'chunked kernel' (no hyphen) reads more naturally
in running prose, especially for ESL readers. Pure comment-only change;
all 10 occurrences in allreduce.cu updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
three function signatures whose
T_src/T_dst lines shifted by 1 char relative to their non-templated
neighbors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: use ggml_cuda_get_max_cpy_bytes() instead of hardcoded 16
The chunked kernel hardcoded a 16-byte vector unit; replace with the
ggml_cuda_get_max_cpy_bytes() helper that fattn-common.cuh uses for the
same purpose, so ELEMS_PER_VEC self-adjusts to the arch's widest
single-instruction copy.
Perf-neutral on supported targets (Volta+ returns 16).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hbors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ggml-cuda: PR review fixes -- annotate #endif, fix stale comment, assert nbytes alignment
Three separate but minor changes from PR #22299 review feedback:
1. Annotate the five GGML_USE_NCCL #endif lines with the matching condition
so the pairing is visible without scrolling back.
2. The comment block on ggml_backend_cuda_comm_context claimed NCCL is
lazy-initialised; that was true at one point but the dispatch refactor
(727b141c0) made both NCCL and the internal pipeline eager. Rewrite
the comment to match current behaviour.
3. Assert in ggml_backend_cuda_comm_allreduce_internal that the tensor's
byte size is a 16-byte multiple. The chunked-kernel issues full-width
vector loads/stores, so this is a precondition; tensor-parallel splits
of hidden-dim-multiples satisfy it trivially, but a hard assert turns
any caller-side bug into a clear failure rather than UB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
device's new AR
records its ev.ker -- otherwise the second device's wait sees the first
device's just-recorded event (the in-flight new AR) and creates a circular
dependency with the in-kernel peer signal. Two-pass dispatch (all waits,
then all launches) avoids this.
Bump POOL_SIZE 2 -> 8 (small memory cost, more breathing room for the
GPU's view of the event chain) and add a runtime env override for the
hybrid kernel chunk size (GGML_CUDA_AR_HYBRID_CHUNK_BYTES) for tuning.
One-shot stderr diagnostic at first AR prints the chosen path + sizing.
Result on 2x RTX 5090 Linux, 70b ub_sweep:
ub=64 (1 MB AR): 913 -> 1036 t/s (+13.5% vs old, +1.8% vs NCCL)
ub=128 (2 MB AR): 1056 -> 1181 (+11.9%, +3.7% vs NCCL)
ub=256 (4 MB AR): 1212 -> 1424 (+17.5%, +3.5% vs NCCL)
Internal now beats NCCL at every size (+1.8% to +15.6%), recovering all
ground in the 1-4 MB regime that was previously a 10-12% loss.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* simplify the init logic
* address some other PR requests
* ggml-cuda: stub internal AllReduce on HIP/MUSA, drop pre-Ampere mention, gate NCCL fallback warning on !HIP
The internal AllReduce relies on cudaHostAllocPortable/Mapped,
cudaHostGetDevicePointer, and __nanosleep -- none of which the HIP or
MUSA shims expose -- so wrap the implementation in
!defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) and provide
nullptr/no-op/false stubs in the #else branch. The dispatcher already
treats a null pipeline as init failure and silently falls back to the
meta backend's generic AllReduce, so HIP/MUSA builds compile clean and
behave correctly without further call-site changes.
PR review follow-ups:
- drop "or pre-Ampere?" from the internal-init failure warning -- the
kernel doesn't require Ampere or newer.
- guard the "NCCL not compiled in" fallback warning behind
!defined(GGML_USE_HIP); the suggestion to install NCCL only makes
sense on NVIDIA builds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hind, now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: guard __nanosleep on Volta+ and reject pre-Volta devices at init
__nanosleep is the only Volta-specific intrinsic in the kernel; wrap it
in #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA / NO_DEVICE_CODE so the file
still compiles cleanly when targeting older arches (the dispatcher's
init check below ensures the kernel is never actually launched on
pre-Volta).
Add a per-device compute-capability check in pipeline_init that returns
nullptr if any device is below sm70. The dispatcher already treats
nullptr as init failure and silently falls back to the meta backend's
generic AllReduce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rom the internal-init failure warning -- the
kernel doesn't require Ampere or newer.
- guard the "NCCL not compiled in" fallback warning behind
!defined(GGML_USE_HIP); the suggestion to install NCCL only makes
sense on NVIDIA builds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hind, now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: fix CI -Werror warnings (sign-compare, format, restrict alias, maybe-uninitialized)
The CUDA CI builds with -Werror -Wsign-compare -Wformat -Wrestrict
-Wmaybe-uninitialized. Address each:
- n_devices is size_t; change `int i; i < n_devices` to size_t in the
three init loops, and the matching GGML_LOG_INFO format from %d to %zu.
- ggml_cuda_ar_kernel was launched with sendbuf == recvbuf (in-place
reduction), so the __restrict__ qualifiers on those parameters were
technically UB. Drop __restrict__ from sendbuf and recvbuf; an A/B
sweep showed <0.6% perf delta (within noise) on Linux.
- The buf/src/dst pointer arrays in ggml_cuda_ar_allreduce and the
per-iteration arrays in ggml_cuda_ar_allreduce_copy_outer were
declared with size GGML_CUDA_MAX_DEVICES but the loop only writes
indices [0, n_devices); zero-initialise so the compiler sees the
tail elements as defined.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ggml-cuda: drop unused-function warning by guarding try_allreduce_nccl behind GGML_USE_NCCL
The only call site (in init_nccl) is already inside #ifdef GGML_USE_NCCL,
so the function is unreferenced in non-NCCL builds and trips
nvcc's -Werror=unused-function check. Move the guard from inside the
function body to around the entire definition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce
reduction), so the __restrict__ qualifiers on those parameters were
technically UB. Drop __restrict__ from sendbuf and recvbuf; an A/B
sweep showed <0.6% perf delta (within noise) on Linux.
- The buf/src/dst pointer arrays in ggml_cuda_ar_allreduce and the
per-iteration arrays in ggml_cuda_ar_allreduce_copy_outer were
declared with size GGML_CUDA_MAX_DEVICES but the loop only writes
indices [0, n_devices); zero-initialise so the compiler sees the
tail elements as defined.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* SYCL: reduce allocation overhead during flash attention
* tidy up whitespace
* add a note about the flag
* move ggml_sycl_fattn_* into fattn-buffers.hpp
* refactor implementation into fattn-buffers.cpp
* move new_fattn_kv_buffers back into ggml-sycl.cpp
Add GGML_TYPE_BF16 to the SYCL backend's GET_ROWS operation, both in
supports_op and in the kernel dispatch. This fixes a performance
regression where models using BF16 embedding tensors (e.g., Gemma4's
per_layer_token_embd.weight) fall back to CPU for the GET_ROWS op,
causing a full GPU-to-CPU tensor transfer every token.
The fix reuses the existing get_rows_sycl_float template with
sycl::ext::oneapi::bfloat16, matching the pattern already used for
sycl::half (F16) and float (F32).
2026-05-09 08:50:24 +03:00
Intel AI Get-to Market Customer Success and Solutions
* sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations
Signed-off-by: Chun Tao <chun.tao@intel.com>
* Remove unneeded/unnecessary comments and annotations
The MMQ subgroup annotations added are on functions gated behind
ggml_sycl_supports_mmq(). Revisit the need for these annotations
when that function changes.
---------
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
* mimo-v2.5: add flash attention mma/tiles for for d_kq=192 d_v=128
* mimo-v2.5: follow (256, 256) fattn templates
* mimo-v2.5: cleanup comments
* mimo-v2.5: further comment cleanup
* mimo-v2.5: address PR feedback
fix GQA handling
check for other dangling 320/576 carveouts and mirror them for 192
Add to backend ops test so new paths are covered
Implement the Gated Delta Net recurrence on HVX with:
- 4-row fused kernels for PP (prompt processing) path
- 8-row fused kernels for TG (token generation) path, reducing
K/Q/gate vector reload overhead by 2x
- Separate PP/TG thread functions for I-cache isolation
- VTCM state scratchpad with DMA in/out for TG single-cycle access
- Vectorized gate exp via hvx_exp_f32
2026-05-08 17:12:04 -07:00
Intel AI Get-to Market Customer Success and Solutions
* L2_NORM Updates
* Addressed PR Comments
* ggml-hexagon: add L2_NORM HVX kernel for Hexagon backend
* hex-unary: remove supported_unary_nc since the outer loop is the same for all unary ops
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* server: support Vertex AI compatible API
* a bit safer
* support other AIP_* env var
* various fixes
* if AIP_MODE is unset, do nothing
* fix test case
* fix windows build
* cuda: fuse snake activation (mul, sin, sqr, mul, add)
Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.
Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.
* cuda: address review feedback from @am17an
Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.
* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an
* Update tests/test-backend-ops.cpp
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cuda: snake fusion check add->type matches x->type
Address review feedback from @am17an
* cuda: snake fusion check add->type matches x->type
Moved for readability (equivalent)
Address review feedback from @am17an
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Q4_0 MoE CLC pass sanity check
* release program
* opencl: fix whitespace
* opencl: remove unused cl_program
* opencl: break #if block to make it more clear
* opencl: adjust format
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* convert : fix RuntimeError when stripping FP8 KV-cache scales
In ModelBase._generate_nvfp4_tensors the final cleanup loop iterates
self.model_tensors.keys() and calls del on the same dict, which raises
RuntimeError: dictionary changed size during iteration when a ModelOpt
NVFP4 model also has FP8 KV-cache scales (e.g. mmangkad/Qwen3.6-35B-A3B-NVFP4
and any modelopt config with kv_cache_quant_algo: FP8).
Wrap the keys view in list() so the deletions happen on a snapshot.
* re-add another accidentally removed list
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched
* CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends
* webui: add LLM title generation option
* webui: use chat_template_kwargs for title gen + fix conversation check
* webui: capture firstUserMessage before async streamChatCompletion to fix race condition
* webui: extract LLM title generation into separate method
* webui: use constants and ChatService for LLM generated titles
* webui: rebuild static output
* webui: add LLM title generation setting to new settings location
* webui: use sendMessage in generateTitle
* webui: rebuild static output
* webui: fix formatting
* webui: configurable title prompt, remove think tag regexes, fix TS error
* webui: group title constants into TITLE object, use TruncatedText for CSS truncation and fix race condition
* webui: rebuild static output
* Write a readme on Multi-GPU usage in llama.cpp
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad
substitution
was thrown whenever the user used this command:
./examples/sycl/test.sh -mg 0
Fix is to get rid of a dollar sign.
* common: do not fit to unknown device memory
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: preserve host fallback for non-GPU fit devices
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: keep unknown GPU fit memory at zero
Signed-off-by: Florian Reinle <f.reinle@otec.de>
---------
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* feat: migrate to PEP 621 and add uv support
* fix: remove upper bound on protobuf
* remove poetry.lock and uv.lock
* fix/add torch dependency version and markers
* fix dev-dependency deprecation warning
* gguf-py : update python version requirement to 3.10
---------
Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* convert : ignore non-language tensors for Gemma4Model
This commit adds a check to make sure only text language tensors are
handled in filter_tensors.
The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight, torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
main()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
model_instance.write()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
self.prepare_tensors()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
yield from super().modify_tensors(data_torch, name, bid)
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
yield from super().modify_tensors(data_torch, name, bid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
new_name = self.map_tensor_name(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```
* add forgotten embed_vision and embed_audio
* improve
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* hex-mm: process m-tail rows on HMX instead of HVX
* hmx-mm: unroll and optimize padded activation loop
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3719 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3726 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3733 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3740 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3747 | [](common_params & /*params*/, int /*value*/) {
| ^
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Previously, unknown tool names passed via --tools were silently ignored.
Now the server validates each tool name at startup and exits with an
error if an unrecognized tool is specified, listing the available tools.
Assisted-by: llama.cpp:local pi
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.
* Trim whitespace on apply instead
* docs : update speculative decoding parameters after refactor (#22397)
Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397:
- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-*- parameters
- Update all example commands
Assisted-by: llama.cpp:local pi
* pi : add rule to use gh CLI for GitHub resources
Assisted-by: llama.cpp:local pi
* docs : run llama-gen-docs
* arg : fix typo
* shader(norm): add layer norm ops
* shader(norm): stablize floating point computation with Kahan summation and handle mixed types
* shader(norm): remove the non-contiguous strides
* shader(norm): use the original implementation rather than the kahan summation
Llama-architecture q_proj/k_proj weights need an axis-0 row permutation
to match GGML's RoPE convention. The BF16 path applies this in
LlamaModel.modify_tensors via LlamaModel.permute, but the NVFP4 path
bypasses modify_tensors and writes weights directly through
ModelBase._repack_nvfp4. Without the permutation, attention heads end
up scrambled at inference and the model produces gibberish.
This change overrides _repack_nvfp4 on LlamaModel and applies the same
permutation to both the nibble-packed weight and the per-block scale
before delegating to ModelBase._repack_nvfp4 via super(). Reuses the
existing LlamaModel.permute static helper and respects the existing
undo_permute flag, so subclasses (Mistral, Granite, Llama4, etc.)
inherit the fix automatically.
Verified on TinyLlama-1.1B reproducer: perplexity drops from 4419
(gibberish) to 43.9, matching the BF16-dequantized baseline (44.0).
Also verified end-to-end on ALIA-40b-instruct-2601 (BSC, Llama
architecture) with multilingual generation in Spanish/Catalan/Basque/
Galician all coherent with the fix applied.
Co-authored-by: Chema <chema@montevive.ai>
* hmx: extract shared interleave headers and unify matmul batched
* hmx: add HMX-accelerated flash attention for prefill
* hmx: replace asm wrappers with Q6_ intrinsics in hmx-utils.h
Switches three single-instruction helpers from inline asm to the matching
Q6_ intrinsics, matching the style established by aizip f8737609a and used
by the upstream PR #21554 hmx-matmul-ops.c rewrite:
hmx_set_output_scales asm "bias=mxmem2" -> Q6_bias_mxmem2_A
hmx_load_tile_pair_fp16 asm packet -> Q6_activation_hf_mxmem_RR
+ Q6_weight_hf_mxmem_RR
hmx_consume_accumulator_fp16 asm "mxmem=acc" -> Q6_mxmem_AR_after_hf
hmx_load_tiles_fp16 stays on inline asm: it uses ":deep" activation
streaming, and the mixed Q6_activation_hf_mxmem_RR_deep + non-deep
Q6_weight_hf_mxmem_RR pair fails the HMX backend constraint check
("activate weight pair (1) exceeds limit (1)"). The asm bundle keeps
both halves in one VLIW packet and avoids the diagnostic.
Functionally equivalent — same instructions emitted; the Q6_ intrinsics
just give the compiler more visibility for scheduling.
* hmx: drop the duplicate interleave_fp16_weight_chunk_to_tiles
* hmx: apply upstream optimization to hmx-flash-attn-ops.c
apply restrict, __builtin_assume, and pointer accumulation to the three HMX workers (qk_dot, o_update, o_norm) and the matching inline HMX loops in op_hmx_flash_attn_ext.
* hmx: unify interleave helper
* hmx: multi-thread Q load / O store and enable prefill FA dispatch
Extract inline Q-load and O-store loops into worker_pool-parallel helpers
(fa_phase_q_load, fa_phase_o_store) so HVX threads split the F32↔F16
conversion work across row ranges. Also relax the softmax threading
gate from n_row_vec_cnt >= n_threads to >= 2, which was unnecessarily
forcing single-thread fallback when n_rows_g < 512.
On the dispatch side, remove the ne[2] != 1 guard that blocked multi-head
(prefill) FA from reaching the HTP backend — GQA is already handled
internally by both the HMX and HVX flash-attention paths.
* hmx: relax matmul pipeline gate to cover k > n shapes (e.g. FFN_down)
* hmx: optimize FA softmax mask phase (no-ALiBi fast path + GQA dedup)
* hmx: Add an asm memory clobber at the phase boundary to prevent reorder bug
* [experimental]: fp16 softmax (EXP2_HF) to accelerate fa
Bake log2(e) into qk_scale and use hvx_exp2_hf directly for P and m_diff
(base-2 consistent, matches htp-ops-lib). ~22 ALU ops for 64 lanes vs
~44 for the F32 round-trip path.
* hmx flash-attn: refine cost model coefficients based on profiling data
* hmx flash-attn: replace asm clobber with targeted volatile reads on vtcm_d_tiles
* hmx flash-attn: fix prefill correctness (dst indexing, softmax reduce, V stride)
* hmx flash-attn: fix p_tiles dual-tile OOB race; enable MT + pipeline
* hmx flash-attn: preserve additive mask bias in no-ALiBi fast path
The no-ALiBi fast path (max_bias==0) was skipping mask add entirely on
the assumption that mask values are only {0, -inf}. This is wrong when
the mask carries additive positional bias — those terms were silently
dropped. Keep the slope-mul skip (slope≡1.0) but add mask back so the
bias survives; vmux still clamps below -16 to -inf.
Also add HMX FA coverage to test-backend-ops: prefill shapes (nb=64,
nb=32) × {mask on/off} × {ALiBi on/off} × {softcap on/off}, F16 KV,
hs ∈ {64, 128}.
* hmx: fix softcap+EXP2_HF interaction, tighten matmul pipeline gate, add FA tests
- flash-attn: when EXP2_HF is on AND logit_softcap is active, fold
log2(e) into the post-tanh multiplier (v_cap) instead of pre-baking
it into qk_scale. Pre-baking shifted the tanh knee from x≈c to
x≈c/log2(e) and produced numerically wrong softcapped outputs
whenever both knobs were enabled.
- flash-attn softmax (fa_softmax_thread): replace the union+memcpy
scalar extract pattern with HVX vmux-based per-row accumulators on
rowmax/rowsum. Add hvx_vec_get_f16 helper in hvx-base.h. Functional
parity, less scalar code, clearer hf/qf16 lane-format contract.
- matmul (hmx_mat_mul_permuted_qk_0_d16a32): pick pipeline vs sequential
layout based on whether the chunker actually yields >=2 n-chunks,
instead of the static (m>=128 && n>=256) gate. Avoids paying for
output double-buffer + worker dispatch when there is no HMX/HVX
overlap to gain (e.g. shapes that collapse to one n-chunk).
- tests: add HMX flash-attention coverage over the
{mask, ALiBi (max_bias), logit_softcap} cross-product for the prefill
path — head_dim 64/128, GQA 4×4, kv=512/nb=64 plus a kv=113/nb=32
non-aligned case.
* [Help Wanted]: refactor D matrix computation into separate function for clarity and maintainability
* format code
* hexagon: looks like -O3 is causing issues with the large code base, switch to -O2 and -flto instead
* hexagon: use hex_ prefix for swap_ptr
* hexagon: move vtcm_seq_alloc into vtcm-utils.h
More vtcm allocator updates are coming so it makes sense to start the separate hdr for it.
* hmx-utils: add hmx_prefix for layout converters
* hmx-mm: move main hmx_mm functions to the end, remove unused fwd decls, etc
* hmx-mm: remove unused qweight_fetch_task_state_t and minor alignment fixes
* hmx-fa: minor alignment fixes
* hmx-fa: move hmx_flash_atten into hmx-ops.h
* hmx-fa: remove redundant workpool pointer in the hmx_fa_ctx, plus minor alignment updates
* hmx-fa: minor alignment and simplifications
* hexagon: move FA_EXP_F16 option to hostside CMake file
* hmx-fa: use hvx_vec_splat_f16 instead of fp16_to_bits
* hmx-fa: add hvx_splat_u16/u8 and use that in the fa instead custom hvx_fill
* hmx-fa: some more alignment updates in the core fa function
* hmx-fa: keep slopes in vtcm in fp16
Saves malloc/free and removes the need for float -> fp16 downcast on every use.
* hexagon: consistent noinline usage (after static)
* hex-hmx: consistent use FARF_HIGH to enable debug output
* hmx-utils: no need for always_inline attr
* hex-hmx: consistent noinline usage (static noinline ...)
* hex-hmx: simplify init_col_scales
* hexagon: fix editorconfig errors
* hmx-mm: minor alignment fixes
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
fix#22544 (my fault!)
Credit to @Anai-Guo, ref #22559 - since that one was closed due to the
new contributor policy I am taking the liberty of re-submitting that PR
here.
* vulkan: Support asymmetric FA in coopmat2 path
There has been some recent interest/experimentation with mixed quantization
types for FA. I had originally designed the cm2 FA shader with this in mind
(because I didn't realize it wasn't supported at the time!), this change
adds the missing pieces and enables it.
Also support Q1_0 since people have been trying that out (seems crazy, but
who knows).
We should be able to do similar things in the coopmat1/scalar path, but
there's another change open against the scalar path and I don't want to
conflict.
* reorder cases
* Add mat-vec fast path of MUL_MAT_ID.
* Add shared accumulation vec logic and the other types supports.
* Add i-quant mat-mat for MUL_MAT_ID and fix some parts
* Remove n_experts from shader_lib_context.
* scripts : add wc2wt.sh - create worktree from current HEAD
Add a script to create a git worktree on a new branch from the current
HEAD. Similar to pr2wt.sh but for local development branches instead of
PRs.
Usage:
./scripts/wc2wt.sh gg/new-feature
./scripts/wc2wt.sh gg/new-feature "bash -l"
Assisted-by: llama.cpp:local pi
* cont : no need to try to delete the branch
* port #22358 PR to examples/speculative/speculative.cpp
* use vocab_[tgt,dft] instead of ctx_[tgt,dft] when logging on draft
model / target model vocabulary mismatch
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
* hexagon: allow host to set max vmem size
We use a sane default but it's helpful to allow for an override if needed.
* hexagon: add support for measuring vmem space and move pinned mmaping management to host
* hexagon: update vmem checks to use uint64
* hexagon: bump op buffers to 16 (matches max mmaps)
* hexagon: bump default vmem to 3.2GB
* hexagon: add support for autodetecting vmem space and some logging cleanup in that area
* hexagon: fix whitespace warnings
* Update scripts/snapdragon/adb/run-cli.sh
Co-authored-by: Pascal <admin@serveurperso.com>
* hex-adb: fix run-completion script
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME
When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains
inline asm for the vmadot family which requires the xsmtvdotii custom
extension.(problem can see in some blogs and make sure in K3 platform)
The current CMakeLists does not include xsmtvdotii, so any toolchain
that honours the explicit -march (tested with SpacemiT GCC 15.2) fails
at the assembler stage:
Error: unrecognized opcode `vmadot v16,v14,v0',
extension `xsmtvdotii' required
Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is
enabled so the IME path can actually build with a capable toolchain.
No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off.
toolchain from https://www.spacemit.com/community/resources-download/Tools
* Update ggml/src/ggml-cpu/CMakeLists.txt
Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
---------
Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
* Changed to leak logger singleton to prevent hanging on Windows
* Fix comment
* Stopped using static vector
Using std::vector will cause g_col to be released before the logger thread exits, causing the logger thread to touch freed memory causing a crash
* Change so all logs are output before exit
* Added debug logging
* added more logging
* Added logging
* Explicitly free logger to avoid hanging on Win
* Reverted to leak logger instance again
* Removed debug log and fixed comment
* Fixed comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel
* Change arrays to static const in repack.cpp
---------
Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>
* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32)
Adds MMA-f16 and tile kernel configs, dispatch logic, template instances,
and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to
ncols2=32 to support GQA ratio 32 only.
* Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32
* Apply suggestions from code review
Address review comments
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256
* Fixed bug with sinks=1, with ncols=32, there are two warp-groups created but sinks index is same(0,...,15) for both the groups hence with sinks=1, output is not matching with CPU output. Added sink_base which will be base index for each warp_group (threadIdx.y / np)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response.
Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).
Some SPIR-V compilers (notably mesa) don't handle the current
vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well.
While reading three `u8`s from the 12-byte scale array should (at
least on some hardware) result in loading the full 12 bytes in a
single LOAD followed by whatever extraction is needed, at least
the ANV Intel driver really can't practically perform this
optimization.
`mesa`'s unsigned upper bound logic doesn't handle tracking bounds
through ternary, resulting in the `(is < 4) ? ... : is - 4` having
an infinite upper bound (as it cannot prove `is - 4` doesn't
underflow). While this could still be rectified if mesa looked at
the array bounds, it currently doesn't and `glslc` currently emits
SPIR-V that doesn't allow for this optimization anyway (though
maybe it will at some point, see
https://github.com/KhronosGroup/glslang/issues/4206).
In mul_mat_vecq we took a different approach to loading the same
fields. We read the first two bytes we needed from `scale` then
took a branch before deciding whether we needed to read a third
byte. In mesa this did, indeed, lead to a top-level branch with
conditional loads. As such these loads ended up not being
coalesced either (at least in the ANV driver) resulting in
additional instructions in our hot loop.
Instead, here, we go ahead and force loading the full 12 bytes and
extract the bits we need from the packed-u32s instead. In mul_mat
there's a few less ternaries and only one extra shift, so even on
drivers that did optimize the previous loads properly the only
material change should be pulling a few extra bytes into registers
(which on most hardware won't cost anything anyway, though
ironically on Intel it theoretically could). In mul_mat_vecq this
requires a bit of extra math and may read bytes from the u32 that
weren't needed, but it seems likely avoiding the branch is a win
on most platforms.
On Intel Xe2/mesa 26.0.4 with the optimizations from
https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162,
for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l:
* Instruction Count: 2753 -> 2688
* SEND Count: 269 -> 261
* Cycle Count: 273976 -> 266138
* Max live registers: 248 -> 246
* Non SSA regs after NIR: 381 -> 382
for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l:
* Instruction Count: 2767 -> 2702
* SEND Count: 271 -> 263
* Cycle Count: 274140 -> 268144
* Max live registers: 248 -> 246
* Non SSA regs after NIR: 381 -> 382
for shader mul_mat_vec_id_q4_k_q8_1_f32:
* Instruction Count: 1930 -> 1646
* SEND Count: 116 -> 71
* Cycle Count: 1348306 -> 843350
* Max live registers: 78 -> 84
* Non SSA regs after NIR: 300 -> 135
for shader mul_mat_vec_id_q5_k_q8_1_f32:
* Instruction Count: 2207 -> 1922
* SEND Count: 131 -> 86
* Cycle Count: 1392012 -> 1037836
* Max live registers: 90 -> 90
* Non SSA regs after NIR: 300 -> 135
for shader mul_mat_vec_q4_k_q8_1_f32:
* Instruction Count: 2029 -> 1749
* SEND Count: 111 -> 66
* Cycle Count: 1347278 -> 840118
* Max live registers: 74 -> 80
* Non SSA regs after NIR: 299 -> 134
for shader mul_mat_vec_q5_k_q8_1_f32:
* Instruction Count: 2307 -> 2022
* SEND Count: 126 -> 81
* Cycle Count: 1379820 -> 954042
* Max live registers: 86 -> 86
* Non SSA regs after NIR: 299 -> 134
On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL:
* pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%)
* pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%)
* tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%)
On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S:
* pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%)
* pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%)
* tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%)
On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with
-sm layer (note that -sm tensor improvements will naturally be
less):
* pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%)
* pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%)
* tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)
* ggml : revert to -lm linking instead of find_library
`find_library(MATH_LIBRARY m)` was introduced recently, but it breaks
CUDA compilation with GGML_STATIC. I could not find any valid use case
where we would prefer `find_library` over the standard `-lm` approach.
This commit is also meant to start a discussion if there is a valid
reason to keep `find_library(MATH_LIBRARY m)`, we should clarify what
problem it was solving and find an alternative fix that does not break
CUDA with GGML_STATIC.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : use MATH_LIBRARY only if defined
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : fix initial broken condition
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : always respect MATH_LIBRARY when defined
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
New operators:
- GGML_OP_SET: implement via aclnnInplaceCopy on target region
- GGML_OP_CUMSUM: implement via aclnnCumsum
- GGML_OP_FILL: implement via aclnnInplaceFillScalar
- GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
- GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via
aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
- GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
- GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus
Optimizations:
- GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu /
aclnnGeGluV3 when applicable; fallback conditions now checked inside
each function rather than at the call site
- CROSS_ENTROPY_LOSS: replace 5-kernel sequence (LogSoftmax→Mul→
ReduceSum×2→Muls) with single aclnnSoftmaxCrossEntropyWithLogits call
- L2_NORM: fix in-place ClampMin on norm result (was clamping wrong
tensor); add eps clamping before division to avoid divide-by-zero
- PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call
ReflectionPad1d once on the full 4-D view; remove redundant nb copies
- GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor
helper into gather_batched lambda with batch loop inlined
- SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice;
refactor helper into scatter_batched lambda with batch loop inlined
- OUT_PROD: replace O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with
per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast
batch dims where ne02/ne03 may differ from ne2/ne3
- backend memset_tensor: implement via aclrtMemset (was NULL)
Bug fixes:
- COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary
buffer instead of InplaceEqTensor, avoiding corruption of src0
- ACL graph cache (USE_ACL_GRAPH): restore node_type and src_type[]
fields in ggml_graph_node_properties; has_matching_properties() was
missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to
incorrectly share cached graphs and produce wrong results (ERR≈679)
- graph cache op_params matching: compare full GGML_MAX_OP_PARAMS
bytes so that ops differing only in parameters are not incorrectly
replayed from cache
* This commit enables the router to forward form-data to model server.
Fixes#22044 (enabling to use the /v1/audio/transcriptions in router mode)
* * Applied the suggestion from Copilots first comment: using the non-throwing json::parse overload.
* Addressed Copilots third comment by extending the files representation to also include filename and content-type
* Addressed Copilots fourth comment by making the RNG thread_local
* Changed variable body from std::string to std::ostringstream in build_multipart_body
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127099053
* Added sanitize_field lambda in build_multipart_body for key, filename and content_type
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127104647
* explicitly checking if value/item is string before calling value/item.get<std::string>()
as requested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127111279
* Added double quote to the sanitize lambda and throw on json parse failure
---------
Co-authored-by: Ralph Paßgang <ralph@trust-it.de>
* common: refactor common/debug to move abort_on_nan into base_callback_data
Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO.
It should just be a member of the base_callback_data instead.
* cont : cleanup
* common : use pimpl in debug.h to reduce header dependencies
Move common_debug_cb_user_data's data members (std::regex,
std::vector<uint8_t>) into a private impl struct in debug.cpp.
This removes the includes of common.h and <regex> from debug.h,
reducing transitive dependencies for any translation unit that
includes the header.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.
* Optimize Metal Tensor API usage for matmul2d
Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR.
The legacy simdgroup_matrix kernel is preserved under #else.
Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.
* cont : cleanup
* cont : cleanup
* cont : cleanup
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.
In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
* opt arc770 for Q4_0
* add for Q4_0
* update the script
* add help script for windows
* update guide
* fix format issue
* convert from dos to unix for format issue
* fix missed -sm parameter
* switch ubuntu-latest to ubuntu-slim
* Fix the path for upload so CI doesn't fail
* Update .github/workflows/build-and-test-snapdragon.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use -slim image for key check and consistent naming for artifact dir
Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* Remove check-secret extra job
* move QDC key check for Run QDC jobs step specifically
* add a step before to check the secret for qdc jobs
---------
Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml-webgpu: add tile flash attention fallback
* ggml-webgpu: add new fields and discard usage of mnk for tile version
* ggml-webgpu: modify the vec path to discard the mnk parameter
* ggml-webgpu: enable flash attention vec and tile version for broswer
* ggml-webgpu: stagging KV for flash attention tile version
* formatting
* turn on subgroup uniformity check
* remove Q_TILE as it is always 1 for vec path
* make row_max and exp_sum to local register
* make different bindings with same underlying buffer to have the same usage flags
* move path selection into the shader library and have the host consume a single flash-attn decision object.
* turn off skip_validation and address buffer overlapping when nwg==1
* formatting
* merge binding when kv overlap
This change implements the third requested change in issue 20429.
Because defaults.sampling contains the reasoning budget token count and
the reasoning budget message, it's not necessary to assign them to
struct variables.
* hexagon: restore HTP_OPMASK_QUEUE
* hexagon: honor OPMASK_SKIP_COMPUTE in hmx-matmul
* hex-prof: restore op profiling
* hex-prof: enable PMU
* hexagon: simplify and improve op-queuing with full profiling support
Add separate profile descriptors.
* hexagon: remove opsync and rename opmask into opstage
opsync is no longer needed since the profiler is fully async now.
opmask name was confusing and opstage is more accurate.
* hexagon: refactor opbatch queue handling
* hexagon: add iface hooks for enabling profiler from the host
Also move all the PMU setup stuff out of the hex-utils since it's not inteded for normal use.
* hexagon: make profiler mode configurable
On older devices getting PMU counters is expensive so it's now optional.
* hexagon: add support for setting profiler pmu events from env
* hexagon: simplify profiler output (no need to print buffs, etc)
* hexagon: simplify pmu counter formating
* hexagon: add a simple profile post-proc tool
* hex-prof: add support for reading logs from stdin
* hexagon: document GGML_HEXAGON_PROFILE
* hex-prof: update default width for dims field
* hex-prof: fix linter warnings and errors
* Update ggml/src/ggml-hexagon/htp/htp-ops.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update scripts/snapdragon/ggml-hexagon-profile.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add the tests that we want to run on external CI
* remove extra files
* Fixes python issues, reove the deadlock on CI
* remove unecessary changes
* use override to ty.toml
* fix pre-commit and try tests with secret in external repo not upstream
* skip if key is unavailable
* Fix feedback
* switch hexagon to snapdragon
* cleanup
* fix secrets
* remove the copyrights at the top of the files
When testing claude code against llama.cpp, I noticed that only
n_past 18577 was used even when context was 60k or more. The log
in llama-server says:
```
slot update_slots: id 3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id 3 | task 10342 | new: ... ; cch= | 1c8b4;
```
I observed that the cch value changed every time. Reading about that,
the x-anthropic-billing-header system message seems to be specially
handled inside of the anthropic api. I could remove it, but there
is a meaningful string sometimes included at the end. So instead,
I just replace the changing cch checksum with fffff.
I'm treating this as an anthropic message body API detail - I think this
is the right way to do this, but by all means please correct me!
It's always 5 hexadecimal characters, but I've written the replacement
defensively in case they change the protocol.
* model-conversion : fix mmproj output file name [no ci]
This commit updates the convert-model.sh script to properly handle
mmproj output files.
The motivation for this that currently the same name as the original
model is used as the mmproj file, which causes the original model to
be overwritten and no mmproj-<model_name>.gguf to be created.
* model-conversion : use MODEL_NAME [no ci]
after recreating the CMake build directory and with `-DGGML_CCACHE=OFF`.
If the compilation succeeds with ccache disabled you should be able to permanently fix the issue
by clearing `~/.cache/ccache` (on Linux).
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
description:Something goes wrong when using a model (in general, not specific to a single llama.cpp module).
description:Something goes wrong when running a model (crashes, garbled outputs, etc.).
title:"Eval bug: "
labels:["bug-unconfirmed","model evaluation"]
body:
@@ -12,6 +12,8 @@ body:
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
The `llama-completion` binary can be used for simple and reproducible model inference.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type:textarea
id:version
attributes:
@@ -98,8 +100,8 @@ body:
label:Relevant log output
description:>
Please copy and paste any relevant log output, including the command that you entered and any generated text.
For very long logs (thousands of lines), preferably upload them as files instead.
On Linux you can redirect console output into a file by appending ` > llama.log 2>&1` to your command.
For very long logs (thousands of lines), please upload them as files instead; the `--log-file` CLI argument can be used for this purpose.
On Linux you can alternatively redirect the console output of any command into a file by appending ` > llama.log 2>&1` to your command.
This issue template is intended for miscellaneous bugs that don't fit into any other category.
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type:textarea
id:version
attributes:
@@ -86,8 +88,8 @@ body:
description:>
If applicable, please copy and paste any relevant log output, including any generated text.
If you are encountering problems specifically with the `llama_params_fit` module, always upload `--verbose` logs as well.
For very long logs (thousands of lines), please upload them as files instead.
On Linux you can redirect console output into a file by appending ` > llama.log 2>&1` to your command.
For very long logs (thousands of lines), please upload them as files instead; the `--log-file` CLI argument can be used for this purpose.
On Linux you can alternatively redirect the console output of any command into a file by appending ` > llama.log 2>&1` to your command.
[Please post your idea first in Discussion if there is not yet a consensus for this enhancement request. This will help to keep this issue tracker focused on enhancements that the community has agreed needs to be implemented.](https://github.com/ggml-org/llama.cpp/discussions/categories/ideas)
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
Don't forget to check for any [duplicate research issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3A%22research+%F0%9F%94%AC%22)
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
Don't forget to [check for existing refactor issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3Arefactoring) in case it's already covered.
Also you may want to check [Pull request refactor label as well](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Aopen+is%3Apr+label%3Arefactoring) for duplicates too.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- provide KL divergence data calculated vs. the FP16/BF16 (whichever is the native precision) version for both the new type as well as types of similar size
- provide [performance data](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) for the new type in comparison to types of similar size on pure CPU
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
- If you are a new contributor, limit your open PRs to 1.
- If you are a new contributor
- Limit your open PRs to 1
- Do not submit trivial fixes (e.g. typos, formatting changes)
After submitting your PR:
- Expect requests for modifications to ensure the code meets llama.cpp's standards for quality and long-term maintainability
LOG_INF("%s: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on\n",__func__);
LOG_INF("%s: fitting params to device memory ...\n",__func__);
LOG_INF("%s: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)\n",__func__);
File diff suppressed because it is too large
Load Diff
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.