8728 Commits

Author SHA1 Message Date
Georgi Gerganov
5e9c635463 metal : add missing mm-id specializations for q1_0 (#21662) 2026-04-09 10:54:00 +03:00
Aleksander Grygier
9949ad08f6 fix: Model Selector choice sync (#21628) 2026-04-09 09:46:27 +02:00
AUTOMATIC1111
3ee9da0e4f server : fix grammar commandline args (#21543)
Co-authored-by: AUTOMATIC <->
2026-04-09 10:16:54 +03:00
Aleksander Grygier
75511a8d7e webui: Add option to pre-encode conversation for faster next turns (#21034) 2026-04-09 09:10:18 +02:00
Akarshan Biswas
b54cb2e3d0 sycl : add flash-attn support for head size 512 (#21654)
* sycl : add flash-attn support for head size 512

This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512.

Changes:
- Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels.
- Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256).
- Removed unused/redundant AMD and RDNA-specific configuration functions in `fattn-tile.hpp`.
- Refactored `ggml_backend_sycl_buffer_init_tensor` to use a switch statement for clearer tensor extra buffer initialization.
- Added necessary template instances for the new 512 head size across various quantization types.

* remove defunct mxfp4 reorder from setting buffer type
2026-04-09 09:36:48 +03:00
Marxist-Leninist
8a65a7a8ee ci: drop v5 all: composition from labeler.yml (#21627)
actions/labeler@v6 removed the `all:` / `any:` composition keys.
The `server/webui` and `server` entries used `all:` to combine
`any-glob-to-any-file` with negated `all-globs-to-all-files`,
which now errors on every PR with:

    Unknown config options were under "changed-files": all

Flatten both entries to a single `any-glob-to-any-file`. PRs
touching both webui and other server files will now receive both
labels instead of only `server/webui`.

Co-authored-by: Marxist-Leninist <noreply@users.noreply.github.com>
2026-04-09 08:20:19 +02:00
Ruben Ortlam
8a132faaa0 vulkan: unify type macros to use Vx instead of _VECx (#21605) 2026-04-09 07:31:51 +02:00
Adrien Gallouët
4293919068 common : skip non-primary GGUF split files when selecting model (#21633)
We should not assume files are listed in order.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8721
2026-04-09 07:28:06 +02:00
Aman Gupta
d12cc3d1ca CUDA: also store node->src->data ptrs for equality check (#21635)
* CUDA: also store node->src->data ptrs for equality check

* address review comments
b8720
2026-04-09 01:01:56 +08:00
RealOrko
2dcb7f74ed fix: free ctx_copy in ggml_opt_free to plug per-training-session leak (#21592)
* fix: free ctx_copy in ggml_opt_free to plug per-training-session leak

ggml_opt_alloc populates opt_ctx->ctx_copy via a free+init pair every
time the allocated graph shape changes. The last ctx_copy from the
final ggml_opt_alloc call survives until ggml_opt_free is invoked,
but ggml_opt_free was only freeing ctx_static and ctx_cpu, never
ctx_copy. Each opt_ctx lifetime therefore leaks the final per-batch
context — ~900 KB for a typical GNN training session in
sindarin-pkg-tensor, surfaced via AddressSanitizer.

ctx_copy is nullptr-initialized and ggml_free() handles NULL safely,
so the new release is guard-free.

* Update ggml/src/ggml-opt.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: realorko <realorko@nowhere.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b8719
2026-04-08 17:40:15 +02:00
Yuri Khrustalev
660600081f server: respect the ignore eos flag (#21203) b8718 2026-04-08 17:12:15 +02:00
Aldehir Rojas
d9a12c82f0 vocab : remove </s> eog token if gemma4 (#21492) b8717 2026-04-08 09:53:06 -05:00
Georgi Gerganov
4a05e0c566 webui : send both backend_sampling == false/true (#18781)
* webui : send both backend_sampling == false/true

* feat: Parameter sync

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-04-08 16:35:52 +02:00
John Eismeier
e9fd96283d Propose fix a couple of typos (#21581)
Signed-off-by: John E <jeis4wpi@outlook.com>
b8715
2026-04-08 16:29:03 +02:00
Erik Scholz
3ba12fed0a kv-cache : extend cache quantization checks (#21586)
to also check for enabled flash attention, instead of just auto.
b8714
2026-04-08 16:08:57 +03:00
Reese Levine
5473949070 webgpu : Query for adapter support when registering WebGPU backend (#21579) b8713 2026-04-08 16:08:29 +03:00
Pasha Khosravi
dcdcbad42a metal: Q1_0 backend (#21528)
* initial Q1_0 Metal backend

* tuning q1_0 metal kernels

* add Q1_0 to test-backend-ops

* add Q1_0<->F32 copy test

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8712
2026-04-08 16:07:47 +03:00
Georgi Gerganov
5764d7c6a6 gemma : perform per-layer projections in the first layer (#21612)
* gemma : reduce graph splits by keeping per-layer ops in the input layer

* gemma : put the per-layer proj in the first layer

* cont : move the projection before the layer loop
b8711
2026-04-08 16:06:30 +03:00
Daniel Bevenius
87f4744a80 examples : disable cb_eval callback for --save-logits (#21553)
This commit updates the debug example to not create the
base_callback_data.

The motivation for this is when using `--save-logits`, which is used by
examples/model-conversion scripts, we often don't care about the tensor
outputs and they just add noise to the output. This changes is quiet by
default we can always remove --save-logits to get the tensor outputs
when debugging.
b8710
2026-04-08 14:10:33 +02:00
Piotr Wilkin (ilintar)
85d482e6b6 parser: fix MiniMax handling (#21573) b8709 2026-04-08 12:47:25 +02:00
Georgi Gerganov
ae65fbdf33 tests : remove obsolete .mjs script (#21615) b8708 2026-04-08 13:20:46 +03:00
Aleksander Grygier
3bd9aa1f92 chore: Update labeler to have separate labels for server/webui and server changes (#21567) 2026-04-08 10:35:31 +02:00
Aleksander Grygier
ece522f98c chore: Remove legacy files (#21606) 2026-04-08 09:55:08 +02:00
forforever73
09343c0198 model : support step3-vl-10b (#21287)
* feat: support step3-vl-10b

* use fused QKV && mapping tensor in tensor_mapping.py

* guard hardcoded params and drop crop metadata

* get understand_projector_stride from global config

* img_u8_resize_bilinear_to_f32 move in step3vl class

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix the \r\n mess

* add width and heads to MmprojModel.set_gguf_parameters

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8705
2026-04-08 09:51:31 +02:00
Hamish M. Blair
97508acb17 webui: fix syntax highlighting lost after streaming for non-common languages (#21206)
* webui: fix syntax highlighting lost for non-common languages after streaming

rehype-highlight uses lowlight internally, which only bundles 37 "common"
languages. The streaming code path uses highlight.js directly (192 languages),
so languages like Haskell highlight correctly while streaming but lose all
color once the code block closes. Pass the full lowlight language set to
rehype-highlight so both paths support the same languages.

* webui: rebuild static files after rebase
2026-04-08 08:58:08 +02:00
Martin Klacer
5c4aae66e1 devops: kleidiai: provide KleidiAI-Enabled ARM Release Artifact (#21259)
* Unified macOS release setup with strategy-matrix block
 * Added KleidiAI arm64 macOS release definition


Change-Id: I05520889ffc646488a178d06817a17f29274465a

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
b8703
2026-04-08 13:06:12 +08:00
Aman Gupta
c5ce4bc227 CUDA: make cuda graphs props check faster (#21472)
* CUDA: compute fast hash instead of expensive props check

* use seen node

* use memcp
b8702
2026-04-08 09:05:51 +08:00
iacopPBK
66c4f9ded0 ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168)
* ds_read_b128 for q4_0 and q4_1 mmq kernels

     Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both.

* Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation

* Explicit for loop in mmq, renamed vec into tmp

* Fixed max_cpy usage in the loading loop

* Fixed typo in q4_1 kernel

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Renoved trailing white line 500

* Update mmq.cuh removed other whitelines

* Remove trailing whitespaces

---------

Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: iacopPBK <iacop@deneb.com>
b8701
2026-04-07 21:47:42 +02:00
Daniel Bevenius
93bdc61563 gguf-py : fix missing comma after bad merge in tensor-mapping (#21558)
This commit adds a missing comma in the vision encoder attention qkv
block.

The motivation for this change is that without the comma there will be
a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL
tensor mappings which will be broken.
2026-04-07 21:24:25 +02:00
Georgi Gerganov
4eb19514dd kv-cache : support attention rotation for heterogeneous iSWA (#21513)
* kv-cache : support attention rotation for heterogeneous iSWA

* cont : remove assert
b8699
2026-04-07 20:31:28 +03:00
Reese Levine
957d717ce5 ggml-webgpu: parameterize submission size and add iOS specific limits (#21533)
* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs

* Start work on removing parameter buffer pools

* Simplify and optimize further

* simplify profile futures

* Fix stride

* Try using a single command buffer per batch

* formatting

* Add parameters for different browsers in-flight submissions

* Update handling of batch size too

* Throttle ios as much as possible

* Increase timeout for llvm-pipe testing
b8698
2026-04-07 20:30:01 +03:00
Aman Gupta
de1aa6fa73 CUDA: check for buffer overlap before fusing (#21566)
* CUDA: check for buffer overlap before fusing

* use ggml_cuda_check_fusion_memory_ranges
b8697
2026-04-08 00:57:04 +08:00
Aaron Teo
69c28f1547 llama-server: fix model params not propagated (#21509)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b8696
2026-04-07 21:39:41 +08:00
Son H. Nguyen
0d049d6a92 unicode : add custom Qwen2 regex handler to fix segfault on long input (#21257)
* unicode : add custom Qwen2 regex handler to fix segfault on long input

std::regex uses recursive backtracking internally, which causes a stack
overflow (segfault) when tokenizing long sequences of repeated characters
(e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in
the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to
the std::regex fallback path instead of using a custom handler.

Add unicode_regex_split_custom_qwen2() following the established pattern
used by gpt2, llama3, kimi_k2, and afmoe custom handlers.

Closes: https://github.com/ggml-org/llama.cpp/issues/21113

* cont : remove TODO comment

* cont : update comment to reflect original regex

* use the correct regex in the comment this time... [no ci]

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-04-07 16:13:38 +03:00
Johannes Gäßler
a8ec0df461 llama: remove per-arch tensor name lists (#21531) b8694 2026-04-07 15:02:03 +02:00
Georgi Gerganov
e8f5082697 server : fix restore for checkpoints with pos_min == 0 (#21510) b8693 2026-04-07 15:29:17 +03:00
Georgi Gerganov
22fc79134e ggml : deprecate GGML_OP_ADD1 (#21363)
* ggml : deprecate GGML_OP_ADD1

* cont : remove tests

* cont : re-enable vulkan check
b8692
2026-04-07 15:28:27 +03:00
Tom Overlund
2a619f6fbc ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868) (#20904) b8691 2026-04-07 13:54:55 +02:00
mkoker
edd4d9bca5 vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (#21029)
Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL
in the flash attention base shader. Register them in the shader
generator, pipeline creation, and enable in the scalar/coopmat1 FA
support check.
b8690
2026-04-07 13:41:29 +02:00
Aldehir Rojas
482192f12d webui : store reasoning_content so it is sent back in subsequent requests (#21249) 2026-04-07 13:32:44 +02:00
Antoine Viallon
71a81f6fcc ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) (#21519)
GGML_CUDA_CC_CDNA2 was set to 0x910
Fix by setting the constant to 0x90a to match the actual gfx90a ISA.
b8688
2026-04-07 12:18:55 +02:00
Aleksander Grygier
ecce0087da fix: Detect streaming state in reasoning content blocks (#21549) 2026-04-07 12:04:41 +02:00
Kabir08
d1f82e382d Fix rtl text rendering (#21382)
* Fix Arabic RTL text rendering in web UI

- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Ensure bidirectional text support for mixed Arabic/English content

* Clean up commented duplicate function

Remove the commented-out duplicate transformMdastNode function
that was left over from refactoring.

* Fix Arabic RTL text rendering in web UI

- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Minor code formatting improvements

This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI.

* Implement rehype plugin for comprehensive RTL text support

- Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children
- Replace DOMParser-based approach with efficient HAST tree processing
- Remove hardcoded element lists for better maintainability
- Ensure proper bidirectional text rendering for mixed RTL/LTR content

* Fix RTL text rendering with rehype plugin and cleanup

* fix: prettier formatting
2026-04-07 11:37:20 +02:00
PMZFX
0988accf82 [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527)
Extend the existing reorder optimization to Q8_0. The reorder
separates scale factors from weight data for coalesced memory
access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing.

On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x)
on Qwen3.5-27B. BW utilization: 21% -> 66%.

The key fix beyond the kernels: Q8_0 was missing from the type
check in ggml_backend_sycl_buffer_init_tensor() that allocates
the extra struct carrying the reorder flag -- so the optimization
was silently skipped.

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.

Fixes: #21517
b8685
2026-04-07 16:12:49 +08:00
Dmytro Romanov
0033f53a07 docs: fix typo in build.md (emdawbwebgpu -> emdawnwebgpu) (#21518) b8684 2026-04-07 12:37:26 +08:00
Masashi Yoshimura
d0a6dfeb28 ggml-webgpu: Add the support of MUL_MAT_ID (#21147)
* Add mul_mat_id support to WebGPU

* Apply suggestion from @reeselevine

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
b8683
2026-04-06 13:08:46 -07:00
Pasha Khosravi
2e1f0a889e ggml: add Q1_0 1-bit quantization support (CPU) (#21273)
* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)

* add generic fallback for x86

* remove Q1_0 (group size 32)

* rename Q1_0_g128 => Q1_0

* fix Q1_0 LlamaFileType Enum

* Fix trailing spaces; add generic fallback for othre backends

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix /r/n spacing + arch-fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8682
2026-04-06 20:55:21 +02:00
Bipin Yadav
506200cf8b cli: fix stripping of \n in multiline input (#21485)
* llama-cli: fix stripping of \n in multiline input

* Change & string to string_view

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix EditorConfig linter error

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8681
2026-04-06 20:54:06 +02:00
Gaurav Garg
15f786e658 [CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159)
* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized and more optimized kernel for cases where nblocks_stream_k is multiple of ntiles_dst.
Make nblocks_stream_k to multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs

* Address review comments

* Address review comments

* Revert variable names to original
b8680
2026-04-06 20:34:29 +02:00
Aman Gupta
94ca829b60 llama-bench: add -fitc and -fitt to arguments (#21304)
* llama-bench: add `-fitc` and `-fitt` to arguments

* update README.md

* address review comments

* update compare-llama-bench.py
b8679
2026-04-06 22:26:02 +08:00