llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-04-09 10:01:54 +02:00

Author	SHA1	Message	Date
Georgi Gerganov	5e9c635463	metal : add missing mm-id specializations for q1_0 (#21662 )	2026-04-09 10:54:00 +03:00
Aleksander Grygier	9949ad08f6	fix: Model Selector choice sync (#21628 )	2026-04-09 09:46:27 +02:00
AUTOMATIC1111	3ee9da0e4f	server : fix grammar commandline args (#21543 ) Co-authored-by: AUTOMATIC <->	2026-04-09 10:16:54 +03:00
Aleksander Grygier	75511a8d7e	webui: Add option to pre-encode conversation for faster next turns (#21034 )	2026-04-09 09:10:18 +02:00
Akarshan Biswas	b54cb2e3d0	sycl : add flash-attn support for head size 512 (#21654 ) * sycl : add flash-attn support for head size 512 This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512. Changes: - Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels. - Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256). - Removed unused/redundant AMD and RDNA-specific configuration functions in `fattn-tile.hpp`. - Refactored `ggml_backend_sycl_buffer_init_tensor` to use a switch statement for clearer tensor extra buffer initialization. - Added necessary template instances for the new 512 head size across various quantization types. * remove defunct mxfp4 reorder from setting buffer type	2026-04-09 09:36:48 +03:00
Marxist-Leninist	8a65a7a8ee	ci: drop v5 `all:` composition from labeler.yml (#21627 ) actions/labeler@v6 removed the `all:` / `any:` composition keys. The `server/webui` and `server` entries used `all:` to combine `any-glob-to-any-file` with negated `all-globs-to-all-files`, which now errors on every PR with: Unknown config options were under "changed-files": all Flatten both entries to a single `any-glob-to-any-file`. PRs touching both webui and other server files will now receive both labels instead of only `server/webui`. Co-authored-by: Marxist-Leninist <noreply@users.noreply.github.com>	2026-04-09 08:20:19 +02:00
Ruben Ortlam	8a132faaa0	vulkan: unify type macros to use Vx instead of _VECx (#21605 )	2026-04-09 07:31:51 +02:00
Adrien Gallouët	4293919068	common : skip non-primary GGUF split files when selecting model (#21633 ) We should not assume files are listed in order. Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8721	2026-04-09 07:28:06 +02:00
Aman Gupta	d12cc3d1ca	CUDA: also store `node->src->data` ptrs for equality check (#21635 ) * CUDA: also store node->src->data ptrs for equality check * address review comments b8720	2026-04-09 01:01:56 +08:00
RealOrko	2dcb7f74ed	fix: free ctx_copy in ggml_opt_free to plug per-training-session leak (#21592 ) * fix: free ctx_copy in ggml_opt_free to plug per-training-session leak ggml_opt_alloc populates opt_ctx->ctx_copy via a free+init pair every time the allocated graph shape changes. The last ctx_copy from the final ggml_opt_alloc call survives until ggml_opt_free is invoked, but ggml_opt_free was only freeing ctx_static and ctx_cpu, never ctx_copy. Each opt_ctx lifetime therefore leaks the final per-batch context — ~900 KB for a typical GNN training session in sindarin-pkg-tensor, surfaced via AddressSanitizer. ctx_copy is nullptr-initialized and ggml_free() handles NULL safely, so the new release is guard-free. * Update ggml/src/ggml-opt.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: realorko <realorko@nowhere.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b8719	2026-04-08 17:40:15 +02:00
Yuri Khrustalev	660600081f	server: respect the ignore eos flag (#21203 ) b8718	2026-04-08 17:12:15 +02:00
Aldehir Rojas	d9a12c82f0	vocab : remove </s> eog token if gemma4 (#21492 ) b8717	2026-04-08 09:53:06 -05:00
Georgi Gerganov	4a05e0c566	webui : send both backend_sampling == false/true (#18781 ) * webui : send both backend_sampling == false/true * feat: Parameter sync --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-04-08 16:35:52 +02:00
John Eismeier	e9fd96283d	Propose fix a couple of typos (#21581 ) Signed-off-by: John E <jeis4wpi@outlook.com> b8715	2026-04-08 16:29:03 +02:00
Erik Scholz	3ba12fed0a	kv-cache : extend cache quantization checks (#21586 ) to also check for enabled flash attention, instead of just auto. b8714	2026-04-08 16:08:57 +03:00
Reese Levine	5473949070	webgpu : Query for adapter support when registering WebGPU backend (#21579 ) b8713	2026-04-08 16:08:29 +03:00
Pasha Khosravi	dcdcbad42a	metal: Q1_0 backend (#21528 ) * initial Q1_0 Metal backend * tuning q1_0 metal kernels * add Q1_0 to test-backend-ops * add Q1_0<->F32 copy test * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b8712	2026-04-08 16:07:47 +03:00
Georgi Gerganov	5764d7c6a6	gemma : perform per-layer projections in the first layer (#21612 ) * gemma : reduce graph splits by keeping per-layer ops in the input layer * gemma : put the per-layer proj in the first layer * cont : move the projection before the layer loop b8711	2026-04-08 16:06:30 +03:00
Daniel Bevenius	87f4744a80	examples : disable cb_eval callback for --save-logits (#21553 ) This commit updates the debug example to not create the base_callback_data. The motivation is that when using `--save-logits`, which is used by the examples/model-conversion scripts, we often don't care about the tensor outputs and they just add noise to the output. With this change the output is quiet by default; we can always remove `--save-logits` to get the tensor outputs when debugging. b8710	2026-04-08 14:10:33 +02:00
Piotr Wilkin (ilintar)	85d482e6b6	parser: fix MiniMax handling (#21573 ) b8709	2026-04-08 12:47:25 +02:00
Georgi Gerganov	ae65fbdf33	tests : remove obsolete .mjs script (#21615 ) b8708	2026-04-08 13:20:46 +03:00
Aleksander Grygier	3bd9aa1f92	chore: Update labeler to have separate labels for `server/webui` and `server` changes (#21567 )	2026-04-08 10:35:31 +02:00
Aleksander Grygier	ece522f98c	chore: Remove legacy files (#21606 )	2026-04-08 09:55:08 +02:00
forforever73	09343c0198	model : support step3-vl-10b (#21287 ) * feat: support step3-vl-10b * use fused QKV && mapping tensor in tensor_mapping.py * guard hardcoded params and drop crop metadata * get understand_projector_stride from global config * img_u8_resize_bilinear_to_f32 move in step3vl class * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix the \r\n mess * add width and heads to MmprojModel.set_gguf_parameters --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8705	2026-04-08 09:51:31 +02:00
Hamish M. Blair	97508acb17	webui: fix syntax highlighting lost after streaming for non-common languages (#21206 ) * webui: fix syntax highlighting lost for non-common languages after streaming rehype-highlight uses lowlight internally, which only bundles 37 "common" languages. The streaming code path uses highlight.js directly (192 languages), so languages like Haskell highlight correctly while streaming but lose all color once the code block closes. Pass the full lowlight language set to rehype-highlight so both paths support the same languages. * webui: rebuild static files after rebase	2026-04-08 08:58:08 +02:00
Martin Klacer	5c4aae66e1	devops: kleidiai: provide KleidiAI-Enabled ARM Release Artifact (#21259 ) * Unified macOS release setup with strategy-matrix block * Added KleidiAI arm64 macOS release definition Change-Id: I05520889ffc646488a178d06817a17f29274465a Signed-off-by: Martin Klacer <martin.klacer@arm.com> b8703	2026-04-08 13:06:12 +08:00
Aman Gupta	c5ce4bc227	CUDA: make cuda graphs props check faster (#21472 ) * CUDA: compute fast hash instead of expensive props check * use seen node * use memcp b8702	2026-04-08 09:05:51 +08:00
iacopPBK	66c4f9ded0	ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168 ) * ds_read_b128 for q4_0 and q4_1 mmq kernels The current for loop generates ds_read_b32 instructions with the HIP compiler; the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, it's faster on both. * Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation * Explicit for loop in mmq, renamed vec into tmp * Fixed max_cpy usage in the loading loop * Fixed typo in q4_1 kernel * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Removed trailing white line 500 * Update mmq.cuh removed other whitelines * Remove trailing whitespaces --------- Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: iacopPBK <iacop@deneb.com> b8701	2026-04-07 21:47:42 +02:00
Daniel Bevenius	93bdc61563	gguf-py : fix missing comma after bad merge in tensor-mapping (#21558 ) This commit adds a missing comma in the vision encoder attention qkv block. The motivation for this change is that without the comma there will be a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL tensor mappings which will be broken.	2026-04-07 21:24:25 +02:00
Georgi Gerganov	4eb19514dd	kv-cache : support attention rotation for heterogeneous iSWA (#21513 ) * kv-cache : support attention rotation for heterogeneous iSWA * cont : remove assert b8699	2026-04-07 20:31:28 +03:00
Reese Levine	957d717ce5	ggml-webgpu: parameterize submission size and add iOS specific limits (#21533 ) * Work towards removing bitcast * Move rest of existing types over * Add timeout back to wait and remove synchronous set_tensor/memset_tensor * move to unpackf16 for wider compatibility * cleanup * Remove deadlock condition in free_bufs * Start work on removing parameter buffer pools * Simplify and optimize further * simplify profile futures * Fix stride * Try using a single command buffer per batch * formatting * Add parameters for different browsers in-flight submissions * Update handling of batch size too * Throttle ios as much as possible * Increase timeout for llvm-pipe testing b8698	2026-04-07 20:30:01 +03:00
Aman Gupta	de1aa6fa73	CUDA: check for buffer overlap before fusing (#21566 ) * CUDA: check for buffer overlap before fusing * use ggml_cuda_check_fusion_memory_ranges b8697	2026-04-08 00:57:04 +08:00
Aaron Teo	69c28f1547	llama-server: fix model params not propagated (#21509 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b8696	2026-04-07 21:39:41 +08:00
Son H. Nguyen	0d049d6a92	unicode : add custom Qwen2 regex handler to fix segfault on long input (#21257 ) * unicode : add custom Qwen2 regex handler to fix segfault on long input std::regex uses recursive backtracking internally, which causes a stack overflow (segfault) when tokenizing long sequences of repeated characters (e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to the std::regex fallback path instead of using a custom handler. Add unicode_regex_split_custom_qwen2() following the established pattern used by gpt2, llama3, kimi_k2, and afmoe custom handlers. Closes: https://github.com/ggml-org/llama.cpp/issues/21113 * cont : remove TODO comment * cont : update comment to reflect original regex * use the correct regex in the comment this time... [no ci] --------- Co-authored-by: Aldehir Rojas <hello@alde.dev>	2026-04-07 16:13:38 +03:00
Johannes Gäßler	a8ec0df461	llama: remove per-arch tensor name lists (#21531 ) b8694	2026-04-07 15:02:03 +02:00
Georgi Gerganov	e8f5082697	server : fix restore for checkpoints with pos_min == 0 (#21510 ) b8693	2026-04-07 15:29:17 +03:00
Georgi Gerganov	22fc79134e	ggml : deprecate GGML_OP_ADD1 (#21363 ) * ggml : deprecate GGML_OP_ADD1 * cont : remove tests * cont : re-enable vulkan check b8692	2026-04-07 15:28:27 +03:00
Tom Overlund	2a619f6fbc	ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868 ) (#20904 ) b8691	2026-04-07 13:54:55 +02:00
mkoker	edd4d9bca5	vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (#21029 ) Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL in the flash attention base shader. Register them in the shader generator, pipeline creation, and enable in the scalar/coopmat1 FA support check. b8690	2026-04-07 13:41:29 +02:00
Aldehir Rojas	482192f12d	webui : store reasoning_content so it is sent back in subsequent requests (#21249 )	2026-04-07 13:32:44 +02:00
Antoine Viallon	71a81f6fcc	ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) (#21519 ) GGML_CUDA_CC_CDNA2 was set to 0x910. Fix by setting the constant to 0x90a to match the actual gfx90a ISA. b8688	2026-04-07 12:18:55 +02:00
Aleksander Grygier	ecce0087da	fix: Detect streaming state in reasoning content blocks (#21549 )	2026-04-07 12:04:41 +02:00
Kabir08	d1f82e382d	Fix rtl text rendering (#21382 ) * Fix Arabic RTL text rendering in web UI - Add dir='auto' attributes to markdown containers and blocks - Implement post-processing to add dir='auto' to all text elements - Replace directional CSS properties with logical properties for proper RTL list alignment - Ensure bidirectional text support for mixed Arabic/English content * Clean up commented duplicate function Remove the commented-out duplicate transformMdastNode function that was left over from refactoring. * Fix Arabic RTL text rendering in web UI - Add dir='auto' attributes to markdown containers and blocks - Implement post-processing to add dir='auto' to all text elements - Replace directional CSS properties with logical properties for proper RTL list alignment - Minor code formatting improvements This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI. * Implement rehype plugin for comprehensive RTL text support - Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children - Replace DOMParser-based approach with efficient HAST tree processing - Remove hardcoded element lists for better maintainability - Ensure proper bidirectional text rendering for mixed RTL/LTR content * Fix RTL text rendering with rehype plugin and cleanup * fix: prettier formatting	2026-04-07 11:37:20 +02:00
PMZFX	0988accf82	[SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527 ) Extend the existing reorder optimization to Q8_0. The reorder separates scale factors from weight data for coalesced memory access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing. On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x) on Qwen3.5-27B. BW utilization: 21% -> 66%. The key fix beyond the kernels: Q8_0 was missing from the type check in ggml_backend_sycl_buffer_init_tensor() that allocates the extra struct carrying the reorder flag -- so the optimization was silently skipped. AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware. Fixes: #21517 b8685	2026-04-07 16:12:49 +08:00
Dmytro Romanov	0033f53a07	docs: fix typo in build.md (emdawbwebgpu -> emdawnwebgpu) (#21518 ) b8684	2026-04-07 12:37:26 +08:00
Masashi Yoshimura	d0a6dfeb28	ggml-webgpu: Add the support of `MUL_MAT_ID` (#21147 ) * Add mul_mat_id support to WebGPU * Apply suggestion from @reeselevine --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com> b8683	2026-04-06 13:08:46 -07:00
Pasha Khosravi	2e1f0a889e	ggml: add Q1_0 1-bit quantization support (CPU) (#21273 ) * ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU) * add generic fallback for x86 * remove Q1_0 (group size 32) * rename Q1_0_g128 => Q1_0 * fix Q1_0 LlamaFileType Enum * Fix trailing spaces; add generic fallback for other backends * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix /r/n spacing + arch-fallback --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8682	2026-04-06 20:55:21 +02:00
Bipin Yadav	506200cf8b	cli: fix stripping of \n in multiline input (#21485 ) * llama-cli: fix stripping of \n in multiline input * Change & string to string_view * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix EditorConfig linter error --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8681	2026-04-06 20:54:06 +02:00
Gaurav Garg	15f786e658	[CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159 ) * Write an optimized flash_attn_stream_k_fixup kernel Write a specialized and more optimized kernel for cases where nblocks_stream_k is multiple of ntiles_dst. Make nblocks_stream_k to multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst * Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs * Address review comments * Address review comments * Revert variable names to original b8680	2026-04-06 20:34:29 +02:00
Aman Gupta	94ca829b60	llama-bench: add `-fitc` and `-fitt` to arguments (#21304 ) * llama-bench: add `-fitc` and `-fitt` to arguments * update README.md * address review comments * update compare-llama-bench.py b8679	2026-04-06 22:26:02 +08:00
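
Note on 93bdc61563 (gguf-py missing comma): the failure mode described in that commit is Python's implicit concatenation of adjacent string literals. A minimal sketch follows; the tensor names are hypothetical placeholders, not the real entries in the gguf-py tensor mapping.

```python
# Hypothetical tensor names, for illustration only
# (these are NOT the actual gguf-py mapping entries).
broken = (
    "vision.blocks.{bid}.attn.qkv"    # <- missing comma: merges with the next literal
    "vision.layers.{bid}.attn.wqkv",
)

fixed = (
    "vision.blocks.{bid}.attn.qkv",
    "vision.layers.{bid}.attn.wqkv",
)

# Adjacent string literals are joined at compile time, so the broken
# tuple silently holds one fused name instead of two separate ones.
print(len(broken), len(fixed))
print(broken[0])
```

Python raises no error in the broken case, so neither of the two intended names would match during a lookup.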


8728 Commits
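
Note on 4293919068 (skip non-primary GGUF split files): the idea of not assuming that files are listed in order can be sketched as below. This is an illustration only, not the code in `common`. The `-NNNNN-of-NNNNN.gguf` suffix pattern is an assumption about how split files are named, and `pick_primary` is a hypothetical helper.

```python
import re

# Assumed split-file suffix, e.g. "model-00002-of-00003.gguf".
SPLIT_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")

def pick_primary(files):
    """Return the first unsplit .gguf file or the first split numbered 1.

    Hypothetical helper: the input order is not trusted, so every
    candidate is checked instead of taking files[0].
    """
    for name in files:
        if not name.endswith(".gguf"):
            continue  # ignore non-model files
        m = SPLIT_RE.search(name)
        if m is not None and int(m.group(1)) != 1:
            continue  # non-primary split: skip it
        return name
    return None

listing = [
    "model-00002-of-00003.gguf",
    "model-00003-of-00003.gguf",
    "model-00001-of-00003.gguf",
]
print(pick_primary(listing))
```

Taking the first listed file would have selected split 2 of 3 here; checking the split index selects the primary file regardless of listing order.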