12642 Commits

Author SHA1 Message Date
Concedo
4ede3dfea4 added gemma4 thinking template v1.111.2 2026-04-07 19:06:29 +08:00
Concedo
2d3fe0c113 revert my tweak, switch back to henk's original implementation for now, we can explore this again next time. 2026-04-07 19:00:58 +08:00
Concedo
15d269197e Merge commit '506200cf8b5c8419ce97d16dc8c50f4634e21ebe' into concedo_experimental
# Conflicts:
#	docs/multimodal.md
#	scripts/compare-llama-bench.py
#	src/llama-vocab.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
2026-04-07 14:58:36 +08:00
Bipin Yadav
506200cf8b cli: fix stripping of \n in multiline input (#21485)
* llama-cli: fix stripping of \n in multiline input

* Change & string to string_view

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix EditorConfig linter error

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-06 20:54:06 +02:00
Gaurav Garg
15f786e658 [CUDA] Write an optimized flash_attn_stream_k_fixup kernel (#21159)
* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized and more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst.
Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs

* Address review comments

* Address review comments

* Revert variable names to original
2026-04-06 20:34:29 +02:00
Concedo
5e16453f0c fixed a bug in chat completions think handling 2026-04-07 00:16:34 +08:00
Concedo
e991bc044e updated lite, modify henk fix to allow triggering on missing close only 2026-04-06 23:41:45 +08:00
Aman Gupta
94ca829b60 llama-bench: add -fitc and -fitt to arguments (#21304)
* llama-bench: add `-fitc` and `-fitt` to arguments

* update README.md

* address review comments

* update compare-llama-bench.py
2026-04-06 22:26:02 +08:00
Aldehir Rojas
4aa962e2b0 vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488) 2026-04-06 09:08:37 -05:00
Concedo
a395af65db Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-riscv.yml
#	.github/workflows/build.yml
#	ggml/src/ggml-hexagon/htp/argsort-ops.c
#	ggml/src/ggml-sycl/fattn-tile.hpp
#	tools/mtmd/CMakeLists.txt
2026-04-06 20:56:02 +08:00
Concedo
82cc19e055 calculate some fields before autofit for more accurate estimate 2026-04-06 20:44:37 +08:00
Sigbjørn Skjæret
941146b3f1 convert : fix block_ff_dim retrieval for lfm2 (#21508) 2026-04-06 14:05:18 +02:00
lainon1
482d862bcb server : handle unsuccessful sink.write in chunked stream provider (#21478)
Check the return value of sink.write() in the chunked content provider
and return false when the write fails, matching cpp-httplib's own
streaming contract. This prevents logging chunks as sent when the sink
rejected them and properly aborts the stream on connection failure.
2026-04-06 14:03:02 +02:00
Xuan-Son Nguyen
3979f2bb08 docs: add hunyuan-ocr gguf, also add test [no ci] (#21490) 2026-04-06 14:02:37 +02:00
Concedo
f6e712d919 universal gemma4 fix, add memory check 2026-04-06 19:20:44 +08:00
Georgi Gerganov
400ac8e194 convert : set "add bos" == True for Gemma 4 (#21500)
* convert : set "add bos" == True for Gemma 4

* cont : handle old GGUFs
2026-04-06 13:52:07 +03:00
Concedo
a309086735 Revert "increase debug mode truncation limit"
This reverts commit 59f863746d.
2026-04-06 18:51:12 +08:00
henk717
4e30294cb1 Henk's Gemma4 31B Magic (#2096) 2026-04-06 18:49:19 +08:00
Neo Zhang
f51fd36d79 sycl : handle other FA case (#21377) 2026-04-06 13:28:00 +03:00
Concedo
59f863746d increase debug mode truncation limit 2026-04-06 17:57:44 +08:00
Yarden Tal
25eec6f327 hexagon: slight optimization for argsort output init (#21463) 2026-04-05 18:30:25 -07:00
anchortense
58190cc84d llama : correct platform-independent loading of BOOL metadata (#21428)
* model-loader : fix GGUF bool array conversion

* model-loader : fix remaining GGUF bool pointer uses
2026-04-06 01:40:38 +02:00
Richard Davison
af76639f72 model : add HunyuanOCR support (#21395)
* HunyuanOCR: add support for text and vision models

- Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge
- Add separate HUNYUAN_OCR chat template (content-before-role format)
- Handle HunyuanOCR's invalid pad_token_id=-1 in converter
- Fix EOS/EOT token IDs from generation_config.json
- Support xdrope RoPE scaling type
- Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.)
- Register HunYuanVLForConditionalGeneration for both text and mmproj conversion

* fix proper mapping

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* address comments

* update

* Fix typecheck

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-05 23:32:14 +02:00
Ludovic Henry
761797ffdf ci : use default RISE RISC-V Runners (#21263) 2026-04-05 20:29:48 +02:00
Concedo
63ca37e62a fix assistant prefill logic (+1 squashed commits)
Squashed commits:

[f4963baf5] fix prefills
v1.111.1
2026-04-05 23:25:44 +08:00
ddh0
5d3a4a7da5 server : fix logging of build + system info (#21460)
This PR changes the logging that occurs at startup of llama-server.
Currently, it is redundant (including CPU information twice) and it is
missing the build + commit info.
2026-04-05 16:14:02 +02:00
Concedo
53b3bf46e4 fixed a typo 2026-04-05 18:46:30 +08:00
Concedo
9b1f1bbf35 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-vulkan.yml
#	.github/workflows/docker.yml
#	embd_res/templates/google-gemma-4-31B-it-interleaved.jinja
#	embd_res/templates/google-gemma-4-31B-it.jinja
#	tests/test-chat.cpp
2026-04-05 18:46:23 +08:00
Concedo
e555b16549 updated lite better gemma handling 2026-04-05 18:34:22 +08:00
Concedo
49941b6268 handle think streaming for gemma4 2026-04-05 13:48:07 +08:00
Concedo
dc2e6ca2e3 fix header path 2026-04-05 11:02:08 +08:00
Concedo
13e932b241 more fixes for gemma4 2026-04-05 10:34:40 +08:00
M1DNYT3
c08d28d088 ci: lower cuda12 floor to 12.8.1 for broader host compatibility (#21438)
Co-authored-by: M1DNYT3 <m1dnyt3@MacBookPro.lan>
2026-04-05 09:04:00 +08:00
Nicholas Sparks
661e9acb36 ci: fix vulkan workflow referencing non-existent action (#21442) 2026-04-05 08:59:51 +08:00
Aldehir Rojas
b8635075ff common : add gemma 4 specialized parser (#21418)
* common : add gemma4 dedicated parser

* cont : add '<|tool_response>' as eog

* cont : emit JSON from Gemma4 tool call AST

* cont : more fixes

* cont : refactor convert function

* cont : refine rules and mapping

* cont : add more tests

* cont : clean up

* cont : remove autoparser gemma4 implementation

* cont : more cleanup

* cont : rename gemma4.jinja to match the others

* cont : add custom template to support interleaved thinking

* cont : preserve reasoning in model turns

* cont : fix initializer error

* cont : fix unused vars

* cont : fix accidental static

* cont : fix specialized_template signature

* fix extra semicolon

* remove debug line and extra space [no ci]
2026-04-04 20:39:00 +02:00
Eso
11bc83229a fix: Autoswap with override configs (#2091)
* fix: Autoswap with overrides

* fix: Autoswap with overrides
2026-04-05 00:43:19 +08:00
Concedo
376aaf258c Merge branch 'upstream' into concedo_experimental 2026-04-04 23:56:02 +08:00
Concedo
6c937c05d9 improve ncmoe / moecpu regex 2026-04-04 23:53:13 +08:00
Concedo
f7c9029668 rename env var KOBOLDCPP_PASSWORD to KCPP_PASSWORD for consistency, likewise KOBOLDCPP_ADMINPASSWORD to KCPP_ADMINPASSWORD 2026-04-04 23:36:30 +08:00
Concedo
db8bc40731 add some warnings if shifting fails 2026-04-04 23:16:26 +08:00
Concedo
d3d50a7b3c fixed reasoning content response in fakestreaming tools 2026-04-04 23:03:33 +08:00
Concedo
ac92ac22d7 tool call fix 2026-04-04 22:35:03 +08:00
Concedo
eb3422996a BOS fix for gemma4 2026-04-04 22:15:01 +08:00
Dan Hoffman
9c699074c9 server: Fix undefined timing measurement errors in server context (#21201)
Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
2026-04-04 22:11:19 +08:00
Adrien Gallouët
d01f6274c0 common : respect specified tag, only fallback when tag is empty (#21413)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-04 15:08:03 +02:00
SamareshSingh
650bf14eb9 llama-model: read final_logit_softcapping for Gemma 4 (#21390) 2026-04-04 13:05:10 +02:00
Aman Gupta
b7ad48ebda llama: add custom newline split for Gemma 4 (#21406) 2026-04-04 15:06:34 +08:00
Concedo
2e4f94822e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/docker.yml
#	ci/run.sh
#	docs/build.md
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	src/llama-vocab.cpp
#	tests/test-chat.cpp
#	tests/test-jinja.cpp
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-04-04 14:27:23 +08:00
Concedo
235ec9a1b9 updated lite 2026-04-04 14:24:05 +08:00
Concedo
a33eda3842 more template fixes for the gemma4 31b 2026-04-04 14:23:16 +08:00