Compare commits

..

5 Commits

Author SHA1 Message Date
ddh0
9d34231bb8 llama-quant : default ftype param Q5_1 --> Q8_0 (#20828)
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.

In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
2026-04-25 09:25:35 +03:00
Georgi Gerganov
8ea8fee966 gitignore : add .pi + personal SYSTEM.md (#22316)
* gitignore : add .pi + personal SYSTEM.md

* cont : fix requirements heading in PR template

* cont : shorten line
2026-04-25 09:20:45 +03:00
Neo Zhang
eddd7a13a5 [SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291)
* opt arc770 for Q4_0

* add for Q4_0

* update the script

* add help script for windows

* update guide

* fix format issue

* convert from dos to unix for format issue

* fix missed -sm parameter
2026-04-25 09:20:14 +03:00
Reese Levine
dd2914dc81 ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327)
* Implement ssm_scan

* Remove blocking in graph_compute and check for set rows

* Fix bindings

* Update op support
2026-04-25 09:18:15 +03:00
Piotr Wilkin (ilintar)
0adede866d parser: fix structured output bug (#22302)
* fix very stupid structured output bug

* Things just cannot be too easy.
2026-04-24 23:19:55 +02:00
20 changed files with 4903 additions and 1713 deletions

View File

@@ -6,7 +6,7 @@
<!-- You can provide more details and link related discussions here. Delete this section if not applicable -->
# Requirements
## Requirements
<!-- IMPORTANT: Please do NOT delete this section, otherwise your PR may be rejected -->

12
.gitignore vendored
View File

@@ -34,7 +34,6 @@
/.vscode/
/nppBackup
# Coverage
/gcovr-report/
@@ -74,6 +73,7 @@
!/models/templates
# Zig
/zig-out/
/zig-cache/
@@ -93,6 +93,7 @@
!/examples/sycl/*.sh
# Server Web UI temporary files
/tools/server/webui/node_modules
/tools/server/webui/dist
# we no longer use gz for index.html
@@ -106,9 +107,11 @@ __pycache__/
poetry.toml
# Nix
/result
# Test binaries
/tests/test-backend-ops
/tests/test-double-float
/tests/test-grad0
@@ -124,6 +127,7 @@ poetry.toml
/tests/test-tokenizer-1-spm
# Scripts
!/scripts/install-oneapi.bat
# Generated by scripts
@@ -132,18 +136,24 @@ poetry.toml
/wikitext-2-raw/
# Test models for lora adapters
/lora-tests
# Local scripts
/run-vim.sh
/run-chat.sh
/run-spec.sh
/.ccache/
# IDE
/*.code-workspace
/.windsurf/
# emscripten
a.out.*
# AGENTS
AGENTS.local.md
.pi/SYSTEM.md

33
.pi/gg/SYSTEM.md Normal file
View File

@@ -0,0 +1,33 @@
You are a coding agent. Here are some very important rules that you must follow:
General:
- By very precise and concise when writing code, comments, explanations, etc.
- PR and commit titles format: `<module> : <title>`. Lookup recents for examples
- Don't try to build or run the code unless you are explicitly asked to do so
Coding:
- When in doubt, always refer to the CONTRIBUTING.md file of the project
- When referencing issues or PRs in comments, use the format:
- C/C++ code: `// ref: <url>`
- Other (CMake, etc.): `# ref: <url>`
Pull requests (PRs):
- New branch names are prefixed with "gg/"
- Before opening a pull request, ask the user to confirm the description
- When creating a pull request, look for the repository's PR template and follow it
- For the AI usage disclosure section, write "YES. llama.cpp + pi"
- Always create the pull requests in draft mode
Commits:
- On every commit that you make, include a "Assisted-by: llama.cpp:local pi" tag
- Do not explicitly set the git author in commits - rely on the default git config
Resources (read on demand):
- [CONTRIBUTING.md](CONTRIBUTING.md)
- [Build documentation](docs/build.md)
- [Server usage documentation](tools/server/README.md)
- [Server development documentation](tools/server/README-dev.md)
- [PEG parser](docs/development/parsing.md)
- [Auto parser](docs/autoparser.md)
- [Jinja engine](common/jinja/README.md)
- [PR template](.github/pull_request_template.md)

View File

@@ -51,6 +51,12 @@ The packages for FP32 and FP16 would have different accuracy and performance on
## News
- 2026.04
- Optimize mul_mat by reorder feature for data type: Q4_K, Q5_K, Q_K, Q8_0.
- Fused MoE.
- Upgrate CI and built package for oneAPI 2025.3.3, support Ubuntu 24.04 built package.
- 2026.03
- Support Flash-Attention: less memory usage, performance impact depends on LLM.
@@ -349,6 +355,12 @@ Choose one of following methods to run.
./examples/sycl/test.sh
```
- Run llama-server:
```sh
./examples/sycl/start-svr.sh -m PATH/MODEL_FILE
```
2. Command line
Launch inference
@@ -637,10 +649,18 @@ Choose one of following methods to run.
1. Script
- Run test:
```
examples\sycl\win-test.bat
```
- Run llama-server:
```
examples\sycl\win-start-svr.bat -m PATH\MODEL_FILE
```
2. Command line
Launch inference

View File

@@ -26,7 +26,7 @@ Legend:
| CLAMP | ❌ | ✅ | ✅ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| CONCAT | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| CONT | ❌ | 🟡 | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| CONV_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | | ❌ | ❌ |
| CONV_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | | ❌ | ❌ |
| CONV_2D_DW | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_3D | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
@@ -60,7 +60,7 @@ Legend:
| GROUP_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| HARDSIGMOID | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| HARDSWISH | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| IM2COL | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ❌ | ❌ |
| IM2COL | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ❌ | ❌ |
| IM2COL_3D | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| L2_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| LEAKY_RELU | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | ✅ | 🟡 | ❌ | ❌ | ❌ |
@@ -105,7 +105,7 @@ Legend:
| SQR | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SQRT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SSM_CONV | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SSM_SCAN | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | 🟡 | | ❌ | ❌ |
| SSM_SCAN | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | 🟡 | | ❌ | ❌ |
| STEP | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SUB | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SUM | ❌ | 🟡 | ✅ | 🟡 | 🟡 | ❌ | 🟡 | 🟡 | 🟡 | ❌ | ❌ |

File diff suppressed because it is too large Load Diff

124
examples/sycl/start-svr.sh Executable file
View File

@@ -0,0 +1,124 @@
#!/bin/bash
# MIT license
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: MIT
Help() {
cat << EOF
Usage: $(basename "$0") [OPTIONS]
This script processes files with specified options.
Options:
-h, --help Display this help message and exit.
-c, --context <value> Set context length. Bigger need more memory.
-p, --promote <value> Prompt to start generation with.
-m, --model <value> Full model file path.
-mg,--main-gpu <value> Set main GPU ID (0 - n) for single GPU mode.
-sm,--split-mode <value> How to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
-ngl,--n-gpu-layers <value> Max. number of layers to store in VRAM (default: -1)
-lv,--log-verbosity <value> Set the verbosity threshold. Messages with a higher verbosity will be
ignored. Values:
- 0: generic output
- 1: error
- 2: warning
- 3: info
- 4: debug
EOF
}
BIN_FILE=./build/bin/llama-server
SEED=0
GPUS_SETTING=""
MODEL_FILE=../models/Qwen3.5-4B-Q4_0.gguf
NGL=99
CONTEXT=4096
GGML_SYCL_DEVICE=-1
SPLIT_MODE=layer
LOG_VERBOSE=3
while [[ $# -gt 0 ]]; do
case "$1" in
-c|--context)
CONTEXT=$2
# Shift twice to consume both the option flag and its value
shift
shift
;;
-m|--model)
MODEL_FILE="$2"
# Shift twice to consume both the option flag and its value
shift
shift
;;
-mg|--main-gpu)
GGML_SYCL_DEVICE=$2
SPLIT_MODE=none
# Shift twice to consume both the option flag and its value
shift
shift
;;
-sm|--split-mode)
SPLIT_MODE=$2
# Shift twice to consume both the option flag and its value
shift
shift
;;
-ngl|--n-gpu-layers)
NGL=$2
# Shift twice to consume both the option flag and its value
shift
shift
;;
-lv|--log-verbosity)
LOG_VERBOSE=$2
# Shift twice to consume both the option flag and its value
shift
shift
;;
-h|--help)
Help
exit 0
;;
*)
# Handle unknown options or stop processing options
echo "Invalid option: $1"
# Optional: exit script or shift to treat remaining as positional args
exit 1
;;
esac
done
source /opt/intel/oneapi/setvars.sh
#export GGML_SYCL_DEBUG=1
#ZES_ENABLE_SYSMAN=1, Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory. Recommended to use when --split-mode = layer.
#support malloc device memory more than 4GB.
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
echo "UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=${UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS}"
if [ $GGML_SYCL_DEVICE -ne -1 ]; then
echo "Use $GGML_SYCL_DEVICE as main GPU"
#use signle GPU only
GPUS_SETTING="-mg $GGML_SYCL_DEVICE -sm ${SPLIT_MODE}"
export ONEAPI_DEVICE_SELECTOR="level_zero:${$GGML_SYCL_DEVICE}"
echo "ONEAPI_DEVICE_SELECTOR=${ONEAPI_DEVICE_SELECTOR}"
else
echo "Use all Intel GPUs, including iGPU & dGPU"
GPUS_SETTING="-sm ${SPLIT_MODE}"
fi
echo "run cmd: ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -no-cnv -p "${INPUT_PROMPT}" -n 200 -e -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap "
ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap --host 0.0.0.0 --port 8000

View File

@@ -38,7 +38,7 @@ SEED=0
GPUS_SETTING=""
INPUT_PROMPT="Building a website can be done in 10 simple steps:\nStep 1:"
MODEL_FILE=models/llama-2-7b.Q4_0.gguf
MODEL_FILE=../models/llama-2-7b.Q4_0.gguf
NGL=99
CONTEXT=4096
GGML_SYCL_DEVICE=-1
@@ -122,9 +122,10 @@ if [ $GGML_SYCL_DEVICE -ne -1 ]; then
export ONEAPI_DEVICE_SELECTOR="level_zero:${$GGML_SYCL_DEVICE}"
echo "ONEAPI_DEVICE_SELECTOR=${ONEAPI_DEVICE_SELECTOR}"
else
echo "Use all Intel GPUs, including iGPU & dGPU"
echo "Use all Intel GPUs, including iGPU & dGPU"
GPUS_SETTING="-sm ${SPLIT_MODE}"
fi
echo "run cmd: ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -no-cnv -p "${INPUT_PROMPT}" -n 400 -e -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap "
ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -no-cnv -p "${INPUT_PROMPT}" -n 400 -e -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap
echo "run cmd: ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -no-cnv -p "${INPUT_PROMPT}" -n 200 -e -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap "
ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -no-cnv -p "${INPUT_PROMPT}" -n 200 -e -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap

View File

@@ -0,0 +1,179 @@
:: MIT license
:: Copyright (C) 2024 Intel Corporation
:: SPDX-License-Identifier: MIT
@echo off
setlocal EnableExtensions EnableDelayedExpansion
set "BIN_FILE=.\build\bin\llama-server.exe"
set "SEED=0"
set "GPUS_SETTING="
set "MODEL_FILE=..\models\Qwen3.5-4B-Q4_0.gguf"
set "NGL=99"
set "CONTEXT=4096"
set "GGML_SYCL_DEVICE=-1"
set "SPLIT_MODE=layer"
set "LOG_VERBOSE=3"
if "%~1"=="" goto after_args
:parse_args
if "%~1"=="" goto after_args
if /I "%~1"=="-c" (
if "%~2"=="" goto missing_value
set "CONTEXT=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--context" (
if "%~2"=="" goto missing_value
set "CONTEXT=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-m" (
if "%~2"=="" goto missing_value
set "MODEL_FILE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--model" (
if "%~2"=="" goto missing_value
set "MODEL_FILE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-mg" (
if "%~2"=="" goto missing_value
set "GGML_SYCL_DEVICE=%~2"
set "SPLIT_MODE=none"
shift
shift
goto parse_args
)
if /I "%~1"=="--main-gpu" (
if "%~2"=="" goto missing_value
set "GGML_SYCL_DEVICE=%~2"
set "SPLIT_MODE=none"
shift
shift
goto parse_args
)
if /I "%~1"=="-sm" (
if "%~2"=="" goto missing_value
set "SPLIT_MODE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--split-mode" (
if "%~2"=="" goto missing_value
set "SPLIT_MODE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-ngl" (
if "%~2"=="" goto missing_value
set "NGL=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--n-gpu-layers" (
if "%~2"=="" goto missing_value
set "NGL=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-lv" (
if "%~2"=="" goto missing_value
set "LOG_VERBOSE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--log-verbosity" (
if "%~2"=="" goto missing_value
set "LOG_VERBOSE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-h" goto help
if /I "%~1"=="--help" goto help
echo Invalid option: %~1
exit /b 1
:missing_value
echo Missing value for option: %~1
exit /b 1
:help
echo Usage: %~n0 [OPTIONS]
echo.
echo This script processes files with specified options.
echo.
echo Options:
echo -h, --help Display this help message and exit.
echo -c, --context ^<value^> Set context length. Bigger need more memory.
echo -m, --model ^<value^> Full model file path.
echo -mg,--main-gpu ^<value^> Set main GPU ID (0 - n) for single GPU mode.
echo -sm,--split-mode ^<value^> How to split the model across multiple GPUs, one of:
echo - none: use one GPU only
echo - layer (default): split layers and KV across GPUs
echo - row: split rows across GPUs
echo -ngl,--n-gpu-layers ^<value^> Max. number of layers to store in VRAM (default: -1)
echo -lv,--log-verbosity ^<value^> Set the verbosity threshold. Messages with a higher verbosity will be
echo ignored. Values:
echo - 0: generic output
echo - 1: error
echo - 2: warning
echo - 3: info
echo - 4: debug
exit /b 0
:after_args
REM In Windows CMD, source is not available; call oneAPI setvars if present.
if exist "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" (
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" >nul
) else (
echo Warning: oneAPI setvars.bat not found. Continuing without environment setup.
)
REM Support malloc device memory more than 4GB.
set "UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1"
echo UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=%UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS%
if not "%GGML_SYCL_DEVICE%"=="-1" (
echo Use %GGML_SYCL_DEVICE% as main GPU
REM Use single GPU only.
set "GPUS_SETTING=-mg %GGML_SYCL_DEVICE% -sm %SPLIT_MODE%"
set "ONEAPI_DEVICE_SELECTOR=level_zero:%GGML_SYCL_DEVICE%"
echo ONEAPI_DEVICE_SELECTOR=%ONEAPI_DEVICE_SELECTOR%
) else (
echo Use all Intel GPUs, including iGPU ^& dGPU
set "GPUS_SETTING=-sm %SPLIT_MODE%"
)
echo run cmd: ZES_ENABLE_SYSMAN=1 %BIN_FILE% -m "%MODEL_FILE%" -ngl %NGL% -s %SEED% -c %CONTEXT% %GPUS_SETTING% -lv %LOG_VERBOSE% --mmap --host 0.0.0.0 --port 8000
set "ZES_ENABLE_SYSMAN=1"
%BIN_FILE% -m "%MODEL_FILE%" -ngl %NGL% -s %SEED% -c %CONTEXT% %GPUS_SETTING% -lv %LOG_VERBOSE% --mmap --host 0.0.0.0 --port 8000
endlocal

View File

@@ -2,10 +2,200 @@
:: Copyright (C) 2024 Intel Corporation
:: SPDX-License-Identifier: MIT
set INPUT2="Building a website can be done in 10 simple steps:\nStep 1:"
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
:: support malloc device memory more than 4GB.
set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
set LOAD_MODE="--mmap"
.\build\bin\llama-completion.exe -m models\llama-2-7b.Q4_0.gguf -no-cnv -p %INPUT2% -n 400 -e -ngl 99 -s 0 %LOAD_MODE%
@echo off
setlocal EnableExtensions EnableDelayedExpansion
REM MIT license
REM Copyright (C) 2024 Intel Corporation
REM SPDX-License-Identifier: MIT
set "BIN_FILE=.\build\bin\llama-completion.exe"
set "SEED=0"
set "GPUS_SETTING="
set "INPUT_PROMPT=Building a website can be done in 10 simple steps:^nStep 1:"
set "MODEL_FILE=..\models\llama-2-7b.Q4_0.gguf"
set "NGL=99"
set "CONTEXT=4096"
set "GGML_SYCL_DEVICE=-1"
set "SPLIT_MODE=layer"
set "LOG_VERBOSE=3"
if "%~1"=="" goto after_args
:parse_args
if "%~1"=="" goto after_args
if /I "%~1"=="-c" (
if "%~2"=="" goto missing_value
set "CONTEXT=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--context" (
if "%~2"=="" goto missing_value
set "CONTEXT=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-p" (
if "%~2"=="" goto missing_value
set "INPUT_PROMPT=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--promote" (
if "%~2"=="" goto missing_value
set "INPUT_PROMPT=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-m" (
if "%~2"=="" goto missing_value
set "MODEL_FILE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--model" (
if "%~2"=="" goto missing_value
set "MODEL_FILE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-mg" (
if "%~2"=="" goto missing_value
set "GGML_SYCL_DEVICE=%~2"
set "SPLIT_MODE=none"
shift
shift
goto parse_args
)
if /I "%~1"=="--main-gpu" (
if "%~2"=="" goto missing_value
set "GGML_SYCL_DEVICE=%~2"
set "SPLIT_MODE=none"
shift
shift
goto parse_args
)
if /I "%~1"=="-sm" (
if "%~2"=="" goto missing_value
set "SPLIT_MODE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--split-mode" (
if "%~2"=="" goto missing_value
set "SPLIT_MODE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-ngl" (
if "%~2"=="" goto missing_value
set "NGL=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--n-gpu-layers" (
if "%~2"=="" goto missing_value
set "NGL=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-lv" (
if "%~2"=="" goto missing_value
set "LOG_VERBOSE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="--log-verbosity" (
if "%~2"=="" goto missing_value
set "LOG_VERBOSE=%~2"
shift
shift
goto parse_args
)
if /I "%~1"=="-h" goto help
if /I "%~1"=="--help" goto help
echo Invalid option: %~1
exit /b 1
:missing_value
echo Missing value for option: %~1
exit /b 1
:help
echo Usage: %~n0 [OPTIONS]
echo.
echo This script processes files with specified options.
echo.
echo Options:
echo -h, --help Display this help message and exit.
echo -c, --context ^<value^> Set context length. Bigger need more memory.
echo -p, --promote ^<value^> Prompt to start generation with.
echo -m, --model ^<value^> Full model file path.
echo -mg,--main-gpu ^<value^> Set main GPU ID (0 - n) for single GPU mode.
echo -sm,--split-mode ^<value^> How to split the model across multiple GPUs, one of:
echo - none: use one GPU only
echo - layer (default): split layers and KV across GPUs
echo - row: split rows across GPUs
echo -ngl,--n-gpu-layers ^<value^> Max. number of layers to store in VRAM (default: -1)
echo -lv,--log-verbosity ^<value^> Set the verbosity threshold. Messages with a higher verbosity will be
echo ignored. Values:
echo - 0: generic output
echo - 1: error
echo - 2: warning
echo - 3: info
echo - 4: debug
exit /b 0
:after_args
REM In Windows CMD, source is not available; call oneAPI setvars if present.
if exist "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" (
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" >nul
) else (
echo Warning: oneAPI setvars.bat not found. Continuing without environment setup.
)
REM Support malloc device memory more than 4GB.
set "UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1"
echo UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=%UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS%
if not "%GGML_SYCL_DEVICE%"=="-1" (
echo Use %GGML_SYCL_DEVICE% as main GPU
REM Use single GPU only.
set "GPUS_SETTING=-mg %GGML_SYCL_DEVICE% -sm %SPLIT_MODE%"
set "ONEAPI_DEVICE_SELECTOR=level_zero:%GGML_SYCL_DEVICE%"
echo ONEAPI_DEVICE_SELECTOR=%ONEAPI_DEVICE_SELECTOR%
) else (
echo Use all Intel GPUs, including iGPU ^& dGPU
set "GPUS_SETTING=-sm %SPLIT_MODE%"
)
echo run cmd: ZES_ENABLE_SYSMAN=1 %BIN_FILE% -m %MODEL_FILE% -no-cnv -p "%INPUT_PROMPT%" -n 200 -e -ngl %NGL% -s %SEED% -c %CONTEXT% %GPUS_SETTING% -lv %LOG_VERBOSE% --mmap
set "ZES_ENABLE_SYSMAN=1"
%BIN_FILE% -m "%MODEL_FILE%" -no-cnv -p "%INPUT_PROMPT%" -n 200 -e -ngl %NGL% -s %SEED% -c %CONTEXT% %GPUS_SETTING% -lv %LOG_VERBOSE% --mmap
endlocal

View File

@@ -224,7 +224,7 @@ struct sycl_device_info {
// cudaOccupancyMaxActiveBlocksPerMultiprocessor
bool vmm; // virtual memory support
size_t total_vram;
//sycl_hw_info hw_info; \\ device id and aarch, currently not used
sycl_hw_info hw_info;
optimize_feature opt_feature;
};

View File

@@ -104,6 +104,7 @@ static ggml_sycl_device_info ggml_sycl_init() {
info.max_work_group_sizes[i] = prop.get_max_work_group_size();
info.devices[i].max_wg_per_cu = info.max_work_group_sizes[i] / prop.get_max_compute_units();
info.devices[i].hw_info = get_device_hw_info(&device);
}
@@ -3703,9 +3704,16 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
// Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization
// is enabled takes precedence over DMMV, the current if-else implementation
// requires disabling DMMV if both conditions are met
if (!g_ggml_sycl_prioritize_dmmv && ((should_reorder_tensor(ctx, dst) &&
ggml_sycl_supports_reorder_mmvq(src0->type)))) {
use_dequantize_mul_mat_vec = use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;
// Arc770 get benefit with Q4_0 by skipping it.
if (!(ggml_sycl_info().devices[ctx.device].hw_info.arch ==
gpu_arch::intel_gpu_acm_g10 &&
src0->type == GGML_TYPE_Q4_0)) {
use_dequantize_mul_mat_vec =
use_dequantize_mul_mat_vec && !use_mul_mat_vec_q;
}
}
if (!split && src0->type == GGML_TYPE_F16 && ggml_is_permuted(src0) && ggml_is_permuted(src1) && src1->ne[1] == 1) {

View File

@@ -1,15 +1,67 @@
#include "sycl_hw.hpp"
// TODO: currently not used
/*
sycl_hw_info get_device_hw_info(sycl::device *device_ptr) {
sycl_hw_info res;
int32_t id = device_ptr->get_info<sycl::ext::intel::info::device::device_id>();
res.device_id = id;
using namespace std;
syclex::architecture arch = device_ptr->get_info<syclex::info::device::architecture>();
res.arch = arch;
return res;
}
/*defined in
* /opt/intel/oneapi/compiler/latest/include/sycl/ext/oneapi/experimental/device_architecture.def
*/
static map<gpu_arch, std::pair<const char*, sycl_intel_gpu_family>> arch2name = {
{gpu_arch::intel_gpu_bdw, {"intel_gpu_bdw", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_skl, {"intel_gpu_skl", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_kbl, {"intel_gpu_kbl", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_cfl, {"intel_gpu_cfl", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_apl, {"intel_gpu_apl", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_glk, {"intel_gpu_glk", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_whl, {"intel_gpu_whl", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_aml, {"intel_gpu_aml", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_cml, {"intel_gpu_cml", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_icllp, {"intel_gpu_icllp", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_ehl, {"intel_gpu_ehl", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_tgllp, {"intel_gpu_tgllp", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_rkl, {"intel_gpu_rkl", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_adl_s, {"intel_gpu_adl_s", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_adl_p, {"intel_gpu_adl_p", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_adl_n, {"intel_gpu_adl_n", GPU_FAMILY_IGPU_NON_XE}},
{gpu_arch::intel_gpu_dg1, {"intel_gpu_dg1", GPU_FAMILY_DGPU_CLIENT_GAME}},
{gpu_arch::intel_gpu_acm_g10, {"intel_gpu_acm_g10", GPU_FAMILY_DGPU_CLIENT_GAME}},
{gpu_arch::intel_gpu_acm_g11, {"intel_gpu_acm_g11", GPU_FAMILY_DGPU_CLIENT_GAME}},
{gpu_arch::intel_gpu_acm_g12, {"intel_gpu_acm_g12", GPU_FAMILY_DGPU_CLIENT_GAME}},
{gpu_arch::intel_gpu_pvc, {"intel_gpu_pvc", GPU_FAMILY_DGPU_CLOUD}},
{gpu_arch::intel_gpu_pvc_vg, {"intel_gpu_pvc_vg", GPU_FAMILY_DGPU_CLOUD}},
{gpu_arch::intel_gpu_mtl_u, {"intel_gpu_mtl_u", GPU_FAMILY_IGPU_XE}},
{gpu_arch::intel_gpu_mtl_h, {"intel_gpu_mtl_h", GPU_FAMILY_IGPU_XE}},
{gpu_arch::intel_gpu_arl_h, {"intel_gpu_arl_h", GPU_FAMILY_IGPU_XE}},
{gpu_arch::intel_gpu_bmg_g21, {"intel_gpu_bmg_g21", GPU_FAMILY_DGPU_CLIENT_GAME}},
{gpu_arch::intel_gpu_bmg_g31, {"intel_gpu_bmg_g31", GPU_FAMILY_DGPU_CLIENT_GAME}},
{gpu_arch::intel_gpu_lnl_m, {"intel_gpu_lnl_m", GPU_FAMILY_IGPU_XE}},
{gpu_arch::intel_gpu_ptl_h, {"intel_gpu_ptl_h", GPU_FAMILY_IGPU_XE}},
{gpu_arch::intel_gpu_ptl_u, {"intel_gpu_ptl_u", GPU_FAMILY_IGPU_XE}},
{gpu_arch::intel_gpu_wcl, {"intel_gpu_wcl", GPU_FAMILY_IGPU_XE}}
};
sycl_hw_info get_device_hw_info(sycl::device* device_ptr) {
sycl_hw_info res;
int32_t id =
device_ptr->get_info<sycl::ext::intel::info::device::device_id>();
res.device_id = id;
res.name = device_ptr->get_info<sycl::info::device::name>();
syclex::architecture arch =
device_ptr->get_info<syclex::info::device::architecture>();
res.arch = arch;
map<syclex::architecture,
std::pair<const char*, sycl_intel_gpu_family>>::iterator it =
arch2name.find(res.arch);
if (it != arch2name.end()) {
res.arch_name = it->second.first;
res.gpu_family = it->second.second;
} else {
res.arch_name = "unknown";
res.gpu_family = GPU_FAMILY_UKNOWN;
}
return res;
}

View File

@@ -9,18 +9,30 @@
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;
using gpu_arch = sycl::ext::oneapi::experimental::architecture;
// TODO: currently not used
/*
struct sycl_hw_info {
syclex::architecture arch;
int32_t device_id;
// It's used to mark the GPU computing capacity
// The value must flow the order of performance.
enum sycl_intel_gpu_family {
GPU_FAMILY_UKNOWN = -1,
// iGPU without Xe core, before Meteor Lake iGPU(Xe)
GPU_FAMILY_IGPU_NON_XE = 0,
// iGPU with Xe core, Meteor Lake iGPU or newer.
GPU_FAMILY_IGPU_XE = 1,
// dGPU for gaming in client/data center (DG1/FLex 140 or newer).
GPU_FAMILY_DGPU_CLIENT_GAME = 2,
// dGPU for AI in cloud, PVC or newer.
GPU_FAMILY_DGPU_CLOUD = 3
};
bool is_in_vector(std::vector<int> &vec, int item);
struct sycl_hw_info {
syclex::architecture arch;
const char* arch_name;
int32_t device_id;
std::string name;
sycl_intel_gpu_family gpu_family;
};
sycl_hw_info get_device_hw_info(sycl::device *device_ptr);
*/
#endif // SYCL_HW_HPP

View File

@@ -98,6 +98,29 @@ struct ggml_webgpu_ssm_conv_shader_decisions {
uint32_t tokens_per_wg;
};
struct ggml_webgpu_ssm_scan_pipeline_key {
int type;
int d_state;
bool operator==(const ggml_webgpu_ssm_scan_pipeline_key & other) const {
return type == other.type && d_state == other.d_state;
}
};
struct ggml_webgpu_ssm_scan_pipeline_key_hash {
size_t operator()(const ggml_webgpu_ssm_scan_pipeline_key & key) const {
size_t seed = 0;
ggml_webgpu_hash_combine(seed, key.type);
ggml_webgpu_hash_combine(seed, key.d_state);
return seed;
}
};
struct ggml_webgpu_ssm_scan_shader_decisions {
uint32_t wg_size;
uint32_t tokens_per_tile;
};
/** Argsort **/
struct ggml_webgpu_argsort_shader_lib_context {
@@ -921,6 +944,8 @@ class ggml_webgpu_shader_lib {
solve_tri_pipelines; // type
std::unordered_map<ggml_webgpu_ssm_conv_pipeline_key, webgpu_pipeline, ggml_webgpu_ssm_conv_pipeline_key_hash>
ssm_conv_pipelines; // type/vectorized
std::unordered_map<ggml_webgpu_ssm_scan_pipeline_key, webgpu_pipeline, ggml_webgpu_ssm_scan_pipeline_key_hash>
ssm_scan_pipelines; // type/d_state
std::unordered_map<ggml_webgpu_gated_delta_net_pipeline_key,
webgpu_pipeline,
ggml_webgpu_gated_delta_net_pipeline_key_hash>
@@ -1433,6 +1458,53 @@ class ggml_webgpu_shader_lib {
return ssm_conv_pipelines[key];
}
webgpu_pipeline get_ssm_scan_pipeline(const ggml_webgpu_shader_lib_context & context) {
ggml_webgpu_ssm_scan_pipeline_key key = {};
key.type = context.dst->type;
key.d_state = (int) context.src0->ne[0];
auto it = ssm_scan_pipelines.find(key);
if (it != ssm_scan_pipelines.end()) {
return it->second;
}
std::vector<std::string> defines;
std::string variant = "ssm_scan";
switch (key.type) {
case GGML_TYPE_F32:
variant += "_f32";
break;
default:
GGML_ABORT("Unsupported type for ssm_scan shader");
}
const uint32_t wg_size = (uint32_t) key.d_state;
constexpr uint32_t tokens_per_tile = 4u;
defines.push_back("WG_SIZE=" + std::to_string(wg_size) + "u");
defines.push_back("TOKENS_PER_TILE=" + std::to_string(tokens_per_tile) + "u");
if (context.supports_subgroups) {
defines.push_back("USE_SUBGROUP_REDUCTION");
variant += "_sg_reduce";
} else {
variant += "_wg_reduce";
}
variant += "_d" + std::to_string(key.d_state);
auto processed = preprocessor.preprocess(wgsl_ssm_scan, defines);
auto decisions = std::make_shared<ggml_webgpu_ssm_scan_shader_decisions>();
decisions->wg_size = wg_size;
decisions->tokens_per_tile = tokens_per_tile;
webgpu_pipeline pipeline = ggml_webgpu_create_pipeline(device, processed, variant);
pipeline.context = decisions;
ssm_scan_pipelines[key] = pipeline;
return ssm_scan_pipelines[key];
}
webgpu_pipeline get_gated_delta_net_pipeline(const ggml_webgpu_shader_lib_context & context) {
ggml_webgpu_gated_delta_net_pipeline_key key = {};
key.type = context.dst->type;

View File

@@ -1115,6 +1115,80 @@ static webgpu_encoded_op ggml_webgpu_ssm_conv(webgpu_context & ctx,
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
static webgpu_encoded_op ggml_webgpu_ssm_scan(webgpu_context & ctx,
ggml_tensor * src0,
ggml_tensor * src1,
ggml_tensor * src2,
ggml_tensor * src3,
ggml_tensor * src4,
ggml_tensor * src5,
ggml_tensor * src6,
ggml_tensor * dst) {
ggml_webgpu_shader_lib_context shader_lib_ctx = {};
shader_lib_ctx.src0 = src0;
shader_lib_ctx.dst = dst;
shader_lib_ctx.max_wg_size = ctx->global_ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup;
shader_lib_ctx.supports_subgroups = ctx->global_ctx->capabilities.supports_subgroups;
webgpu_pipeline pipeline = ctx->shader_lib->get_ssm_scan_pipeline(shader_lib_ctx);
std::vector<uint32_t> params = {
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, src0) / ggml_type_size(src0->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, src1) / ggml_type_size(src1->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, src2) / ggml_type_size(src2->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, src3) / ggml_type_size(src3->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, src4) / ggml_type_size(src4->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, src5) / ggml_type_size(src5->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, src6) / ggml_type_size(src6->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, dst) / ggml_type_size(dst->type)),
(uint32_t) (src0->nb[1] / ggml_type_size(src0->type)),
(uint32_t) (src0->nb[2] / ggml_type_size(src0->type)),
(uint32_t) (src0->nb[3] / ggml_type_size(src0->type)),
(uint32_t) (src1->nb[1] / ggml_type_size(src1->type)),
(uint32_t) (src1->nb[2] / ggml_type_size(src1->type)),
(uint32_t) (src1->nb[3] / ggml_type_size(src1->type)),
(uint32_t) (src2->nb[1] / ggml_type_size(src2->type)),
(uint32_t) (src2->nb[2] / ggml_type_size(src2->type)),
(uint32_t) src3->ne[0],
(uint32_t) (src3->nb[1] / ggml_type_size(src3->type)),
(uint32_t) (src4->nb[1] / ggml_type_size(src4->type)),
(uint32_t) (src4->nb[2] / ggml_type_size(src4->type)),
(uint32_t) (src4->nb[3] / ggml_type_size(src4->type)),
(uint32_t) (src5->nb[1] / ggml_type_size(src5->type)),
(uint32_t) (src5->nb[2] / ggml_type_size(src5->type)),
(uint32_t) (src5->nb[3] / ggml_type_size(src5->type)),
(uint32_t) src0->ne[0],
(uint32_t) src0->ne[1],
(uint32_t) src0->ne[2],
(uint32_t) src4->ne[1],
(uint32_t) src1->ne[2],
(uint32_t) src1->ne[3],
(uint32_t) ggml_nelements(src1),
};
std::vector<wgpu::BindGroupEntry> entries = {
ggml_webgpu_make_tensor_bind_group_entry(ctx, 0, src0), ggml_webgpu_make_tensor_bind_group_entry(ctx, 1, src1),
ggml_webgpu_make_tensor_bind_group_entry(ctx, 2, src2), ggml_webgpu_make_tensor_bind_group_entry(ctx, 3, src3),
ggml_webgpu_make_tensor_bind_group_entry(ctx, 4, src4), ggml_webgpu_make_tensor_bind_group_entry(ctx, 5, src5),
ggml_webgpu_make_tensor_bind_group_entry(ctx, 6, src6), ggml_webgpu_make_tensor_bind_group_entry(ctx, 7, dst),
};
const uint32_t total_wg = (uint32_t) (src0->ne[1] * src0->ne[2] * src1->ne[3]);
const uint32_t max_wg_per_dim = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
uint32_t wg_x;
uint32_t wg_y;
compute_2d_workgroups(total_wg, max_wg_per_dim, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
static webgpu_encoded_op ggml_webgpu_gated_delta_net(webgpu_context & ctx,
ggml_tensor * src0,
ggml_tensor * src1,
@@ -2764,6 +2838,9 @@ static std::optional<webgpu_encoded_op> ggml_webgpu_encode(webgpu_context ctx,
return ggml_webgpu_solve_tri(ctx, src0, src1, node);
case GGML_OP_SSM_CONV:
return ggml_webgpu_ssm_conv(ctx, src0, src1, node);
case GGML_OP_SSM_SCAN:
return ggml_webgpu_ssm_scan(ctx, src0, src1, src2, node->src[3], node->src[4], node->src[5], node->src[6],
node);
case GGML_OP_GATED_DELTA_NET:
return ggml_webgpu_gated_delta_net(ctx, src0, src1, src2, node->src[3], node->src[4], node->src[5], node);
case GGML_OP_PAD:
@@ -2822,7 +2899,10 @@ static void ggml_backend_webgpu_collect_profile_results(webgpu_context &
}
#endif
// Don't bother checking set_rows index overflow for now, since practically the WebGPU doesn't need to support
// models that would require it right now.
static void ggml_backend_webgpu_check_set_rows(webgpu_context & ctx, uint32_t & num_inflight_batches) {
#ifdef GGML_WEBGPU_CHECK_SET_ROWS
wgpu::CommandEncoder encoder = ctx->global_ctx->device.CreateCommandEncoder();
encoder.CopyBufferToBuffer(ctx->set_rows_dev_error_buf, 0, ctx->set_rows_host_error_buf, 0,
ctx->set_rows_host_error_buf.GetSize());
@@ -2835,6 +2915,10 @@ static void ggml_backend_webgpu_check_set_rows(webgpu_context & ctx, uint32_t &
GGML_ABORT("ggml_webgpu: SET_ROWS index > 2^32, unsupported.");
}
ctx->set_rows_host_error_buf.Unmap();
#else
GGML_UNUSED(ctx);
GGML_UNUSED(num_inflight_batches);
#endif
}
static ggml_status ggml_backend_webgpu_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
@@ -2920,8 +3004,6 @@ static ggml_status ggml_backend_webgpu_graph_compute(ggml_backend_t backend, str
ggml_backend_webgpu_check_set_rows(ctx, num_inflight_batches);
}
ggml_backend_webgpu_wait_queue(ctx->global_ctx);
WEBGPU_CPU_PROFILE_TOTAL_END(graph_compute, ctx->global_ctx);
return GGML_STATUS_SUCCESS;
}
@@ -3941,6 +4023,10 @@ static bool ggml_backend_webgpu_device_supports_op(ggml_backend_dev_t dev, const
case GGML_OP_SSM_CONV:
supports_op = op->type == GGML_TYPE_F32;
break;
case GGML_OP_SSM_SCAN:
supports_op = op->type == GGML_TYPE_F32 &&
src0->ne[0] <= ctx->webgpu_global_ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup;
break;
case GGML_OP_GATED_DELTA_NET:
{
const uint32_t s_v = (uint32_t) src2->ne[0];

View File

@@ -0,0 +1,168 @@
#ifdef USE_SUBGROUP_REDUCTION
enable subgroups;
#endif
struct Params {
offset_s: u32,
offset_x: u32,
offset_dt: u32,
offset_A: u32,
offset_B: u32,
offset_C: u32,
offset_ids: u32,
offset_dst: u32,
stride_s1: u32,
stride_s2: u32,
stride_s3: u32,
stride_x1: u32,
stride_x2: u32,
stride_x3: u32,
stride_dt1: u32,
stride_dt2: u32,
a_ne0: u32,
stride_A1: u32,
stride_B1: u32,
stride_B2: u32,
stride_B3: u32,
stride_C1: u32,
stride_C2: u32,
stride_C3: u32,
d_state: u32,
d_inner: u32,
n_head: u32,
n_group: u32,
n_seq_tokens: u32,
n_seqs: u32,
y_elems: u32,
};
@group(0) @binding(0) var<storage, read_write> s_in: array<f32>;
@group(0) @binding(1) var<storage, read_write> x: array<f32>;
@group(0) @binding(2) var<storage, read_write> dt: array<f32>;
@group(0) @binding(3) var<storage, read_write> A: array<f32>;
@group(0) @binding(4) var<storage, read_write> B: array<f32>;
@group(0) @binding(5) var<storage, read_write> C: array<f32>;
@group(0) @binding(6) var<storage, read_write> ids: array<i32>;
@group(0) @binding(7) var<storage, read_write> dst: array<f32>;
@group(0) @binding(8) var<uniform> params: Params;
var<workgroup> shared_x_dt: array<f32, TOKENS_PER_TILE>;
var<workgroup> shared_dtsp: array<f32, TOKENS_PER_TILE>;
var<workgroup> shared_reduce: array<f32, TOKENS_PER_TILE * WG_SIZE>;
fn reduce_base(token_in_tile: u32) -> u32 {
return token_in_tile * WG_SIZE;
}
@compute @workgroup_size(WG_SIZE)
fn main(
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(workgroup_id) wg_id: vec3<u32>,
@builtin(num_workgroups) num_wg: vec3<u32>
#ifdef USE_SUBGROUP_REDUCTION
, @builtin(subgroup_id) subgroup_id: u32,
@builtin(subgroup_invocation_id) subgroup_invocation_id: u32,
@builtin(num_subgroups) num_subgroups: u32
#endif
) {
let tid = local_id.x;
let wg_linear = wg_id.y * num_wg.x + wg_id.x;
let i1 = wg_linear % params.d_inner;
let head_seq = wg_linear / params.d_inner;
let ir = head_seq % params.n_head;
let i3 = head_seq / params.n_head;
let state_slot = u32(ids[params.offset_ids + i3]);
let g = ir / (params.n_head / params.n_group);
let s_idx = params.offset_s + tid + i1 * params.stride_s1 + ir * params.stride_s2 + state_slot * params.stride_s3;
var s_prev = s_in[s_idx];
let A0 = A[params.offset_A + (tid % params.a_ne0) + ir * params.stride_A1];
for (var token_base = 0u; token_base < params.n_seq_tokens; token_base += TOKENS_PER_TILE) {
if (tid < TOKENS_PER_TILE) {
let token = token_base + tid;
if (token < params.n_seq_tokens) {
let x_idx = params.offset_x + i1 + ir * params.stride_x1 + token * params.stride_x2 + i3 * params.stride_x3;
let dt_idx = params.offset_dt + ir + token * params.stride_dt1 + i3 * params.stride_dt2;
let dt0 = dt[dt_idx];
let dtsp = select(log(1.0 + exp(dt0)), dt0, dt0 > 20.0);
shared_dtsp[tid] = dtsp;
shared_x_dt[tid] = x[x_idx] * dtsp;
}
}
workgroupBarrier();
for (var token_in_tile = 0u; token_in_tile < TOKENS_PER_TILE; token_in_tile++) {
let token = token_base + token_in_tile;
if (token >= params.n_seq_tokens) {
break;
}
let x_dt = shared_x_dt[token_in_tile];
let dA = exp(shared_dtsp[token_in_tile] * A0);
let reduce_idx = reduce_base(token_in_tile) + tid;
let b_idx = params.offset_B + tid + g * params.stride_B1 + token * params.stride_B2 + i3 * params.stride_B3;
let c_idx = params.offset_C + tid + g * params.stride_C1 + token * params.stride_C2 + i3 * params.stride_C3;
let s = s_prev * dA + B[b_idx] * x_dt;
s_prev = s;
#ifdef USE_SUBGROUP_REDUCTION
let subgroup_partial = subgroupAdd(s * C[c_idx]);
if (subgroup_invocation_id == 0u) {
shared_reduce[reduce_idx - tid + subgroup_id] = subgroup_partial;
}
#else
shared_reduce[reduce_idx] = s * C[c_idx];
#endif
workgroupBarrier();
#ifdef USE_SUBGROUP_REDUCTION
if (tid == 0u) {
var sum = 0.0;
for (var sg = 0u; sg < num_subgroups; sg++) {
sum += shared_reduce[reduce_base(token_in_tile) + sg];
}
let y_idx =
params.offset_dst + i1 + ir * params.d_inner + token * (params.n_head * params.d_inner) +
i3 * (params.n_seq_tokens * params.n_head * params.d_inner);
dst[y_idx] = sum;
}
#else
for (var stride = WG_SIZE / 2u; stride > 0u; stride >>= 1u) {
if (tid < stride) {
shared_reduce[reduce_idx] += shared_reduce[reduce_idx + stride];
}
workgroupBarrier();
}
if (tid == 0u) {
let y_idx =
params.offset_dst + i1 + ir * params.d_inner + token * (params.n_head * params.d_inner) +
i3 * (params.n_seq_tokens * params.n_head * params.d_inner);
dst[y_idx] = shared_reduce[reduce_base(token_in_tile)];
}
#endif
workgroupBarrier();
}
}
let state_idx =
params.offset_dst + params.y_elems + tid + i1 * params.d_state + ir * (params.d_state * params.d_inner) +
i3 * (params.d_state * params.d_inner * params.n_head);
dst[state_idx] = s_prev;
}

View File

@@ -3,8 +3,12 @@
Test structured output capability via chat completions endpoint.
Each test case contains:
- response_format: OpenAI-compatible response_format specification
(json_schema only — llama.cpp does not support json_object)
- response_format: OpenAI-compatible response_format specification.
Both "json_schema" and "json_object" are accepted; with
"json_object" a schema can be supplied via extra_body.
- extra_body (optional): dict of extra top-level request fields merged into
the request payload (mirrors the OpenAI SDK's extra_body
feature; llama.cpp reads a top-level "json_schema" here).
- messages: initial conversation messages
- tools (optional): tool definitions (for mixed tool + structured tests)
- mock_tool_responses (optional): dict mapping tool_name -> callable(arguments) -> str (JSON)
@@ -81,11 +85,14 @@ def print_info(msg):
_print(f"{DIM}{msg}{RESET}")
def print_schema_note(label, rf):
def print_schema_note(label, rf, extra_body=None):
kind = rf.get("type", "?")
name = ""
if kind == "json_schema":
name = rf.get("json_schema", {}).get("name", "")
elif kind == "json_object" and extra_body and "json_schema" in extra_body:
extra_schema = extra_body["json_schema"] or {}
name = extra_schema.get("title") or "extra_body.json_schema"
_print(f"{DIM}{MAGENTA} ⟐ response_format [{label}]: {kind}"
f"{(' / ' + name) if name else ''}{RESET}")
@@ -95,17 +102,20 @@ def print_schema_note(label, rf):
# ---------------------------------------------------------------------------
def chat_completion(url, messages, tools=None, response_format=None, stream=False):
def chat_completion(url, messages, tools=None, response_format=None, stream=False,
extra_body=None):
payload = {
"messages": messages,
"stream": stream,
"max_tokens": 4096,
"max_tokens": 8192,
}
if tools:
payload["tools"] = tools
payload["tool_choice"] = "auto"
if response_format is not None:
payload["response_format"] = response_format
if extra_body:
payload.update(extra_body)
try:
response = requests.post(url, json=payload, stream=stream)
@@ -180,7 +190,7 @@ def chat_completion(url, messages, tools=None, response_format=None, stream=Fals
def run_tool_loop(
url, messages, tools, mock_tool_responses, stream, response_format=None,
max_turns=6,
extra_body=None, max_turns=6,
):
"""
Drive the tool-call loop. If response_format is provided it is applied to
@@ -191,7 +201,8 @@ def run_tool_loop(
for _ in range(max_turns):
result = chat_completion(
url, msgs, tools=tools, response_format=response_format, stream=stream
url, msgs, tools=tools, response_format=response_format, stream=stream,
extra_body=extra_body,
)
if result is None:
return all_tool_calls, msgs, None
@@ -274,7 +285,8 @@ def run_test(url, test_case, stream):
print_header(f"{name} [{mode}] ({apply_stage})")
response_format = test_case["response_format"]
print_schema_note(apply_stage, response_format)
extra_body = test_case.get("extra_body")
print_schema_note(apply_stage, response_format, extra_body)
tools = test_case.get("tools")
mocks = test_case.get("mock_tool_responses") or {}
@@ -290,6 +302,7 @@ def run_test(url, test_case, stream):
mock_tool_responses=mocks,
stream=stream,
response_format=response_format,
extra_body=extra_body,
)
elif apply_stage == "after_tools":
# Phase 1: plain tool loop, no response_format applied yet.
@@ -314,7 +327,8 @@ def run_test(url, test_case, stream):
# model focuses on producing the schema-constrained answer.
_print(f"\n{DIM}{MAGENTA} ⟐ follow-up turn with response_format applied{RESET}")
result = chat_completion(
url, msgs, tools=None, response_format=response_format, stream=stream
url, msgs, tools=None, response_format=response_format, stream=stream,
extra_body=extra_body,
)
final_content = result["content"] if result else None
else:
@@ -481,6 +495,51 @@ def _validate_sentiment(parsed):
return True, f"sentiment={parsed['sentiment']} conf={conf} kws={kws}"
# ---- Test: json_object + extra_body.json_schema (always) ----
#
# Exercises the llama.cpp-specific path where the OpenAI SDK would send
# response_format={"type": "json_object"} and tunnel the schema through
# extra_body.json_schema (which becomes a top-level "json_schema" field on
# the request body).
_PRODUCT_JSON_OBJECT_SCHEMA = {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://example.com/product.schema.json",
"title": "Product",
"description": "A product in the catalog",
"type": "object",
}
PRODUCT_JSON_OBJECT_TEST_CASE = {
"name": "json_object response_format with extra_body json_schema",
"response_format": {"type": "json_object"},
"extra_body": {"json_schema": _PRODUCT_JSON_OBJECT_SCHEMA},
"apply_stage": "always",
"messages": [
{
"role": "system",
"content": (
"Extract structured data from the provided text according to the "
"JSON schema. Return only valid JSON matching the schema exactly."
),
},
{
"role": "user",
"content": "Product: Wireless Headphones, ID: 101, In Stock: Yes",
},
],
"validate": lambda parsed, tcs, raw: _validate_product_json_object(parsed),
}
def _validate_product_json_object(parsed):
if not isinstance(parsed, dict):
return False, f"expected JSON object, got {type(parsed).__name__}: {parsed!r}"
if not parsed:
return False, f"expected non-empty object, got {parsed!r}"
return True, f"product object with {len(parsed)} field(s): {sorted(parsed.keys())}"
# ---- Test 3: Nested recipe schema (always) ----
_RECIPE_SCHEMA = {
@@ -915,6 +974,7 @@ def _validate_country_report(parsed, tcs):
ALL_TEST_CASES = [
BOOK_TEST_CASE,
SENTIMENT_TEST_CASE,
PRODUCT_JSON_OBJECT_TEST_CASE,
RECIPE_TEST_CASE,
SHOP_COMPARISON_TEST_CASE,
COUNTRY_REPORT_TEST_CASE,

View File

@@ -1283,7 +1283,7 @@ static void llama_model_quantize_impl(const std::string & fname_inp, const std::
llama_model_quantize_params llama_model_quantize_default_params() {
llama_model_quantize_params result = {
/*.nthread =*/ 0,
/*.ftype =*/ LLAMA_FTYPE_MOSTLY_Q5_1,
/*.ftype =*/ LLAMA_FTYPE_MOSTLY_Q8_0,
/*.output_tensor_type =*/ GGML_TYPE_COUNT,
/*.token_embedding_type =*/ GGML_TYPE_COUNT,
/*.allow_requantize =*/ false,

View File

@@ -947,7 +947,9 @@ json oaicompat_chat_params_parse(
json response_format = json_value(body, "response_format", json::object());
std::string response_type = json_value(response_format, "type", std::string());
if (response_type == "json_object") {
json_schema = json_value(response_format, "schema", json::object());
if (response_format.contains("schema") || json_schema.empty()) {
json_schema = json_value(response_format, "schema", json::object());
}
} else if (response_type == "json_schema") {
auto schema_wrapper = json_value(response_format, "json_schema", json::object());
json_schema = json_value(schema_wrapper, "schema", json::object());