Compare commits

5 Commits
b9524 ... b9529

Author SHA1 Message Date
Georgi Gerganov
96fbe00393 model : fix llama_model::n_gpu_layers() (#24188) 2026-06-05 17:11:42 +03:00
Pascal
2016bf2b3b ui: run npm install when package-lock.json is newer than node_modules (#24171) 2026-06-05 14:57:32 +02:00
Mario
9c955c48b0 Fix link to available UI settings (#24169)
The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link
2026-06-05 14:39:32 +02:00
Xuan-Son Nguyen
cc7bef34e2 ui: add ignore-scripts=true to npmrc (#24149) 2026-06-05 14:31:03 +02:00
Pedro Cuenca
ad1b88ca0d docs: Update quantization readme (#24133)
* Update quantization readme

* install requirements

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* dos2unix suggestions

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-06-05 12:21:26 +02:00
5 changed files with 81 additions and 40 deletions

@@ -126,8 +126,22 @@ function(npm_build out_var)
         return()
     endif()
-    if(NOT EXISTS "${UI_SOURCE_DIR}/node_modules")
-        message(STATUS "UI: running npm install (first time)")
+    # npm writes node_modules/.package-lock.json on every successful install,
+    # so a package-lock.json newer than this marker means node_modules is stale
+    set(NPM_MARKER "${UI_SOURCE_DIR}/node_modules/.package-lock.json")
+    set(need_install FALSE)
+    if(NOT EXISTS "${NPM_MARKER}")
+        set(need_install TRUE)
+    else()
+        file(TIMESTAMP "${UI_SOURCE_DIR}/package-lock.json" lock_ts)
+        file(TIMESTAMP "${NPM_MARKER}" marker_ts)
+        if(lock_ts STRGREATER marker_ts)
+            set(need_install TRUE)
+        endif()
+    endif()
+    if(need_install)
+        message(STATUS "UI: running npm install")
         execute_process(
             COMMAND ${NPM_EXECUTABLE} install
             WORKING_DIRECTORY "${UI_SOURCE_DIR}"
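
The staleness check in this hunk can be restated outside CMake. The following is a minimal Python sketch of the same decision, not the project's code: the function name `needs_npm_install` is hypothetical, and it compares raw modification times, whereas the CMake version compares the strings returned by `file(TIMESTAMP)` with `STRGREATER` (which orders correctly because the default timestamp format sorts lexicographically). Returning `False` when `package-lock.json` is absent is an assumption made for the sketch.

```python
from pathlib import Path


def needs_npm_install(ui_source_dir: Path) -> bool:
    """Return True when node_modules looks missing or older than package-lock.json."""
    lock = ui_source_dir / "package-lock.json"
    # npm rewrites this hidden lockfile after each successful install
    marker = ui_source_dir / "node_modules" / ".package-lock.json"

    if not marker.exists():
        return True  # never installed, or node_modules was removed
    if not lock.exists():
        return False  # assumption: nothing to compare against, so skip the install
    # a lockfile modified after the last install means node_modules is stale
    return lock.stat().st_mtime > marker.stat().st_mtime
```

Compared with the old `NOT EXISTS node_modules` test, this also triggers a reinstall after a pull that changes `package-lock.json`.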

@@ -1636,7 +1636,8 @@ const float * llama_model::tensor_split() const {
 }
 uint32_t llama_model::n_gpu_layers() const {
-    return params.n_gpu_layers >= 0 ? params.n_gpu_layers : hparams.n_layer() + 1;
+    // note: plus 1 for the "output" layer
+    return params.n_gpu_layers >= 0 ? params.n_gpu_layers : hparams.n_layer_all + 1;
 }
 llama_split_mode llama_model::split_mode() const {
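
The fallback rule in `n_gpu_layers()` is easy to misread in ternary form. Here is a small Python sketch of the same rule, with a hypothetical function name and plain integers standing in for `params.n_gpu_layers` and `hparams.n_layer_all`: a non-negative request is returned unchanged, and a negative request resolves to every layer plus one for the output layer.

```python
def resolve_n_gpu_layers(requested: int, n_layer_all: int) -> int:
    """Mirror of the ternary in the hunk above (names are illustrative)."""
    if requested >= 0:
        # explicit request, including 0 for "no offload"
        return requested
    # negative means "not specified": all layers, plus 1 for the "output" layer
    return n_layer_all + 1
```

For example, `resolve_n_gpu_layers(-1, 32)` evaluates to 33, while `resolve_n_gpu_layers(10, 32)` stays at 10.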

@@ -5,62 +5,87 @@ Quantization reduces the precision of model weights (e.g., from 32-bit floats to
 This process however, may introduce some accuracy loss which is usually measured in [Perplexity](https://huggingface.co/docs/transformers/en/perplexity) (ppl) and/or [Kullback–Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) (kld).
 This can be minimized by using a suitable imatrix file.
-You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup.
+You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup. It syncs from llama.cpp `main` every 6 hours.
-Note: It is synced from llama.cpp `main` every 6 hours.
+## Overview
-Example usage:
+Quantization is done in two phases:
+- Convert the original model to GGUF format.
+- Quantize the converted GGUF file.
-```./llama-quantize [options] input-model-f32.gguf [output-model-quant.gguf] type [threads]```
+If the model supports multimodal inputs (images or audio), you also need to convert and quantize the multimodal encoders and projectors.
+To perform these tasks, you need to install the Python requirements:
 ```bash
-# from Hugginface, obtain the official meta-llama/Llama-3.1-8B model weights and place them in ./models
-ls ./models
-config.json model-00001-of-00004.safetensors model-00004-of-00004.safetensors README.md tokenizer.json
-generation_config.json model-00002-of-00004.safetensors model.safetensors.index.json special_tokens_map.json USE_POLICY.md
-LICENSE model-00003-of-00004.safetensors original tokenizer_config.json
-# [Optional] for PyTorch .bin models like Mistral-7B
-ls ./models
-<folder containing weights and tokenizer json>
-# install Python dependencies
 python3 -m pip install -r requirements.txt
-# convert the model to ggml FP16 format
-python3 convert_hf_to_gguf.py ./models/mymodel/
-# quantize the model to 4-bits (using Q4_K_M method)
-./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
-# update the gguf filetype to current version if older version is now unsupported
-./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
 ```
-Run the quantized model:
+Or if you use `uv`:
 ```bash
-# start inference on a gguf model
-./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"
+uv pip install -r requirements.txt --index-strategy unsafe-best-match
 ```
+## Prepare the input GGUF file
+To convert a model from a Hugging Face repo, you can use a command like the following:
+```
+python convert_hf_to_gguf.py --outfile gemma-4-E2B-it-bf16.gguf --outtype bf16 --remote google/gemma-4-E2B-it
+```
+Notes:
+- In the usual case where the model is distributed in 16-bit format, `--outtype auto` (or omitting `--outtype` entirely) also works well.
+- If you have previously downloaded the model locally, specify the directory and remove the `--remote` flag.
+- For compatibility reasons, the Python requirements install transformers 4, but more and more models (like Gemma 4) require transformers 5. You can safely `pip install -U transformers` to get the latest version.
+## Quantize the GGUF
+After you have created a high-quality GGUF version of the model, you use `llama-quantize` to apply quantization. For example, quantize to `Q4_K_M` using a command like the following:
+```bash
+./build/bin/llama-quantize gemma-4-E2B-it-bf16.gguf gemma-4-E2B-it-Q4_K_M.gguf Q4_K_M
+```
+Various quantization methods are described [later in this document](#quantize).
 Options:
-* `--allow-requantize` allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
-* `--leave-output-tensor` will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
-* `--pure` disables k-quant mixtures and quantizes all tensors to the same type
-* `--imatrix` uses data in file generated by `llama-imatrix` as importance matrix for quant optimizations (highly recommended)
-* `--include-weights` use an importance matrix for tensor(s) in the list. Cannot be used with `--exclude-weights`
-* `--exclude-weights` use an importance matrix for tensor(s) in the list. Cannot be used with `--include-weights`
+* `--allow-requantize` allow requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
+* `--leave-output-tensor` leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
+* `--pure` disable k-quant mixtures and quantizes all tensors to the same type
+* `--imatrix file_name` use data in file_name as importance matrix for quant optimizations
+* `--include-weights tensor_name` use importance matrix for this tensor (can be specified multiple times)
+* `--exclude-weights tensor_name` use importance matrix for the tensors **not** specified (include/exclude cannot be mixed)
 * `--output-tensor-type` use a specific quant type for the output.weight tensor
 * `--token-embedding-type` use a specific quant type for the token embeddings tensor
-* `--keep-split` will generate the quantized model in the same shards as the input file otherwise it will produce a single quantized file
+* `--keep-split` generate the quantized model in the same shards as the input file instead of a single quantized file
 Advanced options:
 * `--tensor-type` quantize specific tensor(s) to specific quant types. Supports regex syntax. May be specified multiple times.
 * `--prune-layers` prune (remove) the layers in the list
-* `--override-kv` option to override model metadata by key in the quantized model. May be specified multiple times
+* `--override-kv` option to override model metadata by key in the quantized model. May be specified multiple times.
-Examples:
+## (Optional) Convert the multimodal components
+llama.cpp will convert the LLM portion of the source model, which is enough for conversational applications. If the model accepts multimodal inputs and you wish to take advantage of them, you need to create a separate GGUF file. This file is generically known as `mmproj`, for "multimedia projector"; however, it may contain various components such as vision or audio encoders in addition to projections.
+Multimodal components are usually much smaller than the LLMs they come with. In addition, their quality has a direct impact on the quality of LLM generations, because these components are in charge of preparing the inputs for the LLM: the closer inputs are to data seen during training, the better LLM results will be.
+For these reasons, multimodal components are usually kept in a high-quality format such as bf16 or q8. The impact on speed and memory from using a smaller quant is negligible, but overall quality could be impacted.
+```bash
+python convert_hf_to_gguf.py --mmproj --outfile mmproj-gemma-4-E2B-it-Q8_0.gguf --outtype q8_0 --remote google/gemma-4-E2B-it
+```
+## Run the quantized model
+```bash
+./build/bin/llama cli -m ./gemma-4-E2B-it-Q4_K_M.gguf --mmproj ./mmproj-gemma-4-E2B-it-Q8_0.gguf --image <input_image> --prompt "Describe this image"
+```
+## Quantization Examples
 ```bash
 # naive Q4_K_M quantization using default settings and 8 CPU threads. Output will be "ggml-model-Q4_K_M.gguf"
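
The two phases introduced by the README changes in this hunk (convert to GGUF, then quantize) could be scripted along these lines. This is a sketch only: the helper names `convert_cmd` and `quantize_cmd` are hypothetical, the paths assume a llama.cpp checkout with `llama-quantize` built under `./build/bin`, and the code only assembles argument lists from the flags shown in the diff without running anything.

```python
import shlex
from typing import List


def convert_cmd(repo_id: str, outfile: str, outtype: str = "bf16") -> List[str]:
    """Phase 1: convert a Hugging Face repo to a GGUF file."""
    return [
        "python", "convert_hf_to_gguf.py",
        "--outfile", outfile,
        "--outtype", outtype,
        "--remote", repo_id,
    ]


def quantize_cmd(src: str, dst: str, quant_type: str = "Q4_K_M") -> List[str]:
    """Phase 2: quantize the converted GGUF file."""
    return ["./build/bin/llama-quantize", src, dst, quant_type]


if __name__ == "__main__":
    steps = [
        convert_cmd("google/gemma-4-E2B-it", "gemma-4-E2B-it-bf16.gguf"),
        quantize_cmd("gemma-4-E2B-it-bf16.gguf", "gemma-4-E2B-it-Q4_K_M.gguf"),
    ]
    for step in steps:
        # print the shell-quoted command; pass `step` to subprocess.run to execute it
        print(shlex.join(step))
```

Keeping the commands as lists rather than strings avoids shell-quoting problems if a model path contains spaces.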

@@ -1870,4 +1870,4 @@ You can specify default preferences for the web UI using `--ui-config <JSON conf
 > **Note:** The old flags `--webui-config` and `--webui-config-file` are deprecated but still work as aliases.
-You may find available preferences in [settings-config.ts](../ui/src/lib/constants/settings-config.ts).
+You may find available preferences in [settings-keys.ts](../ui/src/lib/constants/settings-keys.ts).

@@ -1 +1,2 @@
 engine-strict=true
+ignore-scripts=true