model : fix llama_model::n_gpu_layers() (#24188)

ui: run npm install when package-lock.json is newer than node_modules (#24171)
Fix link to available UI settings (#24169)
5 changed files with 81 additions and 40 deletions
--- a/scripts/ui-assets.cmake
+++ b/scripts/ui-assets.cmake
@@ -126,8 +126,22 @@ function(npm_build out_var)
        return()
    endif()

-    if(NOT EXISTS "${UI_SOURCE_DIR}/node_modules")
-        message(STATUS "UI: running npm install (first time)")
+    # npm writes node_modules/.package-lock.json on every successful install,
+    # so a package-lock.json newer than this marker means node_modules is stale
+    set(NPM_MARKER "${UI_SOURCE_DIR}/node_modules/.package-lock.json")
+    set(need_install FALSE)
+    if(NOT EXISTS "${NPM_MARKER}")
+        set(need_install TRUE)
+    else()
+        file(TIMESTAMP "${UI_SOURCE_DIR}/package-lock.json" lock_ts)
+        file(TIMESTAMP "${NPM_MARKER}" marker_ts)
+        if(lock_ts STRGREATER marker_ts)
+            set(need_install TRUE)
+        endif()
+    endif()
+
+    if(need_install)
+        message(STATUS "UI: running npm install")
        execute_process(
            COMMAND ${NPM_EXECUTABLE} install
            WORKING_DIRECTORY "${UI_SOURCE_DIR}"
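
Review note on the hunk above: the staleness check compares the two timestamps as strings with `STRGREATER`. This only orders them chronologically if `file(TIMESTAMP)` yields a fixed-width, most-significant-field-first string (for example `2026-06-05T14:57:32`), which is assumed here rather than shown in the diff. Under that assumption, the decision can be sketched in C++ as follows; `need_install` and its parameters are hypothetical names that mirror the CMake variables, not code from the repository.

```cpp
#include <optional>
#include <string>

// Sketch of the decision made by the patched npm_build() logic.
// lock_ts:   timestamp string of package-lock.json
// marker_ts: timestamp string of node_modules/.package-lock.json,
//            or std::nullopt when the marker file does not exist
// Assumption: both strings use the same fixed-width, year-first format,
// so lexicographic order matches chronological order.
inline bool need_install(const std::string & lock_ts,
                         const std::optional<std::string> & marker_ts) {
    if (!marker_ts) {
        return true; // no marker: npm install has never completed here
    }
    return lock_ts > *marker_ts; // lock file newer than marker: node_modules is stale
}
```

Equal timestamps do not trigger a reinstall, matching the strict `STRGREATER` in the patch. With a one-second timestamp resolution, a lock file modified within the same second as the last install would therefore be treated as up to date.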
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -1636,7 +1636,8 @@ const float * llama_model::tensor_split() const {
 }

 uint32_t llama_model::n_gpu_layers() const {
-    return params.n_gpu_layers >= 0 ? params.n_gpu_layers : hparams.n_layer() + 1;
+    // note: plus 1 for the "output" layer
+    return params.n_gpu_layers >= 0 ? params.n_gpu_layers : hparams.n_layer_all + 1;
 }

 llama_split_mode llama_model::split_mode() const {
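
Review note on the hunk above: the patched expression keeps an explicit non-negative `n_gpu_layers` request as is, and otherwise falls back to every layer plus one for the "output" layer. A minimal standalone sketch of that rule is shown below. The `sketch_params` and `sketch_hparams` structs and the `resolve_n_gpu_layers` function are hypothetical stand-ins for illustration; the real llama.cpp types contain many more fields, and the exact meaning of `n_layer_all` is taken from the diff, not defined here.

```cpp
#include <cstdint>

// Hypothetical stand-in for the user-facing model parameters.
struct sketch_params {
    int32_t n_gpu_layers; // negative value: no explicit request
};

// Hypothetical stand-in for the model hyperparameters.
struct sketch_hparams {
    uint32_t n_layer_all; // layer count used by the patched code
};

// Mirrors the patched return expression:
// an explicit request wins; otherwise use all layers plus the "output" layer.
inline uint32_t resolve_n_gpu_layers(const sketch_params & params,
                                     const sketch_hparams & hparams) {
    return params.n_gpu_layers >= 0
        ? static_cast<uint32_t>(params.n_gpu_layers)
        : hparams.n_layer_all + 1;
}
```

For example, with `n_layer_all` set to 32, a request of `-1` resolves to 33, while a request of `0` stays `0`.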
--- a/tools/quantize/README.md
+++ b/tools/quantize/README.md
@@ -5,62 +5,87 @@ Quantization reduces the precision of model weights (e.g., from 32-bit floats to
 This process however, may introduce some accuracy loss which is usually measured in [Perplexity](https://huggingface.co/docs/transformers/en/perplexity) (ppl) and/or [Kullback–Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) (kld).
 This can be minimized by using a suitable imatrix file.

-You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup.
+You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup. It syncs from llama.cpp `main` every 6 hours.

-Note: It is synced from llama.cpp `main` every 6 hours.
+## Overview

-Example usage:
+Quantization is done in two phases:
+- Convert the original model to GGUF format.
+- Quantize the converted GGUF file.

-```./llama-quantize [options] input-model-f32.gguf [output-model-quant.gguf] type [threads]```
+If the model supports multimodal inputs (images or audio), you also need to convert and quantize the multimodal encoders and projectors.
+
+To perform these tasks, you need to install the Python requirements:

 ```bash
-# from Hugginface, obtain the official meta-llama/Llama-3.1-8B model weights and place them in ./models
-ls ./models
-config.json             model-00001-of-00004.safetensors  model-00004-of-00004.safetensors  README.md                tokenizer.json
-generation_config.json  model-00002-of-00004.safetensors  model.safetensors.index.json      special_tokens_map.json  USE_POLICY.md
-LICENSE                 model-00003-of-00004.safetensors  original                          tokenizer_config.json
-
-# [Optional] for PyTorch .bin models like Mistral-7B
-ls ./models
-<folder containing weights and tokenizer json>
-
-# install Python dependencies
 python3 -m pip install -r requirements.txt
-
-# convert the model to ggml FP16 format
-python3 convert_hf_to_gguf.py ./models/mymodel/
-
-# quantize the model to 4-bits (using Q4_K_M method)
-./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
-
-# update the gguf filetype to current version if older version is now unsupported
-./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
 ```

-Run the quantized model:
+Or if you use `uv`:

 ```bash
-# start inference on a gguf model
-./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"
+uv pip install -r requirements.txt --index-strategy unsafe-best-match
 ```

+## Prepare the input GGUF file
+
+To convert a model from a Hugging Face repo, you can use a command like the following:
+
+```
+python convert_hf_to_gguf.py --outfile gemma-4-E2B-it-bf16.gguf --outtype bf16 --remote google/gemma-4-E2B-it
+```
+
+Notes:
+- In the usual case where the model is distributed in 16-bit format, `--outtype auto` (or omitting `--outtype` entirely) also works well.
+- If you have previously downloaded the model locally, specify the directory and remove the `--remote` flag.
+- For compatibility reasons, the Python requirements install transformers 4, but more and more models (like Gemma 4) require transformers 5. You can safely `pip install -U transformers` to get the latest version.
+
+## Quantize the GGUF
+
+After you have created a high-quality GGUF version of the model, you use `llama-quantize` to apply quantization. For example, quantize to `Q4_K_M` using a command like the following:
+
+```bash
+./build/bin/llama-quantize gemma-4-E2B-it-bf16.gguf gemma-4-E2B-it-Q4_K_M.gguf Q4_K_M
+```
+
+Various quantization methods are described [later in this document](#quantize).
+
 Options:
-* `--allow-requantize` allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
-* `--leave-output-tensor` will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
-* `--pure` disables k-quant mixtures and quantizes all tensors to the same type
-* `--imatrix` uses data in file generated by `llama-imatrix` as importance matrix for quant optimizations (highly recommended)
-* `--include-weights` use an importance matrix for tensor(s) in the list. Cannot be used with `--exclude-weights`
-* `--exclude-weights` use an importance matrix for tensor(s) in the list. Cannot be used with `--include-weights`
+* `--allow-requantize` allow requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
+* `--leave-output-tensor` leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
+* `--pure` disable k-quant mixtures and quantizes all tensors to the same type
+* `--imatrix file_name` use data in file_name as importance matrix for quant optimizations
+* `--include-weights tensor_name` use importance matrix for this tensor (can be specified multiple times)
+* `--exclude-weights tensor_name` use importance matrix for the tensors **not** specified (include/exclude cannot be mixed)
 * `--output-tensor-type` use a specific quant type for the output.weight tensor
 * `--token-embedding-type` use a specific quant type for the token embeddings tensor
-* `--keep-split` will generate the quantized model in the same shards as the input file otherwise it will produce a single quantized file
+* `--keep-split` generate the quantized model in the same shards as the input file instead of a single quantized file

 Advanced options:
 * `--tensor-type` quantize specific tensor(s) to specific quant types. Supports regex syntax. May be specified multiple times.
 * `--prune-layers` prune (remove) the layers in the list
-* `--override-kv` option to override model metadata by key in the quantized model. May be specified multiple times
+* `--override-kv` option to override model metadata by key in the quantized model. May be specified multiple times.

-Examples:
+## (Optional) Convert the multimodal components
+
+llama.cpp will convert the LLM portion of the source model, which is enough for conversational applications. If the model accepts multimodal inputs and you wish to take advantage of them, you need to create a separate GGUF file. This file is generically known as `mmproj`, for "multimedia projector"; however, it may contain various components such as vision or audio encoders in addition to projections.
+
+Multimodal components are usually much smaller than the LLMs they come with. In addition, their quality has a direct impact on the quality of LLM generations, because these components are in charge of preparing the inputs for the LLM: the closer inputs are to data seen during training, the better LLM results will be.
+
+For these reasons, multimodal components are usually kept in a high-quality format such as bf16 or q8. The impact on speed and memory from using a smaller quant is negligible, but overall quality could be impacted.
+
+```bash
+python convert_hf_to_gguf.py --mmproj --outfile mmproj-gemma-4-E2B-it-Q8_0.gguf --outtype q8_0 --remote google/gemma-4-E2B-it
+```
+
+## Run the quantized model
+
+
+```bash
+./build/bin/llama cli -m ./gemma-4-E2B-it-Q4_K_M.gguf --mmproj ./mmproj-gemma-4-E2B-it-Q8_0.gguf --image <input_image> --prompt "Describe this image"
+```
+
+## Quantization Examples

 ```bash
 # naive Q4_K_M quantization using default settings and 8 CPU threads. Output will be "ggml-model-Q4_K_M.gguf"
--- a/tools/server/README.md
+++ b/tools/server/README.md
@@ -1870,4 +1870,4 @@ You can specify default preferences for the web UI using `--ui-config <JSON conf

 > **Note:** The old flags `--webui-config` and `--webui-config-file` are deprecated but still work as aliases.

-You may find available preferences in [settings-config.ts](../ui/src/lib/constants/settings-config.ts).
+You may find available preferences in [settings-keys.ts](../ui/src/lib/constants/settings-keys.ts).
--- a/tools/ui/.npmrc
+++ b/tools/ui/.npmrc
@@ -1 +1,2 @@
 engine-strict=true
+ignore-scripts=true
Author	SHA1	Message	Date
Georgi Gerganov	96fbe00393	model : fix llama_model::n_gpu_layers() (#24188)	2026-06-05 17:11:42 +03:00
Pascal	2016bf2b3b	ui: run npm install when package-lock.json is newer than node_modules (#24171)	2026-06-05 14:57:32 +02:00
Mario	9c955c48b0	Fix link to available UI settings (#24169). The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link.	2026-06-05 14:39:32 +02:00
Xuan-Son Nguyen	cc7bef34e2	ui: add ignore-scripts=true to npmrc (#24149)	2026-06-05 14:31:03 +02:00
Pedro Cuenca	ad1b88ca0d	docs: Update quantization readme (#24133). Squashed changes: Update quantization readme; install requirements; Apply suggestions from code review; dos2unix suggestions. Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-05 12:21:26 +02:00