CUDA: enroll mul_mat_vec_q_moe into pdl (#24087)

* Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW

Data collected on a B4500:

Before
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```

After
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```

Server launched with:
```
➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
    -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    -ngl all \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

* LC to overlap with following kernels
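
For a quick read of the Before/After figures in the commit message, the per-prompt speedup can be derived from the two tok/s columns. The snippet below is only a sketch: the dictionaries copy the reported numbers, and the `speedups` helper is a hypothetical name that is not part of `mtp-bench.py`.

```python
# tok/s values copied from the Before/After runs in the commit message.
BEFORE = {
    "code_python": 202.8, "code_cpp": 212.8, "explain_concept": 196.4,
    "summarize": 226.6, "qa_factual": 225.1, "translation": 201.5,
    "creative_short": 197.2, "stepwise_math": 209.2, "long_code_review": 208.9,
}
AFTER = {
    "code_python": 211.9, "code_cpp": 224.6, "explain_concept": 207.8,
    "summarize": 240.2, "qa_factual": 238.5, "translation": 213.4,
    "creative_short": 208.8, "stepwise_math": 221.7, "long_code_review": 220.7,
}


def speedups(before, after):
    """Return the per-prompt throughput ratio after/before (hypothetical helper)."""
    return {name: after[name] / before[name] for name in before}


if __name__ == "__main__":
    ratios = speedups(BEFORE, AFTER)
    for name, ratio in ratios.items():
        print(f"{name:18s} {100.0 * (ratio - 1.0):+.1f}%")
    mean = sum(ratios.values()) / len(ratios)
    print(f"{'mean':18s} {100.0 * (mean - 1.0):+.1f}%")
```

On these numbers every prompt gains between about 4.5% and 6% in tok/s, while the draft and acceptance counts are identical in both runs.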
ci : build-msys job slimming [no ci] (#24157)
2026-06-06 02:52:58 +02:00 · 2026-06-05 08:37:34 +02:00 · 2026-06-05 07:57:36 +02:00 · 2026-06-05 08:10:31 +03:00 · 2026-06-04 19:30:59 +03:00 · 2026-06-04 19:23:48 +03:00
42 changed files with 3086 additions and 1487 deletions
--- a/.github/workflows/build-msys.yml
+++ b/.github/workflows/build-msys.yml
@@ -27,8 +27,8 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - { sys: UCRT64,  env: ucrt-x86_64,  build: Release }
-          - { sys: CLANG64, env: clang-x86_64, build: Release }
+          - { sys: UCRT64,  env: ucrt-x86_64,  compiler: gcc,   build: Release }
+          - { sys: CLANG64, env: clang-x86_64, compiler: clang, build: Release }

    steps:
      - name: Clone
@@ -48,9 +48,7 @@ jobs:
          update: true
          msystem: ${{matrix.sys}}
          install: >-
-            base-devel
-            git
-            mingw-w64-${{matrix.env}}-toolchain
+            mingw-w64-${{matrix.env}}-${{matrix.compiler}}
            mingw-w64-${{matrix.env}}-cmake
            mingw-w64-${{matrix.env}}-openblas

--- a/common/CMakeLists.txt
+++ b/common/CMakeLists.txt
@@ -78,6 +78,8 @@ add_library(${TARGET}
    hf-cache.cpp
    hf-cache.h
    http.h
+    imatrix-loader.cpp
+    imatrix-loader.h
    json-partial.cpp
    json-partial.h
    json-schema-to-grammar.cpp
--- a/common/arg.cpp
+++ b/common/arg.cpp
@@ -446,6 +446,12 @@ bool common_params_handle_models(common_params & params, llama_example curr_ex)
    opts.download_mtp    = spec_type_draft_mtp;
    opts.download_mmproj = !params.no_mmproj;

+    // sub-models (draft, mmproj, vocoder) are explicitly specified by the user,
+    // so we should not auto-discover mtp/mmproj siblings for them
+    common_download_opts sub_opts = opts;
+    sub_opts.download_mtp    = false;
+    sub_opts.download_mmproj = false;
+
    try {
        auto res = common_params_handle_model(params.model, opts);
        if (params.no_mmproj) {
@@ -457,7 +463,7 @@ bool common_params_handle_models(common_params & params, llama_example curr_ex)
        // only download mmproj if the current example is using it
        for (const auto & ex : mmproj_examples) {
            if (curr_ex == ex) {
-                common_params_handle_model(params.mmproj, opts);
+                common_params_handle_model(params.mmproj, sub_opts);
                break;
            }
        }
@@ -470,8 +476,8 @@ bool common_params_handle_models(common_params & params, llama_example curr_ex)
            params.speculative.draft.mparams.url.empty()) {
            params.speculative.draft.mparams.path = res.mtp.path;
        }
-        common_params_handle_model(params.speculative.draft.mparams, opts);
-        common_params_handle_model(params.vocoder.model,             opts);
+        common_params_handle_model(params.speculative.draft.mparams, sub_opts);
+        common_params_handle_model(params.vocoder.model,             sub_opts);
        return true;
    } catch (const common_skip_download_exception &) {
        return false;
--- a/common/imatrix-loader.cpp
+++ b/common/imatrix-loader.cpp
@@ -0,0 +1,165 @@
+#include "imatrix-loader.h"
+#include "common.h"
+#include "log.h"
+#include "gguf.h"
+
+#include <cmath>
+#include <cstring>
+#include <fstream>
+
+static bool common_imatrix_load_legacy(const std::string & fname, common_imatrix & imatrix) {
+    std::ifstream in(fname, std::ios::binary);
+    if (!in) {
+        LOG_ERR("%s: failed to open %s\n", __func__, fname.c_str());
+        return false;
+    }
+
+    int n_entries;
+    in.read((char *) &n_entries, sizeof(n_entries));
+    if (in.fail() || n_entries < 1) {
+        LOG_ERR("%s: no data in file %s\n", __func__, fname.c_str());
+        return false;
+    }
+
+    for (int i = 0; i < n_entries; ++i) {
+        int32_t len = 0;
+        in.read((char *) &len, sizeof(len));
+        std::vector<char> name_as_vec(len + 1);
+        in.read((char *) name_as_vec.data(), len);
+        if (in.fail()) {
+            LOG_ERR("%s: failed reading name for entry %d from %s\n", __func__, i + 1, fname.c_str());
+            return false;
+        }
+        name_as_vec[len] = 0;
+        std::string name{ name_as_vec.data() };
+
+        int32_t ncall = 0;
+        in.read((char *) &ncall, sizeof(ncall));
+        int32_t nval = 0;
+        in.read((char *) &nval, sizeof(nval));
+        if (in.fail() || nval < 1) {
+            LOG_ERR("%s: failed reading number of values for entry %d\n", __func__, i);
+            return false;
+        }
+
+        auto & e = imatrix.entries[std::move(name)];
+        e.sums.resize(nval);
+        in.read((char *) e.sums.data(), nval * sizeof(float));
+        if (in.fail()) {
+            LOG_ERR("%s: failed reading data for entry %d\n", __func__, i);
+            return false;
+        }
+
+        e.counts.resize(1);
+        e.counts[0] = ncall;
+    }
+
+    // the trailing data (chunk count + dataset name) is optional
+    if (in.peek() != EOF) {
+        int32_t n_calls = 0;
+        in.read((char *) &n_calls, sizeof(n_calls));
+        imatrix.chunk_count = n_calls;
+
+        if (!in.fail()) {
+            int32_t len = 0;
+            in.read((char *) &len, sizeof(len));
+            if (!in.fail() && len > 0) {
+                std::vector<char> dataset(len + 1, 0);
+                in.read(dataset.data(), len);
+                if (!in.fail()) {
+                    imatrix.datasets.push_back(dataset.data());
+                }
+            }
+        }
+    }
+
+    imatrix.chunk_size = 0;
+    imatrix.is_legacy  = true;
+
+    return true;
+}
+
+bool common_imatrix_load(const std::string & fname, common_imatrix & imatrix) {
+    struct ggml_context * ctx = nullptr;
+    struct gguf_init_params meta_gguf_params = {
+        /* .no_alloc = */ false,
+        /* .ctx      = */ &ctx,
+    };
+    struct gguf_context * ctx_gguf = gguf_init_from_file(fname.c_str(), meta_gguf_params);
+    if (!ctx_gguf) {
+        return common_imatrix_load_legacy(fname, imatrix);
+    }
+
+    const int32_t n_entries = gguf_get_n_tensors(ctx_gguf);
+    if (n_entries < 1) {
+        LOG_ERR("%s: no data in file %s\n", __func__, fname.c_str());
+        gguf_free(ctx_gguf);
+        ggml_free(ctx);
+        return false;
+    }
+
+    const int64_t datasets_key   = gguf_find_key(ctx_gguf, LLM_KV_IMATRIX_DATASETS);
+    const int64_t chunk_count_key = gguf_find_key(ctx_gguf, LLM_KV_IMATRIX_CHUNK_COUNT);
+    const int64_t chunk_size_key  = gguf_find_key(ctx_gguf, LLM_KV_IMATRIX_CHUNK_SIZE);
+
+    if (datasets_key != -1 && gguf_get_arr_type(ctx_gguf, datasets_key) == GGUF_TYPE_STRING) {
+        const int64_t n = gguf_get_arr_n(ctx_gguf, datasets_key);
+        imatrix.datasets.reserve(imatrix.datasets.size() + n);
+        for (int64_t i = 0; i < n; ++i) {
+            imatrix.datasets.push_back(gguf_get_arr_str(ctx_gguf, datasets_key, i));
+        }
+    }
+
+    imatrix.has_metadata = (datasets_key != -1 && chunk_count_key != -1 && chunk_size_key != -1);
+    imatrix.chunk_count  = (chunk_count_key != -1) ? gguf_get_val_u32(ctx_gguf, chunk_count_key) : 0;
+    imatrix.chunk_size   = (chunk_size_key  != -1) ? gguf_get_val_u32(ctx_gguf, chunk_size_key)  : 0;
+
+    const std::string in_sum2_suffix{ ".in_sum2" };
+    const std::string counts_suffix{ ".counts" };
+
+    std::map<std::string, std::pair<struct ggml_tensor *, struct ggml_tensor *>> sums_counts_for;
+
+    for (struct ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
+        std::string name = cur->name;
+
+        if (name.empty()) { continue; }
+
+        if (string_remove_suffix(name, in_sum2_suffix)) {
+            sums_counts_for[std::move(name)].first = cur;
+        } else if (string_remove_suffix(name, counts_suffix)) {
+            sums_counts_for[std::move(name)].second = cur;
+        }
+    }
+
+    for (const auto & sc : sums_counts_for) {
+        const std::string &        name    = sc.first;
+        const struct ggml_tensor * in_sum2 = sc.second.first;
+        const struct ggml_tensor * counts  = sc.second.second;
+
+        if (!in_sum2 || !counts) {
+            LOG_ERR("%s: mismatched sums and counts for %s\n", __func__, name.c_str());
+            gguf_free(ctx_gguf);
+            ggml_free(ctx);
+            return false;
+        }
+
+        auto & e = imatrix.entries[name];
+
+        const int64_t nval    = ggml_nelements(in_sum2);
+        const int64_t ncounts = ggml_nelements(counts);
+
+        e.sums.resize(nval);
+        for (int64_t j = 0; j < nval; ++j) {
+            e.sums[j] = ((const float *) in_sum2->data)[j];
+        }
+
+        e.counts.resize(ncounts);
+        for (int64_t j = 0; j < ncounts; ++j) {
+            e.counts[j] = std::lround(((const float *) counts->data)[j]);
+        }
+    }
+
+    gguf_free(ctx_gguf);
+    ggml_free(ctx);
+    return true;
+}
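
The legacy (pre-GGUF) layout read by `common_imatrix_load_legacy` above can be summarized with a small standalone reader. This is a sketch only, not part of the change: the function name `read_legacy_imatrix` is hypothetical, little-endian byte order is an assumption (the C++ code reads in native order), and names are decoded as UTF-8 for convenience.

```python
import io
import struct


def read_legacy_imatrix(stream):
    """Parse the legacy imatrix layout from a binary stream (illustrative sketch).

    Layout, assuming little-endian:
      int32 n_entries
      per entry: int32 name_len, name bytes, int32 ncall, int32 nval, nval float32
      optional trailer: int32 chunk count, int32 dataset_len, dataset name bytes
    """
    def read_i32():
        raw = stream.read(4)
        if len(raw) != 4:
            raise ValueError("truncated file")
        return struct.unpack("<i", raw)[0]

    n_entries = read_i32()
    if n_entries < 1:
        raise ValueError("no data in file")

    entries = {}
    for _ in range(n_entries):
        name_len = read_i32()
        if name_len < 0:
            raise ValueError("negative name length")
        name = stream.read(name_len).decode("utf-8")
        ncall = read_i32()
        nval = read_i32()
        if nval < 1:
            raise ValueError("bad value count")
        raw = stream.read(4 * nval)
        if len(raw) != 4 * nval:
            raise ValueError("truncated data")
        # The legacy format stores a single call count per entry.
        entries[name] = {"sums": list(struct.unpack(f"<{nval}f", raw)), "counts": [ncall]}

    # The trailer (chunk count + dataset name) is optional.
    chunk_count, datasets = 0, []
    raw = stream.read(4)
    if len(raw) == 4:
        chunk_count = struct.unpack("<i", raw)[0]
        raw = stream.read(4)
        if len(raw) == 4:
            ds_len = struct.unpack("<i", raw)[0]
            if ds_len > 0:
                datasets.append(stream.read(ds_len).decode("utf-8"))

    return {"entries": entries, "chunk_count": chunk_count, "datasets": datasets}
```

As in the C++ loader, each entry ends up with one count (`ncall`). The caller in `tools/imatrix/imatrix.cpp` then multiplies both the sums and that count by `chunk_size` to reconstruct the raw accumulators.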
--- a/common/imatrix-loader.h
+++ b/common/imatrix-loader.h
@@ -0,0 +1,26 @@
+#pragma once
+
+#include <cstdint>
+#include <map>
+#include <string>
+#include <vector>
+
+inline constexpr const char * LLM_KV_IMATRIX_DATASETS    = "imatrix.datasets";
+inline constexpr const char * LLM_KV_IMATRIX_CHUNK_COUNT = "imatrix.chunk_count";
+inline constexpr const char * LLM_KV_IMATRIX_CHUNK_SIZE  = "imatrix.chunk_size";
+
+struct common_imatrix_entry {
+    std::vector<float>   sums;
+    std::vector<int64_t> counts;
+};
+
+struct common_imatrix {
+    std::map<std::string, common_imatrix_entry> entries;
+    std::vector<std::string> datasets;
+    int32_t chunk_count    = 0;
+    int32_t chunk_size     = 0;
+    bool    is_legacy      = false;
+    bool    has_metadata   = false;
+};
+
+bool common_imatrix_load(const std::string & fname, common_imatrix & imatrix);
--- a/conversion/gemma.py
+++ b/conversion/gemma.py
@@ -798,7 +798,8 @@ class Gemma4VisionAudioModel(MmprojModel):
        # remap audio hparams
        if self.hparams_audio:
            self.hparams_audio["feat_in"] = self.hparams_audio.get("input_feat_size", 128)
-            self.hparams_audio["intermediate_size"] = self.hparams_audio["hidden_size"] * 4
+            if "hidden_size" in self.hparams_audio:
+                self.hparams_audio["intermediate_size"] = self.hparams_audio["hidden_size"] * 4
        else:
            self.has_audio_encoder = False

@@ -872,7 +873,7 @@ class Gemma4UnifiedVisionAudioModel(Gemma4VisionAudioModel):
        assert self.hparams_audio is not None
        text_embd_dim = self.hparams_vision["mm_embed_dim"]
        self.hparams_vision["hidden_size"] = text_embd_dim
-        self.hparams_audio["hidden_size"] = text_embd_dim
+        self.hparams_audio["hidden_size"] = self.hparams_audio["audio_embed_dim"]
        # this is a transformer-less vision tower, the params below are redundant but set to avoid error
        self.hparams_vision["intermediate_size"] = 0
        self.hparams_vision["num_layers"] = 0
@@ -897,7 +898,10 @@ class Gemma4UnifiedVisionAudioModel(Gemma4VisionAudioModel):
            # ggml im2col outputs in RR..GG..BB.. (CHW) order, but weight expects RGBRGB.. (HWC).
            # Permute columns so column i aligns with CHW input position i.
            assert self.hparams_vision is not None
-            p = self.hparams_vision["model_patch_size"]
+            if "model_patch_size" in self.hparams_vision:
+                p = self.hparams_vision["model_patch_size"]
+            else:
+                p = self.hparams_vision["patch_size"] * self.hparams_vision["pooling_kernel_size"]
            i = torch.arange(p * p * 3)
            ch  = i // (p * p)
            row = (i % (p * p)) // p
@@ -908,7 +912,10 @@ class Gemma4UnifiedVisionAudioModel(Gemma4VisionAudioModel):
        elif "patch_ln1.weight" in name or "patch_ln1.bias" in name:
            # same permutation for patch_ln1 as patch_dense to align with CHW input order
            assert self.hparams_vision is not None
-            p = self.hparams_vision["model_patch_size"]
+            if "model_patch_size" in self.hparams_vision:
+                p = self.hparams_vision["model_patch_size"]
+            else:
+                p = self.hparams_vision["patch_size"] * self.hparams_vision["pooling_kernel_size"]
            i = torch.arange(p * p * 3)
            ch  = i // (p * p)
            row = (i % (p * p)) // p
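
The patch-size fallback and the CHW-to-HWC column permutation described in the comments above can be checked in isolation. The sketch below uses plain Python lists instead of `torch`. The hunk is cut off after `row`, so the `col` term and the final index formula are assumptions derived from the comment ("column i aligns with CHW input position i"), and both function names are hypothetical.

```python
def resolve_patch_size(hparams_vision):
    """Prefer model_patch_size; otherwise fall back to patch_size * pooling_kernel_size."""
    if "model_patch_size" in hparams_vision:
        return hparams_vision["model_patch_size"]
    return hparams_vision["patch_size"] * hparams_vision["pooling_kernel_size"]


def chw_to_hwc_perm(p, channels=3):
    """For each CHW position i, return the index of the same element in HWC order.

    Assumed formula: CHW position (ch, row, col) maps to HWC index
    (row * p + col) * channels + ch.
    """
    perm = []
    for i in range(p * p * channels):
        ch = i // (p * p)
        row = (i % (p * p)) // p
        col = i % p
        perm.append((row * p + col) * channels + ch)
    return perm


if __name__ == "__main__":
    p = 2
    # Label a 2x2 RGB patch in HWC order: R0 G0 B0 R1 G1 B1 ...
    hwc = [f"{c}{px}" for px in range(p * p) for c in "RGB"]
    perm = chw_to_hwc_perm(p)
    # Reordering with the permutation yields CHW order: all R, then all G, then all B.
    print([hwc[j] for j in perm])
```

Indexing weight columns with such a permutation leaves the values untouched and only reorders them, which is why the same index vector can be reused for `patch_ln1`.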
--- a/examples/speculative-simple/speculative-simple.cpp
+++ b/examples/speculative-simple/speculative-simple.cpp
@@ -175,7 +175,7 @@ int main(int argc, char ** argv) {
                    llama_memory_seq_pos_max(llama_get_memory(ctx_tgt), seq_id));

            if (use_ckpt_dft) {
-                ckpt.update_dft(ctx_dft.get(), seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                ckpt.update_dft(ctx_dft.get(), seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
            }

            // generate a new draft
@@ -196,12 +196,12 @@ int main(int argc, char ** argv) {
            // this allows us to restore the state if partial draft acceptance occurs
            if (!draft.empty()) {
                if (use_ckpt_tgt) {
-                    ckpt.update_tgt(ctx_tgt, seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                    ckpt.update_tgt(ctx_tgt, seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
                }
            }

            {
-                ckpt.load_dft(ctx_dft.get(), seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                ckpt.load_dft(ctx_dft.get(), seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);

                llama_memory_seq_rm(llama_get_memory(ctx_dft.get()), seq_id, ckpt.pos_max + 1, -1);
            }
@@ -261,13 +261,13 @@ int main(int argc, char ** argv) {
            draft = std::move(ids);

            {
-                ckpt.load_tgt(ctx_tgt, seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                ckpt.load_tgt(ctx_tgt, seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);

                llama_memory_seq_rm(llama_get_memory(ctx_tgt), seq_id, ckpt.pos_max + 1, -1);
            }

            {
-                ckpt.load_dft(ctx_dft.get(), seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                ckpt.load_dft(ctx_dft.get(), seq_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);

                llama_memory_seq_rm(llama_get_memory(ctx_dft.get()), seq_id, ckpt.pos_max + 1, -1);
            }
--- a/ggml/src/ggml-cuda/mmvq.cu
+++ b/ggml/src/ggml-cuda/mmvq.cu
@@ -682,12 +682,16 @@ static __global__ void mul_mat_vec_q(
 template <ggml_type type, int c_rows_per_block>
 __launch_bounds__(get_mmvq_mmid_max_batch_for_device<type>()*ggml_cuda_get_physical_warp_size(), 1)
 static __global__ void mul_mat_vec_q_moe(
-        const void * __restrict__ vx, const void * __restrict__ vy, const int32_t * __restrict__ ids,
-        float * __restrict__ dst,
+        const void * vx_ptr, const void * vy_ptr, const int32_t * ids_ptr,
+        float * dst_ptr,
        const uint32_t ncols_x, const uint3 nchannels_y, const uint32_t nrows_x,
        const uint32_t stride_row_x, const uint32_t stride_col_y, const uint32_t stride_col_dst,
        const uint32_t stride_channel_x, const uint32_t stride_channel_y, const uint32_t stride_channel_dst,
        const uint32_t ncols_dst, const uint32_t ids_stride) {
+    const void    * GGML_CUDA_RESTRICT vx  = vx_ptr;
+    const void    * GGML_CUDA_RESTRICT vy  = vy_ptr;
+    const int32_t * GGML_CUDA_RESTRICT ids = ids_ptr;
+    float         * GGML_CUDA_RESTRICT dst = dst_ptr;

    constexpr int qk  = ggml_cuda_type_traits<type>::qk;
    constexpr int qi  = ggml_cuda_type_traits<type>::qi;
@@ -707,6 +711,7 @@ static __global__ void mul_mat_vec_q_moe(
        return;
    }

+    ggml_cuda_pdl_sync();
    const uint32_t channel_x = ids[channel_dst + token_idx * ids_stride];
    const uint32_t channel_y = fastmodulo(channel_dst, nchannels_y);

@@ -726,6 +731,8 @@ static __global__ void mul_mat_vec_q_moe(
        }
    }

+    ggml_cuda_pdl_lc();
+
    // Warp-level reduction only - no shared memory needed
 #pragma unroll
    for (int i = 0; i < c_rows_per_block; ++i) {
@@ -794,8 +801,9 @@ static void mul_mat_vec_q_moe_launch(
    const int64_t nblocks_rows = (nrows_x + rows_per_block - 1) / rows_per_block;
    const dim3 block_nums(nblocks_rows, nchannels_dst);
    const dim3 block_dims(warp_size, ncols_dst);
+    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(block_nums, block_dims, 0, stream);

-    mul_mat_vec_q_moe<type, rows_per_block><<<block_nums, block_dims, 0, stream>>>(
+    ggml_cuda_kernel_launch(mul_mat_vec_q_moe<type, rows_per_block>, launch_params,
        vx, vy, ids, dst, ncols_x, nchannels_y, nrows_x,
        stride_row_x, stride_col_y, stride_col_dst,
        stride_channel_x, stride_channel_y, stride_channel_dst,
--- a/ggml/src/ggml-sycl/ggml-sycl.cpp
+++ b/ggml/src/ggml-sycl/ggml-sycl.cpp
@@ -3971,7 +3971,9 @@ static bool should_reorder_tensor(ggml_backend_sycl_context& ctx, const ggml_ten
    return !g_ggml_sycl_disable_optimize && //allow optimize, controlled by $GGML_SYCL_DISABLE_OPT
            ctx.opt_feature.reorder &&      //allow this device due to good perf, skip the devices with bad perf.
            dst->op == GGML_OP_MUL_MAT &&   //limit to some supported cases of Q4_0, to do for more cases.
-            dst->src[1]->ne[1]==1 && dst->src[1]->ne[2]==1 && dst->src[1]->ne[3]==1;
+            // ne[1] <= 8 so multi-column decode (spec / MTP verify) also bootstraps the reorder;
+            // all reorderable types have a _switch_ncols kernel.
+            dst->src[1]->ne[1] <= 8 && dst->src[1]->ne[2]==1 && dst->src[1]->ne[3]==1;
 }

 static void opt_for_reorder(ggml_backend_sycl_context * ctx, const ggml_tensor * src0, const ggml_tensor * /* src1 */,
--- a/ggml/src/ggml-sycl/mmvq.cpp
+++ b/ggml/src/ggml-sycl/mmvq.cpp
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -2112,6 +2112,15 @@ llama_memory_i * llama_model::create_memory(const llama_memory_params & params,
                        filter = [n_main](int32_t il) { return (uint32_t)il >= n_main; };
                    }

+                    if (arch == LLM_ARCH_STEP35 && hparams.nextn_predict_layers > 0) {
+                        const uint32_t n_main = hparams.n_layer - hparams.nextn_predict_layers;
+                        if (params.ctx_type == LLAMA_CONTEXT_TYPE_MTP) {
+                            filter = [n_main](int32_t il) { return (uint32_t)il >= n_main; };
+                        } else {
+                            filter = [n_main](int32_t il) { return (uint32_t)il <  n_main; };
+                        }
+                    }
+
                    if (hparams.swa_type != LLAMA_SWA_TYPE_NONE) {
                        GGML_ASSERT(hparams.is_swa_any());
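
The layer filter added for `LLM_ARCH_STEP35` splits the layers at `n_main = n_layer - nextn_predict_layers`: an MTP context keeps only the trailing next-token-prediction layers, and any other context keeps only the main layers. A minimal sketch of that split follows. The function name and the layer counts in the example are hypothetical and not taken from a real model.

```python
def make_layer_filter(n_layer, nextn_predict_layers, is_mtp_context):
    """Return a predicate that says whether layer `il` is kept for this context type."""
    n_main = n_layer - nextn_predict_layers
    if is_mtp_context:
        # MTP context: keep only the trailing prediction layers.
        return lambda il: il >= n_main
    # Any other context: keep only the main layers.
    return lambda il: il < n_main


if __name__ == "__main__":
    n_layer, nextn = 48, 3  # hypothetical sizes for illustration
    main_filter = make_layer_filter(n_layer, nextn, is_mtp_context=False)
    mtp_filter = make_layer_filter(n_layer, nextn, is_mtp_context=True)
    print([il for il in range(n_layer) if mtp_filter(il)])
    print(sum(1 for il in range(n_layer) if main_filter(il)))
```

The two predicates are complementary, so every layer is kept by exactly one of the two context types.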

--- a/tools/imatrix/imatrix.cpp
+++ b/tools/imatrix/imatrix.cpp
@@ -1,5 +1,6 @@
 #include "arg.h"
 #include "common.h"
+#include "imatrix-loader.h"
 #include "log.h"
 #include "llama.h"
 #include "gguf.h"
@@ -34,10 +35,6 @@ static void print_usage(int, char ** argv) {
    LOG("\n");
 }

-static const char * const LLM_KV_IMATRIX_DATASETS    = "imatrix.datasets";
-static const char * const LLM_KV_IMATRIX_CHUNK_COUNT = "imatrix.chunk_count";
-static const char * const LLM_KV_IMATRIX_CHUNK_SIZE  = "imatrix.chunk_size";
-
 struct Stats {
    std::vector<float>   values;
    std::vector<int64_t> counts;
@@ -65,7 +62,6 @@ public:
    bool collect_imatrix(struct ggml_tensor * t, bool ask, void * user_data);
    void save_imatrix_legacy(int32_t ncall = -1) const;
    void save_imatrix(int32_t n_chunk = -1) const;
-    bool load_imatrix_legacy(const char * fname);
    bool load_imatrix(const char * file_name);
    const std::unordered_map<std::string, Stats> & get_mstats() const { return m_stats; }
 private:
@@ -624,204 +620,63 @@ void IMatrixCollector::save_imatrix(int32_t n_chunk) const {
    ggml_free(ctx);
 }

-bool IMatrixCollector::load_imatrix_legacy(const char * fname) {
-    std::ifstream in(fname, std::ios::binary);
-    if (!in) {
-        LOG_ERR("%s: failed to open %s\n", __func__, fname);
-        return false;
-    }
-    int n_entries;
-    in.read((char *) &n_entries, sizeof(n_entries));
-    if (in.fail() || n_entries < 1) {
-        LOG_ERR("%s: no data in file %s\n", __func__, fname);
-        return false;
-    }
-    // Guess the chunk size because it's not stored in the file
-    const int32_t chunk_size = m_params.n_ctx / m_params.n_parallel;
-
-    for (int i = 0; i < n_entries; ++i) {
-        int32_t len = 0;
-        in.read((char *) &len, sizeof(len));
-        std::vector<char> name_as_vec(len + 1);
-        in.read((char *) name_as_vec.data(), len);
-        if (in.fail()) {
-            LOG_ERR("%s: failed reading name for entry %d from %s\n", __func__, i + 1, fname);
-            return false;
-        }
-        name_as_vec[len] = 0;
-        std::string name{ name_as_vec.data() };
-        auto & e = m_stats[std::move(name)];
-        int32_t ncall = 0;
-        in.read((char *) &ncall, sizeof(ncall));
-        int32_t nval = 0;
-        in.read((char *) &nval, sizeof(nval));
-        if (in.fail() || nval < 1) {
-            LOG_ERR("%s: failed reading number of values for entry %d\n", __func__, i);
-            m_stats = {};
-            return false;
-        }
-
-        if (e.values.empty()) {
-            e.values.resize(nval, 0.0f);
-            e.counts.resize(1, 0);
-        }
-
-        std::vector<float> tmp(nval);
-        in.read((char *) tmp.data(), nval * sizeof(float));
-        if (in.fail()) {
-            LOG_ERR("%s: failed reading data for entry %d\n", __func__, i);
-            m_stats = {};
-            return false;
-        }
-
-        // Recreate the state as expected by save_imatrix(), and correct for weighted sum.
-        for (int i = 0; i < nval; i++) {
-            e.values[i] += tmp[i] * chunk_size;
-        }
-        // The legacy format doesn't distinguish the counts for different experts
-        for (size_t j = 0; j < e.counts.size(); ++j) {
-            e.counts[j] += ncall * chunk_size;
-        }
-    }
-
-    {
-        // TODO: extract into its own method; this is also used by the GGUF-based format
-        // Calculate the last chunk count
-        int64_t max_count = 0;
-        for (const auto & stats : m_stats) {
-            for (int64_t count : stats.second.counts) {
-                if (count > max_count) {
-                    max_count = count;
-                }
-            }
-        }
-        m_last_chunk = max_count / (chunk_size);
-    }
-
-    {
-        // Read the number of calls the matrix was computed with
-        int32_t n_calls;
-        in.read((char *) &n_calls, sizeof(n_calls));
-        // ignore it because it's not important
-    }
-
-    // Read the dataset path to include it when writing to GGUF
-    if (!in.fail()){
-        int32_t len = 0;
-        in.read((char *) &len, sizeof(len));
-        if (!in.fail()) {
-            std::vector<char> dataset;
-            dataset.resize(len + 1, 0);
-            in.read(dataset.data(), len);
-            if (!in.fail()) {
-                m_datasets.push_back(dataset.data());
-            }
-        }
-    }
-
-    return true;
-}
-
-// Using GGUF as the file format, for greater extensibility
 bool IMatrixCollector::load_imatrix(const char * file_name) {
-    struct ggml_context * ctx = nullptr;
-    struct gguf_init_params meta_gguf_params = {
-        /* .no_alloc = */ false, // the data is needed
-        /* .ctx      = */ &ctx,
-    };
-    struct gguf_context * ctx_gguf = gguf_init_from_file(file_name, meta_gguf_params);
-    if (!ctx_gguf) {
-        return this->load_imatrix_legacy(file_name);
-    }
-    const int32_t n_entries = gguf_get_n_tensors(ctx_gguf);
-    if (n_entries < 1) {
-        LOG_ERR("%s: no data in file %s\n", __func__, file_name);
-        gguf_free(ctx_gguf);
-        ggml_free(ctx);
+    common_imatrix loaded;
+    if (!common_imatrix_load(file_name, loaded)) {
        return false;
    }

-    const int64_t datasets_key = gguf_find_key(ctx_gguf, LLM_KV_IMATRIX_DATASETS);
-    if (datasets_key != -1 && gguf_get_arr_type(ctx_gguf, datasets_key) == GGUF_TYPE_STRING) {
-        const int64_t n = gguf_get_arr_n(ctx_gguf, datasets_key);
-        m_datasets.reserve(m_datasets.size() + n);
-        for (int64_t i = 0; i < n; ++i) {
-            m_datasets.push_back(gguf_get_arr_str(ctx_gguf, datasets_key, i));
-        }
-    }
-
-    const std::string in_sum2_suffix{ ".in_sum2" };
-    const std::string counts_suffix{ ".counts" };
-
-    // Could re-use m_stats instead, but this allows
-    // checking for completeness of *each* loaded imatrix file
-    // and also makes it easier to re-use a similar implementation in quantize.cpp
-    // Using an ordered map to get a deterministic iteration order.
-    std::map<std::string, std::pair<struct ggml_tensor *, struct ggml_tensor *>> sums_counts_for;
-
-    for (struct ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
-        std::string name = cur->name;
-
-        if (name.empty()) { continue; }
-
-        if (string_remove_suffix(name, in_sum2_suffix)) {
-            // in_sum2
-            sums_counts_for[std::move(name)].first = cur;
-        } else if (string_remove_suffix(name, counts_suffix)) {
-            // counts
-            sums_counts_for[std::move(name)].second = cur;
-        } else {
-            // ignore other tensors
-        }
-    }
-
-    for (const auto & sc : sums_counts_for) {
-        const std::string &        name    = sc.first;
-        const struct ggml_tensor * in_sum2 = sc.second.first;
-        const struct ggml_tensor * counts  = sc.second.second;
-
-        if (!in_sum2 || !counts) {
-            LOG_ERR("%s: mismatched sums and counts for %s\n", __func__, name.c_str());
-            gguf_free(ctx_gguf);
-            ggml_free(ctx);
-            return false;
-        }
+    const int32_t chunk_size = m_params.n_ctx / m_params.n_parallel;
+    const bool is_legacy = loaded.is_legacy;

+    for (auto & [name, entry] : loaded.entries) {
        auto & e = m_stats[name];

-        int64_t nval = ggml_nelements(in_sum2);
-        if (e.values.empty()) {
-            e.values.resize(nval, 0.0f);
-        } else if ((size_t) nval != e.values.size()) {
-            LOG_ERR("%s: mismatched sums size for %s: %zu != %zu\n", __func__, name.c_str(), (size_t) nval, e.values.size());
-            gguf_free(ctx_gguf);
-            ggml_free(ctx);
-            return false;
-        }
+        if (is_legacy) {
+            // Legacy format: sums contain (raw_sum/raw_count)*ncall, counts contain {ncall}
+            // Reconstruct raw form by multiplying by chunk_size
+            if (e.values.empty()) {
+                e.values.resize(entry.sums.size(), 0.0f);
+                e.counts.resize(1, 0);
+            }
+            for (size_t j = 0; j < entry.sums.size(); ++j) {
+                e.values[j] += entry.sums[j] * chunk_size;
+            }
+            for (size_t j = 0; j < e.counts.size(); ++j) {
+                e.counts[j] += entry.counts[0] * chunk_size;
+            }
+        } else {
+            // GGUF format: raw sums and counts, accumulate directly
+            const int64_t nval    = entry.sums.size();
+            const int64_t ncounts = entry.counts.size();

-        int64_t ncounts = ggml_nelements(counts);
-        if (e.counts.empty()) {
-            e.counts.resize(ncounts, 0);
-        } else if (e.counts.size() == 1 && ncounts > 1) {
-            // broadcast, when loading an old imatrix
-            e.counts.resize(ncounts, e.counts[0]);
-        } else if ((size_t) ncounts != e.counts.size()) {
-            LOG_ERR("%s: mismatched counts size for %s: %zu != %zu\n", __func__, name.c_str(), (size_t) ncounts, e.counts.size());
-            gguf_free(ctx_gguf);
-            ggml_free(ctx);
-            return false;
-        }
+            if (e.values.empty()) {
+                e.values.resize(nval, 0.0f);
+            } else if ((size_t) nval != e.values.size()) {
+                LOG_ERR("%s: mismatched sums size for %s: %zu != %zu\n", __func__, name.c_str(), (size_t) nval, e.values.size());
+                return false;
+            }

-        // Recreate the state as expected by save_imatrix()
-        for (int64_t j = 0; j < nval; j++) {
-            e.values[j] += ((const float *) in_sum2->data)[j];
-        }
-        for (int64_t j = 0; j < ncounts; j++) {
-            e.counts[j] += std::lround(((const float *) counts->data)[j]);
+            if (e.counts.empty()) {
+                e.counts.resize(ncounts, 0);
+            } else if (e.counts.size() == 1 && ncounts > 1) {
+                e.counts.resize(ncounts, e.counts[0]);
+            } else if ((size_t) ncounts != e.counts.size()) {
+                LOG_ERR("%s: mismatched counts size for %s: %zu != %zu\n", __func__, name.c_str(), (size_t) ncounts, e.counts.size());
+                return false;
+            }
+
+            for (int64_t j = 0; j < nval; ++j) {
+                e.values[j] += entry.sums[j];
+            }
+            for (int64_t j = 0; j < ncounts; ++j) {
+                e.counts[j] += entry.counts[j];
+            }
        }
    }

-    // TODO: extract into its own method; this is also used by the legacy format
+    m_datasets.insert(m_datasets.end(), loaded.datasets.begin(), loaded.datasets.end());
+
    // Calculate the last chunk count
    int64_t max_count = 0;
    for (const auto & stats : m_stats) {
@@ -831,10 +686,8 @@ bool IMatrixCollector::load_imatrix(const char * file_name) {
            }
        }
    }
-    m_last_chunk = max_count / (m_params.n_ctx / m_params.n_parallel);
+    m_last_chunk = max_count / chunk_size;

-    gguf_free(ctx_gguf);
-    ggml_free(ctx);
    return true;
 }

@@ -1218,6 +1071,9 @@ int main(int argc, char ** argv) {
        return 1;
    }

+    // call set_params before show_statistics so that load_imatrix sees valid n_ctx/n_parallel
+    g_collector.set_params(params);
+
    if (params.show_statistics) {
        if (!show_statistics(params)) {
            return 1;
--- a/tools/quantize/quantize.cpp
+++ b/tools/quantize/quantize.cpp
@@ -2,6 +2,7 @@

 #include "build-info.h"
 #include "common.h"
+#include "imatrix-loader.h"

 #include "gguf.h"

@@ -14,7 +15,6 @@
 #include <vector>
 #include <string>
 #include <unordered_map>
-#include <map>
 #include <fstream>
 #include <filesystem>

@@ -78,11 +78,6 @@ static const char * const LLM_KV_QUANTIZE_IMATRIX_DATASET    = "quantize.imatrix
 static const char * const LLM_KV_QUANTIZE_IMATRIX_N_ENTRIES  = "quantize.imatrix.entries_count";
 static const char * const LLM_KV_QUANTIZE_IMATRIX_N_CHUNKS   = "quantize.imatrix.chunks_count";

-// TODO: share with imatrix.cpp
-static const char * const LLM_KV_IMATRIX_DATASETS    = "imatrix.datasets";
-static const char * const LLM_KV_IMATRIX_CHUNK_COUNT = "imatrix.chunk_count";
-static const char * const LLM_KV_IMATRIX_CHUNK_SIZE  = "imatrix.chunk_size";
-
 static bool striequals(const char * a, const char * b) {
    while (*a && *b) {
        if (std::tolower(*a) != std::tolower(*b)) {
@@ -181,184 +176,84 @@ static void usage(const char * executable) {
    exit(1);
 }

-static int load_legacy_imatrix(const std::string & imatrix_file, std::vector<std::string> & imatrix_datasets, std::unordered_map<std::string, std::vector<float>> & imatrix_data) {
-    std::ifstream in(imatrix_file.c_str(), std::ios::binary);
-    if (!in) {
-        printf("%s: failed to open %s\n",__func__, imatrix_file.c_str());
-        exit(1);
-    }
-    int n_entries;
-    in.read((char *)&n_entries, sizeof(n_entries));
-    if (in.fail() || n_entries < 1) {
-        printf("%s: no data in file %s\n", __func__, imatrix_file.c_str());
-        exit(1);
-    }
-    for (int i = 0; i < n_entries; ++i) {
-        int len; in.read((char *)&len, sizeof(len));
-        std::vector<char> name_as_vec(len+1);
-        in.read((char *)name_as_vec.data(), len);
-        if (in.fail()) {
-            printf("%s: failed reading name for entry %d from %s\n", __func__, i+1, imatrix_file.c_str());
-            exit(1);
-        }
-        name_as_vec[len] = 0;
-        std::string name{name_as_vec.data()};
-        auto & e = imatrix_data[name];
-        int ncall;
-        in.read((char *)&ncall, sizeof(ncall));
-        int nval;
-        in.read((char *)&nval, sizeof(nval));
-        if (in.fail() || nval < 1) {
-            printf("%s: failed reading number of values for entry %d\n", __func__, i);
-            imatrix_data = {};
-            exit(1);
-        }
-        e.resize(nval);
-        in.read((char *)e.data(), nval*sizeof(float));
-        if (in.fail()) {
-            printf("%s: failed reading data for entry %d\n", __func__, i);
-            imatrix_data = {};
-            exit(1);
-        }
-        if (ncall > 0) {
-            for (auto & v : e) {
-                v /= ncall;
-            }
-        }
-
-        if (getenv("LLAMA_TRACE")) {
-            printf("%s: loaded data (size = %6d, ncall = %6d) for '%s'\n", __func__, int(e.size()), ncall, name.c_str());
-        }
-    }
-
-    // latest legacy imatrix version contains the dataset filename at the end of the file
-    int m_last_call = 0;
-    if (in.peek() != EOF) {
-        in.read((char *)&m_last_call, sizeof(m_last_call));
-        int dataset_len;
-        in.read((char *)&dataset_len, sizeof(dataset_len));
-        std::vector<char> dataset_as_vec(dataset_len);
-        in.read(dataset_as_vec.data(), dataset_len);
-        imatrix_datasets.resize(1);
-        imatrix_datasets[0].assign(dataset_as_vec.begin(), dataset_as_vec.end());
-        printf("%s: imatrix dataset='%s'\n", __func__, imatrix_datasets[0].c_str());
-    }
-    printf("%s: loaded %d importance matrix entries from %s computed on %d chunks\n", __func__, int(imatrix_data.size()), imatrix_file.c_str(), m_last_call);
-    return m_last_call;
-}
-
 static int load_imatrix(const std::string & imatrix_file, std::vector<std::string> & imatrix_datasets, std::unordered_map<std::string, std::vector<float>> & imatrix_data) {
-
-    struct ggml_context * ctx = nullptr;
-    struct gguf_init_params meta_gguf_params = {
-        /* .no_alloc = */ false, // the data is needed
-        /* .ctx      = */ &ctx,
-    };
-    struct gguf_context * ctx_gguf = gguf_init_from_file(imatrix_file.c_str(), meta_gguf_params);
-    if (!ctx_gguf) {
-        fprintf(stderr, "%s: imatrix file '%s' is using old format\n", __func__, imatrix_file.c_str());
-        return load_legacy_imatrix(imatrix_file, imatrix_datasets, imatrix_data);
-    }
-    const int32_t n_entries = gguf_get_n_tensors(ctx_gguf);
-    if (n_entries < 1) {
-        fprintf(stderr, "%s: no data in file %s\n", __func__, imatrix_file.c_str());
-        gguf_free(ctx_gguf);
-        ggml_free(ctx);
+    common_imatrix loaded;
+    if (!common_imatrix_load(imatrix_file, loaded)) {
+        fprintf(stderr, "%s: failed to load imatrix from '%s'\n", __func__, imatrix_file.c_str());
        exit(1);
    }

-    const int dataset_idx     = gguf_find_key(ctx_gguf, LLM_KV_IMATRIX_DATASETS);
-    const int chunk_count_idx = gguf_find_key(ctx_gguf, LLM_KV_IMATRIX_CHUNK_COUNT);
-    const int chunk_size_idx  = gguf_find_key(ctx_gguf, LLM_KV_IMATRIX_CHUNK_SIZE);
-    if (dataset_idx < 0 || chunk_count_idx < 0 || chunk_size_idx < 0) {
+    if (!loaded.is_legacy && !loaded.has_metadata) {
        fprintf(stderr, "%s: missing imatrix metadata in file %s\n", __func__, imatrix_file.c_str());
-        gguf_free(ctx_gguf);
-        ggml_free(ctx);
        exit(1);
    }

-    const uint32_t chunk_size = gguf_get_val_u32(ctx_gguf, chunk_size_idx);
-
-    const std::string sums_suffix{ ".in_sum2" };
-    const std::string counts_suffix{ ".counts" };
-
-    // Using an ordered map to get a deterministic iteration order.
-    std::map<std::string, std::pair<struct ggml_tensor *, struct ggml_tensor *>> sums_counts_for;
-
-    for (struct ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
-        std::string name = cur->name;
-
-        if (name.empty()) { continue; }
-
-        if (string_remove_suffix(name, sums_suffix)) {
-            // in_sum2
-            sums_counts_for[std::move(name)].first = cur;
-        } else if (string_remove_suffix(name, counts_suffix)) {
-            // counts
-            sums_counts_for[std::move(name)].second = cur;
-        } else {
-            // ignore other tensors
-        }
-    }
-
-    for (const auto & sc : sums_counts_for) {
-        const        std::string & name   = sc.first;
-        const struct ggml_tensor * sums   = sc.second.first;
-        const struct ggml_tensor * counts = sc.second.second;
-
-        if (!sums || !counts) {
-            fprintf(stderr, "%s: mismatched sums and counts for %s\n", __func__, name.c_str());
-            gguf_free(ctx_gguf);
-            ggml_free(ctx);
-            exit(1);
-        }
-
-        const int64_t ne0 = sums->ne[0];
-        const int64_t ne1 = sums->ne[1];
-
+    for (const auto & [name, entry] : loaded.entries) {
        auto & e = imatrix_data[name];
-        e.resize(ggml_nelements(sums));
-        float max_count = 0.0f;
-        for (int64_t j = 0; j < ne1; ++j) {
-            const float count = ((const float *) counts->data)[j];
-            if (count > 0.0f) {
-                for (int64_t i = 0; i < ne0; ++i) {
-                    e[j*ne0 + i] = ((const float *) sums->data)[j*ne0 + i] / count;
+        e.resize(entry.sums.size());
+
+        if (!loaded.is_legacy) {
+            // GGUF format: normalize by per-expert counts
+            const int64_t ncounts = entry.counts.size();
+            const int64_t ne0     = (int64_t) entry.sums.size() / ncounts;
+
+            for (int64_t j = 0; j < ncounts; ++j) {
+                const float count = (float) entry.counts[j];
+                if (count > 0.0f) {
+                    for (int64_t i = 0; i < ne0; ++i) {
+                        e[j*ne0 + i] = entry.sums[j*ne0 + i] / count;
+                    }
+                } else { // partial imatrix data: this row never got any input during calibration
+                    for (int64_t i = 0; i < ne0; ++i) {
+                        e[j*ne0 + i] = 1;
+                    }
+                }
+            }
+
+            if (getenv("LLAMA_TRACE")) {
+                float max_count = 0.0f;
+                for (int64_t j = 0; j < ncounts; ++j) {
+                    const float count = (float) entry.counts[j];
+                    if (count > max_count) {
+                        max_count = count;
+                    }
+                }
+                printf("%s: loaded data (size = %6d, n_tokens = %6d, n_chunks = %6d) for '%s'\n",
+                       __func__, int(e.size()), int(max_count), int(max_count / loaded.chunk_size), name.c_str());
+            }
+        } else {
+            // Legacy format: sums store (raw/count)*ncall, so divide by ncall to recover the mean
+            const int64_t ncall = entry.counts.empty() ? 0 : entry.counts[0];
+            if (ncall > 0) {
+                for (size_t i = 0; i < entry.sums.size(); ++i) {
+                    e[i] = entry.sums[i] / ncall;
                }
            } else {
-                // Partial imatrix data, this tensor never got any input during calibration
-                for (int64_t i = 0; i < ne0; ++i) {
-                    e[j*ne0 + i] = 1;
+                for (size_t i = 0; i < entry.sums.size(); ++i) {
+                    e[i] = entry.sums[i];
                }
            }
-            if (count > max_count) {
-                max_count = count;
+
+            if (getenv("LLAMA_TRACE")) {
+                printf("%s: loaded data (size = %6d, ncall = %6d) for '%s'\n",
+                       __func__, int(e.size()), int(ncall), name.c_str());
            }
        }
-        if (getenv("LLAMA_TRACE")) {
-            printf("%s: loaded data (size = %6d, n_tokens = %6d, n_chunks = %6d) for '%s'\n", __func__, int(e.size()), int(max_count), int(max_count / chunk_size), name.c_str());
+    }
+
+    imatrix_datasets = std::move(loaded.datasets);
+
+    if (!imatrix_datasets.empty()) {
+        printf("%s: imatrix datasets=['%s'", __func__, imatrix_datasets[0].c_str());
+        for (size_t i = 1; i < imatrix_datasets.size(); ++i) {
+            printf(", '%s'", imatrix_datasets[i].c_str());
        }
+        printf("]\n");
    }

-    int m_last_chunk = gguf_get_val_u32(ctx_gguf, chunk_count_idx);
+    printf("%s: loaded %d importance matrix entries from %s computed on %d chunks\n", __func__, int(imatrix_data.size()), imatrix_file.c_str(), loaded.chunk_count);

-    int64_t n_datasets = gguf_get_arr_n(ctx_gguf, dataset_idx);
-    imatrix_datasets.reserve(n_datasets);
-    for (int64_t i = 0; i < n_datasets; ++i) {
-        imatrix_datasets.push_back(gguf_get_arr_str(ctx_gguf, dataset_idx, i));
-    }
-    printf("%s: imatrix datasets=['%s'", __func__, imatrix_datasets[0].c_str());
-    for (size_t i = 1; i < imatrix_datasets.size(); ++i) {
-        printf(", '%s'", imatrix_datasets[i].c_str());
-    }
-    printf("]\n");
-
-    printf("%s: loaded %d importance matrix entries from %s computed on %d chunks\n", __func__, int(imatrix_data.size()), imatrix_file.c_str(), m_last_chunk);
-
-    gguf_free(ctx_gguf);
-    ggml_free(ctx);
-
-    return m_last_chunk;
+    return loaded.chunk_count;
 }

 static int prepare_imatrix(const std::string & imatrix_file,
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2512,7 +2512,7 @@ private:
                                llama_memory_seq_pos_max(llama_get_memory(ctx_tgt), slot.id));

                        if (use_ckpt_dft) {
-                            slot.spec_ckpt.update_dft(ctx_dft.get(), slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                            slot.spec_ckpt.update_dft(ctx_dft.get(), slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
                        }

                        slot.spec_prompt = slot.prompt.tokens.get_text_tokens();
@@ -2551,7 +2551,7 @@ private:

            if (ctx_dft) {
                if (use_ckpt_dft) {
-                    ckpt.load_dft(ctx_dft.get(), slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                    ckpt.load_dft(ctx_dft.get(), slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
                }

                common_context_seq_rm(ctx_dft.get(), slot.id, ckpt.pos_max + 1, -1);
@@ -2568,7 +2568,7 @@ private:
                if (use_ckpt_tgt) {
                    //const int64_t t_start = ggml_time_us();

-                    ckpt.update_tgt(ctx_tgt, slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                    ckpt.update_tgt(ctx_tgt, slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);

                    //const int64_t t_total = ggml_time_us() - t_start;
                    //printf("checkpoint total: %f ms\n", t_total / 1000.0);
@@ -2580,7 +2580,7 @@ private:
                }

                if (use_ckpt_dft) {
-                    ckpt.update_dft(ctx_dft.get(), slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                    ckpt.update_dft(ctx_dft.get(), slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
                }
            }
        }
@@ -3447,13 +3447,13 @@ private:
                            SLT_DBG(slot, "restoring speculative checkpoint (pos_min = %d, pos_max = %d, size = %zu)\n", ckpt.pos_min, ckpt.pos_max, ckpt.size());

                            {
-                                ckpt.load_tgt(slot.ctx_tgt, slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                                ckpt.load_tgt(slot.ctx_tgt, slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);

                                common_context_seq_rm(slot.ctx_tgt, slot.id, ckpt.pos_max + 1, -1);
                            }

                            if (slot.ctx_dft) {
-                                ckpt.load_dft(slot.ctx_dft, slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
+                                ckpt.load_dft(slot.ctx_dft, slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);

                                common_context_seq_rm(slot.ctx_dft, slot.id, ckpt.pos_max + 1, -1);
                            }
--- a/tools/ui/package-lock.json
+++ b/tools/ui/package-lock.json
--- a/tools/ui/package.json
+++ b/tools/ui/package.json
@@ -23,75 +23,77 @@
 		"cleanup": "rm -rf .svelte-kit build node_modules test-results"
 	},
 	"devDependencies": {
-		"@chromatic-com/storybook": "^5.0.0",
-		"@eslint/compat": "^1.2.5",
-		"@eslint/js": "^9.18.0",
-		"@internationalized/date": "^3.10.1",
-		"@lucide/svelte": "^0.515.0",
-		"@playwright/test": "^1.49.1",
-		"@storybook/addon-a11y": "^10.2.4",
-		"@storybook/addon-docs": "^10.2.4",
-		"@storybook/addon-svelte-csf": "^5.0.10",
-		"@storybook/addon-vitest": "^10.2.4",
-		"@storybook/sveltekit": "^10.2.4",
-		"@sveltejs/adapter-static": "^3.0.10",
-		"@sveltejs/kit": "^2.48.4",
-		"@sveltejs/vite-plugin-svelte": "^6.2.1",
-		"@tailwindcss/forms": "^0.5.9",
-		"@tailwindcss/typography": "^0.5.15",
-		"@tailwindcss/vite": "^4.0.0",
+		"@chromatic-com/storybook": "5.0.0",
+		"@eslint/compat": "1.4.1",
+		"@eslint/js": "9.39.2",
+		"@internationalized/date": "3.10.1",
+		"@lucide/svelte": "0.515.0",
+		"@modelcontextprotocol/sdk": "1.26.0",
+		"@playwright/test": "1.56.1",
+		"@storybook/addon-a11y": "10.2.4",
+		"@storybook/addon-docs": "10.2.4",
+		"@storybook/addon-svelte-csf": "5.0.10",
+		"@storybook/addon-vitest": "10.2.4",
+		"@storybook/sveltekit": "10.2.4",
+		"@sveltejs/adapter-static": "3.0.10",
+		"@sveltejs/kit": "2.60.1",
+		"@sveltejs/vite-plugin-svelte": "6.2.1",
+		"@tailwindcss/forms": "0.5.10",
+		"@tailwindcss/typography": "0.5.16",
+		"@tailwindcss/vite": "4.1.11",
 		"@types/node": "^24",
-		"@vitest/browser": "^3.2.3",
-		"@vitest/coverage-v8": "^3.2.3",
-		"bits-ui": "^2.14.4",
-		"clsx": "^2.1.1",
-		"dexie": "^4.0.11",
-		"eslint": "^9.18.0",
-		"eslint-config-prettier": "^10.0.1",
-		"eslint-plugin-storybook": "^10.2.4",
-		"eslint-plugin-svelte": "^3.0.0",
-		"globals": "^16.0.0",
-		"http-server": "^14.1.1",
-		"mdast": "^3.0.0",
-		"mdsvex": "^0.12.3",
-		"playwright": "^1.56.1",
-		"prettier": "^3.4.2",
-		"prettier-plugin-svelte": "^3.3.3",
-		"prettier-plugin-tailwindcss": "^0.6.11",
-		"rehype-katex": "^7.0.1",
-		"remark-math": "^6.0.0",
-		"sass": "^1.93.3",
-		"storybook": "^10.2.4",
-		"svelte": "^5.38.2",
-		"svelte-check": "^4.0.0",
-		"tailwind-merge": "^3.3.1",
-		"tailwind-variants": "^3.2.2",
-		"tailwindcss": "^4.0.0",
-		"tw-animate-css": "^1.3.5",
-		"typescript": "^5.0.0",
-		"typescript-eslint": "^8.20.0",
-		"unified": "^11.0.5",
-		"uuid": "^13.0.0",
-		"vite": "^7.2.2",
-		"vite-plugin-devtools-json": "^0.2.0",
-		"vitest": "^3.2.3",
-		"vitest-browser-svelte": "^0.1.0"
+		"@vitest/browser": "4.1.8",
+		"@vitest/browser-playwright": "4.1.8",
+		"@vitest/coverage-v8": "4.1.8",
+		"bits-ui": "2.18.1",
+		"clsx": "2.1.1",
+		"dexie": "4.0.11",
+		"eslint": "9.39.2",
+		"eslint-config-prettier": "10.1.8",
+		"eslint-plugin-storybook": "10.2.4",
+		"eslint-plugin-svelte": "3.15.0",
+		"globals": "16.3.0",
+		"highlight.js": "11.11.1",
+		"http-server": "14.1.1",
+		"mdast": "3.0.0",
+		"mdsvex": "0.12.6",
+		"mermaid": "11.15.0",
+		"mode-watcher": "1.1.0",
+		"pdfjs-dist": "5.4.54",
+		"playwright": "1.56.1",
+		"prettier": "3.6.2",
+		"prettier-plugin-svelte": "3.4.0",
+		"prettier-plugin-tailwindcss": "0.6.14",
+		"rehype-highlight": "7.0.2",
+		"rehype-katex": "7.0.1",
+		"rehype-stringify": "10.0.1",
+		"remark": "15.0.1",
+		"remark-breaks": "4.0.0",
+		"remark-gfm": "4.0.1",
+		"remark-html": "16.0.1",
+		"remark-math": "6.0.0",
+		"remark-rehype": "11.1.2",
+		"sass": "1.93.3",
+		"storybook": "10.3.3",
+		"svelte": "5.55.7",
+		"svelte-check": "4.3.0",
+		"svelte-sonner": "1.0.5",
+		"tailwind-merge": "3.3.1",
+		"tailwind-variants": "3.2.2",
+		"tailwindcss": "4.1.11",
+		"tw-animate-css": "1.3.5",
+		"typescript": "5.8.3",
+		"typescript-eslint": "8.56.0",
+		"unified": "11.0.5",
+		"unist-util-visit": "5.0.0",
+		"uuid": "13.0.2",
+		"vite": "7.3.2",
+		"vite-plugin-devtools-json": "0.2.1",
+		"vitest": "4.1.8",
+		"vitest-browser-svelte": "2.1.1",
+		"zod": "4.2.1"
 	},
-	"dependencies": {
-		"@modelcontextprotocol/sdk": "^1.25.1",
-		"highlight.js": "^11.11.1",
-		"mermaid": "^11.15.0",
-		"mode-watcher": "^1.1.0",
-		"pdfjs-dist": "^5.4.54",
-		"rehype-highlight": "^7.0.2",
-		"rehype-stringify": "^10.0.1",
-		"remark": "^15.0.1",
-		"remark-breaks": "^4.0.0",
-		"remark-gfm": "^4.0.1",
-		"remark-html": "^16.0.1",
-		"remark-rehype": "^11.1.2",
-		"svelte-sonner": "^1.0.5",
-		"unist-util-visit": "^5.0.0",
-		"zod": "^4.2.1"
+	"overrides": {
+		"cookie": "1.1.1"
 	}
 }
--- a/tools/ui/src/lib/components/app/actions/ActionIcon.svelte
+++ b/tools/ui/src/lib/components/app/actions/ActionIcon.svelte
@@ -35,23 +35,27 @@

 <Tooltip.Root>
 	<Tooltip.Trigger>
-		<Button
-			{variant}
-			{size}
-			{disabled}
-			onclick={(e: MouseEvent) => {
-				if (stopPropagationOnClick) e.stopPropagation();
+		<!-- prevent another nested button element -->
+		{#snippet child({ props })}
+			<Button
+				{...props}
+				{variant}
+				{size}
+				{disabled}
+				onclick={(e: MouseEvent) => {
+					if (stopPropagationOnClick) e.stopPropagation();

-				onclick?.(e);
-			}}
-			class="h-6 w-6 p-0 {className} flex hover:bg-transparent data-[state=open]:bg-transparent!"
-			aria-label={ariaLabel || tooltip}
-		>
-			{#if icon}
-				{@const IconComponent = icon}
-				<IconComponent class={iconSize} />
-			{/if}
-		</Button>
+					onclick?.(e);
+				}}
+				class="h-6 w-6 p-0 {className} flex hover:bg-transparent data-[state=open]:bg-transparent!"
+				aria-label={ariaLabel || tooltip}
+			>
+				{#if icon}
+					{@const IconComponent = icon}
+					<IconComponent class={iconSize} />
+				{/if}
+			</Button>
+		{/snippet}
 	</Tooltip.Trigger>

 	<Tooltip.Content side={tooltipSide}>
--- a/tools/ui/src/lib/components/app/badges/BadgeInfo.svelte
+++ b/tools/ui/src/lib/components/app/badges/BadgeInfo.svelte
@@ -1,22 +1,22 @@
 <script lang="ts">
 	import type { Snippet } from 'svelte';
+	import type { HTMLButtonAttributes } from 'svelte/elements';

-	interface Props {
+	interface Props extends HTMLButtonAttributes {
 		children: Snippet;
 		class?: string;
 		icon?: Snippet;
-		onclick?: () => void;
 	}

-	let { children, class: className = '', icon, onclick }: Props = $props();
+	let { children, class: className = '', icon, ...rest }: Props = $props();
 </script>

 <button
+	{...rest}
 	class={[
 		'inline-flex cursor-pointer items-center gap-1 rounded-sm bg-muted-foreground/15 px-1.5 py-0.75',
 		className
 	]}
-	{onclick}
 >
 	{#if icon}
 		{@render icon()}
--- a/tools/ui/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemThumbnailFile.svelte
+++ b/tools/ui/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemThumbnailFile.svelte
@@ -97,7 +97,9 @@
 {/snippet}

 {#snippet removeButton()}
-	<div class="absolute top-2 right-2 opacity-0 transition-opacity group-hover:opacity-100">
+	<div
+		class="absolute top-2 right-2 opacity-0 transition-opacity group-focus-within:opacity-100 group-hover:opacity-100"
+	>
 		<ActionIcon icon={X} tooltip="Remove" stopPropagationOnClick onclick={() => onRemove?.(id)} />
 	</div>
 {/snippet}
--- a/tools/ui/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemThumbnailImage.svelte
+++ b/tools/ui/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemThumbnailImage.svelte
@@ -51,7 +51,7 @@

 	{#if !readonly}
 		<div
-			class="absolute top-1 right-1 flex items-center justify-center opacity-0 transition-opacity group-hover:opacity-100"
+			class="absolute top-1 right-1 flex items-center justify-center opacity-0 transition-opacity group-focus-within:opacity-100 group-hover:opacity-100"
 		>
 			<ActionIcon
 				class="text-white"
--- a/tools/ui/src/lib/components/app/chat/ChatMessages/ChatMessageAgenticContent.svelte
+++ b/tools/ui/src/lib/components/app/chat/ChatMessages/ChatMessageAgenticContent.svelte
@@ -31,7 +31,8 @@
 		agenticPendingPermissionRequest,
 		agenticResolvePermission,
 		agenticPendingContinueRequest,
-		agenticResolveContinue
+		agenticResolveContinue,
+		agenticLastError
 	} from '$lib/stores/agentic.svelte';
 	import { config } from '$lib/stores/settings.svelte';

@@ -56,6 +57,10 @@
 	const showToolCallInProgress = $derived(config().showToolCallInProgress as boolean);
 	const showThoughtInProgress = $derived(config().showThoughtInProgress as boolean);

+	const hasReasoningError = $derived(
+		isLastAssistantMessage ? !!agenticLastError(message.convId) : false
+	);
+
 	let permissionDismissed = $state(false);

 	const pendingPermission = $derived(
@@ -293,11 +298,21 @@
 			</div>
 		</CollapsibleContentBlock>
 	{:else if section.type === AgenticSectionType.REASONING}
+		{@const reasoningSubtitle = section.wasInterrupted
+			? hasReasoningError
+				? 'Error'
+				: 'Cancelled'
+			: isStreaming
+				? ''
+				: undefined}
+
 		<CollapsibleContentBlock
 			open={isExpanded(index, section)}
 			class="my-2"
 			icon={Brain}
 			title="Reasoning"
+			subtitle={reasoningSubtitle}
+			rawContent={section.content}
 			onToggle={() => toggleExpanded(index, section)}
 		>
 			<div class="pt-3">
@@ -308,7 +323,7 @@
 		</CollapsibleContentBlock>
 	{:else if section.type === AgenticSectionType.REASONING_PENDING}
 		{@const reasoningTitle = isStreaming ? 'Reasoning...' : 'Reasoning'}
-		{@const reasoningSubtitle = isStreaming ? '' : 'incomplete'}
+		{@const reasoningSubtitle = isStreaming ? '' : hasReasoningError ? 'Error' : 'Cancelled'}

 		<CollapsibleContentBlock
 			open={isExpanded(index, section)}
@@ -316,6 +331,7 @@
 			icon={Brain}
 			title={reasoningTitle}
 			subtitle={reasoningSubtitle}
+			rawContent={section.content}
 			{isStreaming}
 			onToggle={() => toggleExpanded(index, section)}
 		>
--- a/tools/ui/src/lib/components/app/chat/ChatMessages/ChatMessageStatistics/ChatMessageStatistics.svelte
+++ b/tools/ui/src/lib/components/app/chat/ChatMessages/ChatMessageStatistics/ChatMessageStatistics.svelte
@@ -6,6 +6,7 @@
 	import type { ChatMessageAgenticTimings } from '$lib/types/chat';
 	import { formatPerformanceTime } from '$lib/utils';
 	import { MS_PER_SECOND, DEFAULT_PERFORMANCE_TIME } from '$lib/constants';
+	import type { Component } from 'svelte';

 	interface Props {
 		predictedTokens?: number;
@@ -114,101 +115,79 @@
 	let formattedAgenticTotalTime = $derived(formatPerformanceTime(agenticTotalTimeMs));
 </script>

+{#snippet viewButton(opts: {
+	view: ChatMessageStatsView;
+	icon: Component;
+	label: string;
+	tooltipText: string;
+	disabled?: boolean;
+})}
+	{@const IconComponent = opts.icon}
+	<Tooltip.Root>
+		<Tooltip.Trigger>
+			<!-- prevent another nested button element -->
+			{#snippet child({ props })}
+				<button
+					{...props}
+					type="button"
+					class="inline-flex h-5 w-5 items-center justify-center rounded-sm transition-colors {activeView ===
+					opts.view
+						? 'bg-background text-foreground shadow-sm'
+						: opts.disabled
+							? 'cursor-not-allowed opacity-40'
+							: 'hover:text-foreground'}"
+					onclick={() => !opts.disabled && (activeView = opts.view)}
+					disabled={opts.disabled}
+				>
+					<IconComponent class="h-3 w-3" />
+
+					<span class="sr-only">{opts.label}</span>
+				</button>
+			{/snippet}
+		</Tooltip.Trigger>
+
+		<Tooltip.Content>
+			<p>{opts.tooltipText}</p>
+		</Tooltip.Content>
+	</Tooltip.Root>
+{/snippet}
+
 <div class="inline-flex items-center text-xs text-muted-foreground">
 	<div class="inline-flex items-center rounded-sm bg-muted-foreground/15 p-0.5">
 		{#if hasPromptStats || isLive}
-			<Tooltip.Root>
-				<Tooltip.Trigger>
-					<button
-						type="button"
-						class="inline-flex h-5 w-5 items-center justify-center rounded-sm transition-colors {activeView ===
-						ChatMessageStatsView.READING
-							? 'bg-background text-foreground shadow-sm'
-							: 'hover:text-foreground'}"
-						onclick={() => (activeView = ChatMessageStatsView.READING)}
-					>
-						<BookOpenText class="h-3 w-3" />
-
-						<span class="sr-only">Reading</span>
-					</button>
-				</Tooltip.Trigger>
-
-				<Tooltip.Content>
-					<p>Reading (prompt processing)</p>
-				</Tooltip.Content>
-			</Tooltip.Root>
+			{@render viewButton({
+				view: ChatMessageStatsView.READING,
+				icon: BookOpenText,
+				label: 'Reading',
+				tooltipText: 'Reading (prompt processing)'
+			})}
 		{/if}
-		<Tooltip.Root>
-			<Tooltip.Trigger>
-				<button
-					type="button"
-					class="inline-flex h-5 w-5 items-center justify-center rounded-sm transition-colors {activeView ===
-					ChatMessageStatsView.GENERATION
-						? 'bg-background text-foreground shadow-sm'
-						: isGenerationDisabled
-							? 'cursor-not-allowed opacity-40'
-							: 'hover:text-foreground'}"
-					onclick={() => !isGenerationDisabled && (activeView = ChatMessageStatsView.GENERATION)}
-					disabled={isGenerationDisabled}
-				>
-					<Sparkles class="h-3 w-3" />

-					<span class="sr-only">Generation</span>
-				</button>
-			</Tooltip.Trigger>
-
-			<Tooltip.Content>
-				<p>
-					{isGenerationDisabled
-						? 'Generation (waiting for tokens...)'
-						: 'Generation (token output)'}
-				</p>
-			</Tooltip.Content>
-		</Tooltip.Root>
+		{@render viewButton({
+			view: ChatMessageStatsView.GENERATION,
+			icon: Sparkles,
+			label: 'Generation',
+			tooltipText: isGenerationDisabled
+				? 'Generation (waiting for tokens...)'
+				: 'Generation (token output)',
+			disabled: isGenerationDisabled
+		})}

 		{#if hasAgenticStats}
-			<Tooltip.Root>
-				<Tooltip.Trigger>
-					<button
-						type="button"
-						class="inline-flex h-5 w-5 items-center justify-center rounded-sm transition-colors {activeView ===
-						ChatMessageStatsView.TOOLS
-							? 'bg-background text-foreground shadow-sm'
-							: 'hover:text-foreground'}"
-						onclick={() => (activeView = ChatMessageStatsView.TOOLS)}
-					>
-						<Wrench class="h-3 w-3" />
-
-						<span class="sr-only">Tools</span>
-					</button>
-				</Tooltip.Trigger>
-
-				<Tooltip.Content>
-					<p>Tool calls</p>
-				</Tooltip.Content>
-			</Tooltip.Root>
+			{@render viewButton({
+				view: ChatMessageStatsView.TOOLS,
+				icon: Wrench,
+				label: 'Tools',
+				tooltipText: 'Tool calls'
+			})}

 			{#if !hideSummary}
-				<Tooltip.Root>
-					<Tooltip.Trigger>
-						<button
-							type="button"
-							class="inline-flex h-5 w-5 items-center justify-center rounded-sm transition-colors {activeView ===
-							ChatMessageStatsView.SUMMARY
-								? 'bg-background text-foreground shadow-sm'
-								: 'hover:text-foreground'}"
-							onclick={() => (activeView = ChatMessageStatsView.SUMMARY)}
-						>
-							<Layers class="h-3 w-3" />
-
-							<span class="sr-only">Summary</span>
-						</button>
-					</Tooltip.Trigger>
-
-					<Tooltip.Content>
-						<p>Agentic summary</p>
-					</Tooltip.Content>
-				</Tooltip.Root>
+				{@render viewButton({
+					view: ChatMessageStatsView.SUMMARY,
+					icon: Layers,
+					label: 'Summary',
+					tooltipText: 'Agentic summary'
+				})}
 			{/if}
 		{/if}
 	</div>
--- a/tools/ui/src/lib/components/app/chat/ChatMessages/ChatMessageStatistics/ChatMessageStatisticsBadge.svelte
+++ b/tools/ui/src/lib/components/app/chat/ChatMessages/ChatMessageStatistics/ChatMessageStatisticsBadge.svelte
@@ -21,13 +21,16 @@
 {#if tooltipLabel}
 	<Tooltip.Root>
 		<Tooltip.Trigger>
-			<BadgeInfo class={className} onclick={handleClick}>
-				{#snippet icon()}
-					<IconComponent class="h-3 w-3" />
-				{/snippet}
+			<!-- prevent another nested button element -->
+			{#snippet child({ props })}
+				<BadgeInfo {...props} class={className} onclick={handleClick}>
+					{#snippet icon()}
+						<IconComponent class="h-3 w-3" />
+					{/snippet}

-				{value}
-			</BadgeInfo>
+					{value}
+				</BadgeInfo>
+			{/snippet}
 		</Tooltip.Trigger>
 		<Tooltip.Content>
 			<p>{tooltipLabel}</p>
--- a/tools/ui/src/lib/components/app/chat/ChatScreen/ChatScreenActionScrollDown.svelte
+++ b/tools/ui/src/lib/components/app/chat/ChatScreen/ChatScreenActionScrollDown.svelte
@@ -41,16 +41,13 @@
 	});
 </script>

-<div
-	class="pointer-events-{show
-		? 'auto'
-		: 'none'} relative z-50 mx-auto mb-4 flex max-w-[48rem] justify-center"
->
+<div class="relative z-50 mx-auto mb-4 flex max-w-[48rem] justify-center">
 	<Button
 		onclick={scrollToBottom}
 		variant="secondary"
 		size="icon"
-		class="pointer-events-all absolute h-10 w-10 rounded-full bg-background/80 shadow-lg backdrop-blur-sm transition-all duration-200 hover:bg-muted/80"
+		disabled={!show}
+		class="pointer-events-auto absolute h-10 w-10 rounded-full bg-background/80 shadow-lg backdrop-blur-sm transition-all duration-200 hover:bg-muted/80"
 		style="bottom: {buttonBottom}; transform: translateY({show ? '0' : '2rem'}); opacity: {show
 			? 1
 			: 0};"
--- a/tools/ui/src/lib/components/app/content/CollapsibleContentBlock.svelte
+++ b/tools/ui/src/lib/components/app/content/CollapsibleContentBlock.svelte
@@ -4,6 +4,9 @@
 	import { buttonVariants } from '$lib/components/ui/button/index.js';
 	import { Card } from '$lib/components/ui/card';
 	import { createAutoScrollController } from '$lib/hooks/use-auto-scroll.svelte';
+	import { useThrottle } from '$lib/hooks/use-throttle.svelte';
+	import { formatReasoningPreview } from '$lib/utils';
+	import { config } from '$lib/stores/settings.svelte';
 	import type { Snippet } from 'svelte';
 	import type { Component } from 'svelte';

@@ -14,6 +17,8 @@
 		iconClass?: string;
 		title: string;
 		subtitle?: string;
+		preview?: string;
+		rawContent?: string;
 		isStreaming?: boolean;
 		onToggle?: () => void;
 		children: Snippet;
@@ -26,6 +31,8 @@
 		iconClass = 'h-4 w-4',
 		title,
 		subtitle,
+		preview,
+		rawContent,
 		isStreaming = false,
 		onToggle,
 		children
@@ -33,6 +40,20 @@

 	let contentContainer: HTMLDivElement | undefined = $state();

+	const showThoughtInProgress = $derived(config().showThoughtInProgress as boolean);
+
+	let previewKey = useThrottle(() => rawContent ?? preview ?? '', 500);
+	let displayedPreview = $state('');
+	let displayedOverflow = $state(0);
+
+	$effect(() => {
+		void previewKey.key;
+		const content = rawContent ?? preview ?? '';
+		const result = formatReasoningPreview(content);
+		displayedPreview = result.preview;
+		displayedOverflow = result.overflow;
+	});
+
 	const autoScroll = createAutoScrollController();

 	$effect(() => {
@@ -58,16 +79,31 @@
 	class={className}
 >
 	<Card class="gap-0 border-muted bg-muted/30 py-0">
-		<Collapsible.Trigger class="flex w-full cursor-pointer items-center justify-between p-3">
-			<div class="flex items-center gap-2 text-muted-foreground">
-				{#if IconComponent}
-					<IconComponent class={iconClass} />
-				{/if}
+		<Collapsible.Trigger class="flex w-full cursor-pointer items-start justify-between gap-2 p-3">
+			<div class="flex min-w-0 items-center gap-2">
+				<div class="flex items-center gap-2 text-muted-foreground">
+					{#if IconComponent}
+						<IconComponent class={iconClass} />
+					{/if}

-				<span class="font-mono text-sm font-medium">{title}</span>
+					<span class="font-mono text-sm font-medium">{title}</span>

-				{#if subtitle}
-					<span class="text-xs italic">{subtitle}</span>
+					{#if subtitle}
+						<span class="text-xs italic">{subtitle}</span>
+					{/if}
+				</div>
+
+				{#if displayedPreview && !showThoughtInProgress}
+					<div class="flex min-w-0 items-baseline justify-between gap-2">
+						<div class="w-3/4 truncate text-xs text-muted-foreground/80">
+							{displayedPreview}
+						</div>
+						{#if displayedOverflow > 0}
+							<span class="shrink-0 text-xs text-muted-foreground/60"
+								>{displayedOverflow}+ chars</span
+							>
+						{/if}
+					</div>
 				{/if}
 			</div>

--- a/tools/ui/src/lib/components/app/misc/HorizontalScrollCarousel.svelte
+++ b/tools/ui/src/lib/components/app/misc/HorizontalScrollCarousel.svelte
@@ -55,20 +55,20 @@
 	}

 	$effect(() => {
-		if (scrollContainer) {
-			setTimeout(() => {
-				updateScrollButtons();
-			}, 0);
-		}
+		if (!scrollContainer) return;
+
+		const observer = new ResizeObserver(() => updateScrollButtons());
+		observer.observe(scrollContainer);
+
+		return () => observer.disconnect();
 	});
 </script>

 <div class="relative {className}">
 	<button
-		class="absolute top-1/2 left-4 z-10 flex h-6 w-6 -translate-y-1/2 items-center justify-center rounded-full bg-background/25 shadow-md backdrop-blur-xs transition-opacity hover:bg-background/45 {canScrollLeft
-			? 'opacity-100'
-			: 'pointer-events-none opacity-0'}"
+		class="absolute top-1/2 left-4 z-10 flex h-6 w-6 -translate-y-1/2 items-center justify-center rounded-full bg-background/25 shadow-md backdrop-blur-xs transition-opacity hover:bg-background/45 disabled:pointer-events-none disabled:opacity-0"
 		onclick={scrollLeft}
+		disabled={!canScrollLeft}
 		aria-label="Scroll left"
 	>
 		<ChevronLeft class="h-4 w-4" />
@@ -83,10 +83,9 @@
 	</div>

 	<button
-		class="absolute top-1/2 right-4 z-10 flex h-6 w-6 -translate-y-1/2 items-center justify-center rounded-full bg-background/25 shadow-md backdrop-blur-xs transition-opacity hover:bg-background/45 {canScrollRight
-			? 'opacity-100'
-			: 'pointer-events-none opacity-0'}"
+		class="absolute top-1/2 right-4 z-10 flex h-6 w-6 -translate-y-1/2 items-center justify-center rounded-full bg-background/25 shadow-md backdrop-blur-xs transition-opacity hover:bg-background/45 disabled:pointer-events-none disabled:opacity-0"
 		onclick={scrollRight}
+		disabled={!canScrollRight}
 		aria-label="Scroll right"
 	>
 		<ChevronRight class="h-4 w-4" />
--- a/tools/ui/src/lib/components/app/models/ModelBadge.svelte
+++ b/tools/ui/src/lib/components/app/models/ModelBadge.svelte
@@ -27,8 +27,8 @@
 	let shouldShow = $derived(model && (modelProp !== undefined || isModelMode));
 </script>

-{#snippet badgeContent()}
-	<BadgeInfo class={className} {onclick}>
+{#snippet badgeContent(triggerProps?: Record<string, unknown>)}
+	<BadgeInfo {...triggerProps ?? {}} class={className} {onclick}>
 		{#snippet icon()}
 			<Package class="h-3 w-3" />
 		{/snippet}
@@ -47,7 +47,10 @@
 	{#if showTooltip}
 		<Tooltip.Root>
 			<Tooltip.Trigger>
-				{@render badgeContent()}
+				<!-- prevent another nested button element -->
+				{#snippet child({ props })}
+					{@render badgeContent(props)}
+				{/snippet}
 			</Tooltip.Trigger>

 			<Tooltip.Content>
--- a/tools/ui/src/lib/components/app/models/ModelsSelectorDropdown.svelte
+++ b/tools/ui/src/lib/components/app/models/ModelsSelectorDropdown.svelte
@@ -116,52 +116,54 @@

 		{#if ms.isRouter}
 			<DropdownMenu.Root bind:open={isOpen} onOpenChange={ms.handleOpenChange}>
-				<DropdownMenu.Trigger
-					class={[
-						`inline-flex cursor-pointer items-center gap-1.5 rounded-sm bg-background px-1.5 py-1 text-xs shadow-sm transition hover:bg-muted-foreground/20 focus:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2 disabled:cursor-not-allowed disabled:opacity-60 dark:bg-muted-foreground/15 dark:text-secondary-foreground`,
-						!ms.isCurrentModelInCache
-							? 'bg-red-400/10 !text-red-400 hover:bg-red-400/20 hover:text-red-400'
-							: forceForegroundText
-								? 'text-foreground'
-								: ms.isHighlightedCurrentModelActive
-									? 'text-foreground'
-									: 'text-foreground',
-						isOpen && 'text-foreground',
-						'max-w-[min(calc(100vw-4rem) md:max-w-[min(calc(100cqw-9rem),25rem)]'
-					]}
-					disabled={disabled || ms.updating}
-				>
-					<Package class="h-3.5 w-3.5 shrink-0" />
+				<Tooltip.Root>
+					<Tooltip.Trigger>
+						<!-- prevent another nested button element -->
+						{#snippet child({ props })}
+							<DropdownMenu.Trigger
+								{...props}
+								class={[
+									`inline-grid cursor-pointer grid-cols-[1fr_auto_1fr] items-center gap-1.5 rounded-sm bg-background px-1.5 py-1 text-xs shadow-sm transition hover:bg-muted-foreground/20 focus:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2 disabled:cursor-not-allowed disabled:opacity-60 dark:bg-muted-foreground/15 dark:text-secondary-foreground`,
+									!ms.isCurrentModelInCache
+										? 'bg-red-400/10 !text-red-400 hover:bg-red-400/20 hover:text-red-400'
+										: forceForegroundText
+											? 'text-foreground'
+											: ms.isHighlightedCurrentModelActive
+												? 'text-foreground'
+												: 'text-foreground',
+									isOpen && 'text-foreground',
+									'max-w-[min(calc(100vw-4rem) md:max-w-[min(calc(100cqw-9rem),25rem)]'
+								]}
+								disabled={disabled || ms.updating}
+							>
+								<Package class="h-3.5 w-3.5 shrink-0" />

-					{#if selectedOption}
-						<Tooltip.Root>
-							<Tooltip.Trigger>
-								<!-- prevent another nested button element -->
-								{#snippet child({ props })}
+								{#if selectedOption}
 									<ModelId
 										modelId={selectedOption.model}
 										class="min-w-0 overflow-hidden"
 										hideOrgName={false}
 										hideQuantization
-										{...props}
 									/>
-								{/snippet}
-							</Tooltip.Trigger>
+								{:else}
+									<span class="min-w-0 font-medium">Select model</span>
+								{/if}

-							<Tooltip.Content>
-								<p class="font-mono">{selectedOption.model}</p>
-							</Tooltip.Content>
-						</Tooltip.Root>
-					{:else}
-						<span class="min-w-0 font-medium">Select model</span>
-					{/if}
+								{#if ms.updating || ms.isLoadingModel}
+									<Loader2 class="h-3 w-3.5 shrink-0 animate-spin" />
+								{:else}
+									<ChevronDown class="h-3 w-3.5 shrink-0" />
+								{/if}
+							</DropdownMenu.Trigger>
+						{/snippet}
+					</Tooltip.Trigger>

-					{#if ms.updating || ms.isLoadingModel}
-						<Loader2 class="h-3 w-3.5 shrink-0 animate-spin" />
-					{:else}
-						<ChevronDown class="h-3 w-3.5 shrink-0" />
+					{#if selectedOption}
+						<Tooltip.Content>
+							<p class="font-mono">{selectedOption.model}</p>
+						</Tooltip.Content>
 					{/if}
-				</DropdownMenu.Trigger>
+				</Tooltip.Root>

 				<DropdownMenu.Content
 					align="end"
@@ -234,49 +236,51 @@
 				</DropdownMenu.Content>
 			</DropdownMenu.Root>
 		{:else}
-			<button
-				class={[
-					`inline-flex cursor-pointer items-center gap-1.5 rounded-sm bg-background px-1.5 py-1 text-xs shadow-sm transition hover:bg-muted-foreground/20 focus:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2 disabled:cursor-not-allowed disabled:opacity-60 dark:bg-muted-foreground/15 dark:text-secondary-foreground`,
-					!ms.isCurrentModelInCache
-						? 'bg-red-400/10 !text-red-400 hover:bg-red-400/20 hover:text-red-400'
-						: forceForegroundText
-							? 'text-foreground'
-							: ms.isHighlightedCurrentModelActive
-								? 'text-foreground'
-								: 'text-foreground',
-					isOpen && 'text-foreground'
-				]}
-				style="max-width: min(calc(100cqw - 6.5rem), 32rem)"
-				onclick={() => ms.handleOpenChange(true)}
-				disabled={disabled || ms.updating}
-			>
-				<Package class="h-3.5 w-3.5 shrink-0" />
+			<Tooltip.Root>
+				<Tooltip.Trigger>
+					<!-- prevent another nested button element -->
+					{#snippet child({ props })}
+						<button
+							{...props}
+							class={[
+								`inline-flex cursor-pointer items-center gap-1.5 rounded-sm bg-background px-1.5 py-1 text-xs shadow-sm transition hover:bg-muted-foreground/20 focus:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2 disabled:cursor-not-allowed disabled:opacity-60 dark:bg-muted-foreground/15 dark:text-secondary-foreground`,
+								!ms.isCurrentModelInCache
+									? 'bg-red-400/10 !text-red-400 hover:bg-red-400/20 hover:text-red-400'
+									: forceForegroundText
+										? 'text-foreground'
+										: ms.isHighlightedCurrentModelActive
+											? 'text-foreground'
+											: 'text-foreground',
+								isOpen && 'text-foreground'
+							]}
+							style="max-width: min(calc(100cqw - 6.5rem), 32rem)"
+							onclick={() => ms.handleOpenChange(true)}
+							disabled={disabled || ms.updating}
+						>
+							<Package class="h-3.5 w-3.5 shrink-0" />

-				{#if selectedOption}
-					<Tooltip.Root>
-						<Tooltip.Trigger>
-							<!-- prevent another nested button element -->
-							{#snippet child({ props })}
+							{#if selectedOption}
 								<ModelId
 									modelId={selectedOption.model}
 									class="min-w-0 overflow-hidden"
 									hideOrgName={false}
 									hideQuantization
-									{...props}
 								/>
-							{/snippet}
-						</Tooltip.Trigger>
+							{/if}

-						<Tooltip.Content>
-							<p class="font-mono">{selectedOption.model}</p>
-						</Tooltip.Content>
-					</Tooltip.Root>
-				{/if}
+							{#if ms.updating}
+								<Loader2 class="h-3 w-3.5 shrink-0 animate-spin" />
+							{/if}
+						</button>
+					{/snippet}
+				</Tooltip.Trigger>

-				{#if ms.updating}
-					<Loader2 class="h-3 w-3.5 shrink-0 animate-spin" />
+				{#if selectedOption}
+					<Tooltip.Content>
+						<p class="font-mono">{selectedOption.model}</p>
+					</Tooltip.Content>
 				{/if}
-			</button>
+			</Tooltip.Root>
 		{/if}
 	{/if}
 </div>
--- a/tools/ui/src/lib/components/app/navigation/DropdownMenuActions.svelte
+++ b/tools/ui/src/lib/components/app/navigation/DropdownMenuActions.svelte
@@ -34,24 +34,28 @@
 </script>

 <DropdownMenu.Root bind:open>
-	<DropdownMenu.Trigger
-		class="flex h-6 w-6 cursor-pointer items-center justify-center rounded-md p-0 text-sm font-medium transition-colors hover:bg-accent hover:text-accent-foreground focus:bg-accent focus:text-accent-foreground focus:outline-none disabled:pointer-events-none disabled:opacity-50 data-[state=open]:bg-accent data-[state=open]:text-accent-foreground {triggerClass}"
-		onclick={(e) => e.stopPropagation()}
-	>
-		{#if triggerTooltip}
-			<Tooltip.Root>
-				<Tooltip.Trigger>
+	<Tooltip.Root>
+		<Tooltip.Trigger>
+			<!-- prevent another nested button element -->
+			{#snippet child({ props })}
+				<DropdownMenu.Trigger
+					{...props}
+					class="flex h-6 w-6 cursor-pointer items-center justify-center rounded-md p-0 text-sm font-medium transition-colors hover:bg-accent hover:text-accent-foreground focus:bg-accent focus:text-accent-foreground focus:outline-none disabled:pointer-events-none disabled:opacity-50 data-[state=open]:bg-accent data-[state=open]:text-accent-foreground {triggerClass}"
+					onclick={(e) => e.stopPropagation()}
+				>
 					{@render iconComponent(triggerIcon, 'h-3 w-3')}
-					<span class="sr-only">{triggerTooltip}</span>
-				</Tooltip.Trigger>
-				<Tooltip.Content>
-					<p>{triggerTooltip}</p>
-				</Tooltip.Content>
-			</Tooltip.Root>
-		{:else}
-			{@render iconComponent(triggerIcon, 'h-3 w-3')}
+					{#if triggerTooltip}
+						<span class="sr-only">{triggerTooltip}</span>
+					{/if}
+				</DropdownMenu.Trigger>
+			{/snippet}
+		</Tooltip.Trigger>
+		{#if triggerTooltip}
+			<Tooltip.Content>
+				<p>{triggerTooltip}</p>
+			</Tooltip.Content>
 		{/if}
-	</DropdownMenu.Trigger>
+	</Tooltip.Root>

 	<DropdownMenu.Content {align} class="z-[999999] w-48">
 		{#each actions as action, index (action.label)}
--- a/tools/ui/src/lib/components/app/navigation/SidebarNavigation/SidebarNavigationConversationItem.svelte
+++ b/tools/ui/src/lib/components/app/navigation/SidebarNavigation/SidebarNavigationConversationItem.svelte
@@ -105,6 +105,12 @@
 	onclick={handleSelect}
 	onmouseover={handleMouseOver}
 	onmouseleave={handleMouseLeave}
+	onfocusin={handleMouseOver}
+	onfocusout={(e) => {
+		if (!e.currentTarget.contains(e.relatedTarget as Node | null)) {
+			handleMouseLeave();
+		}
+	}}
 >
 	<div
 		class="flex min-w-0 flex-1 items-center gap-2"
@@ -113,12 +119,16 @@
 		{#if depth > 0}
 			<Tooltip.Root>
 				<Tooltip.Trigger>
-					<a
-						href={RouterService.chat(conversation.forkedFromConversationId)}
-						class="flex shrink-0 items-center text-muted-foreground transition-colors hover:text-foreground"
-					>
-						<GitBranch class="h-3.5 w-3.5" />
-					</a>
+					<!-- prevent another nested button element -->
+					{#snippet child({ props })}
+						<a
+							{...props}
+							href={RouterService.chat(conversation.forkedFromConversationId)}
+							class="flex shrink-0 items-center text-muted-foreground transition-colors hover:text-foreground"
+						>
+							<GitBranch class="h-3.5 w-3.5" />
+						</a>
+					{/snippet}
 				</Tooltip.Trigger>

 				<Tooltip.Content>
@@ -195,7 +205,8 @@
 			opacity: 0;
 		}

-		&:is(:hover) :global([data-slot='dropdown-menu-trigger']) {
+		&:is(:hover) :global([data-slot='dropdown-menu-trigger']),
+		&:focus-within :global([data-slot='dropdown-menu-trigger']) {
 			opacity: 1;
 		}
 		@media (max-width: 768px) {
--- a/tools/ui/src/lib/constants/formatters.ts
+++ b/tools/ui/src/lib/constants/formatters.ts
@@ -6,3 +6,30 @@ export const MEDIUM_DURATION_THRESHOLD = 10;

 /** Default display value when no performance time is available */
 export const DEFAULT_PERFORMANCE_TIME = '0s';
+
+/** Max length before reasoning preview is truncated */
+export const MAX_PREVIEW_LENGTH = 120;
+
+export const STRIP_MARKDOWN_CAPTURE_PATTERNS: [RegExp, string][] = [
+	[/^```(.*)/gm, '$1'],
+	[/(.*)```$/gm, '$1'],
+	[/`([^`]*)`/g, '$1'],
+	[/\*\*(.*?)\*\*/g, '$1'],
+	[/__(.*?)__/g, '$1'],
+	[/\*(.*?)\*/g, '$1'],
+	[/_(.*?)_/g, '$1']
+];
+
+/* eslint-disable no-misleading-character-class */
+export const STRIP_MARKDOWN_INLINE_REGEX = new RegExp(
+	[
+		'<[^>]*>',
+		'^>\\s*',
+		'^#{1,6}\\s+',
+		'^[\\s]*[-*+]\\s+',
+		'^[\\s]*\\d+[.)]\\s+',
+		'[\\u{1F600}-\\u{1F64F}\\u{1F300}-\\u{1F5FF}\\u{1F680}-\\u{1F6FF}\\u{1F1E0}-\\u{1F1FF}\\u{2600}-\\u{26FF}\\u{2700}-\\u{27BF}\\u{FE00}-\\u{FE0F}\\u{1F900}-\\u{1F9FF}\\u{1FA00}-\\u{1FA6F}\\u{1FA70}-\\u{1FAFF}\\u{200D}\\u{20E3}\\u{231A}-\\u{231B}\\u{23E9}-\\u{23F3}\\u{23F8}-\\u{23FA}\\u{25AA}-\\u{25AB}\\u{25B6}\\u{25C0}\\u{25FB}-\\u{25FE}\\u{2934}-\\u{2935}\\u{2B05}-\\u{2B07}\\u{2B1B}-\\u{2B1C}\\u{2B50}\\u{2B55}\\u{3030}\\u{303D}\\u{3297}\\u{3299}]'
+	].join('|'),
+	'gmu'
+);
+/* eslint-enable no-misleading-character-class */
--- a/tools/ui/src/lib/hooks/use-throttle.svelte.ts
+++ b/tools/ui/src/lib/hooks/use-throttle.svelte.ts
@@ -0,0 +1,32 @@
+/**
+ * Creates a reactive throttle key that increments when `getValue()` changes
+ * and the throttle window has elapsed since the last increment.
+ *
+ * Useful for throttling animations that should not fire on every rapid update.
+ *
+ * @param getValue - A reactive getter for the value to watch
+ * @param ms - Throttle window in milliseconds
+ * @returns A reactive number that increments when the throttled value changes
+ */
+export function useThrottle(getValue: () => string | undefined, ms: number) {
+	let key = $state(0);
+	let throttleEnd = $state(0);
+	let lastValue: string | undefined = getValue();
+
+	$effect(() => {
+		const value = getValue();
+		if (value === lastValue) return;
+		const now = Date.now();
+		if (now >= throttleEnd) {
+			lastValue = value;
+			key++;
+			throttleEnd = now + ms;
+		}
+	});
+
+	return {
+		get key() {
+			return key;
+		}
+	};
+}
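
The gating logic of `useThrottle` can be tried outside Svelte. The sketch below is a plain-TypeScript approximation and is not part of the commit: `createThrottleGate` is a hypothetical helper that replaces `$state`/`$effect` with explicit `push()` calls and takes the clock as a parameter instead of calling `Date.now()`. Under those assumptions, it shows that a change arriving inside the throttle window is skipped rather than queued.

```typescript
// Hypothetical helper, not part of the diff: same comparison and window
// check as useThrottle, driven by explicit push() calls and a fake clock.
type ThrottleGate = {
	push: (value: string | undefined) => void;
	readonly key: number;
};

function createThrottleGate(
	initial: string | undefined,
	ms: number,
	now: () => number
): ThrottleGate {
	let key = 0;
	let throttleEnd = 0;
	let lastValue = initial;

	return {
		push(value) {
			if (value === lastValue) return; // unchanged value: nothing to do
			const t = now();
			if (t >= throttleEnd) {
				lastValue = value;
				key++;
				throttleEnd = t + ms;
			}
			// otherwise the window is still open and this change is skipped
		},
		get key() {
			return key;
		}
	};
}

// Usage with a manually advanced clock (milliseconds).
let clock = 1000;
const gate = createThrottleGate('', 500, () => clock);
gate.push('a'); // accepted: key becomes 1, window open until 1500
clock = 1200;
gate.push('ab'); // inside the window: skipped
clock = 1600;
gate.push('abc'); // window elapsed: key becomes 2
```

Because a skipped change is not replayed later, the key only advances again when a further change arrives after the window has closed. The component in this diff reads the current `rawContent ?? preview` inside its own effect, so it renders the latest text whenever the key does advance.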
--- a/tools/ui/src/lib/utils/agentic.ts
+++ b/tools/ui/src/lib/utils/agentic.ts
@@ -18,6 +18,7 @@ export interface AgenticSection {
 	toolArgs?: string;
 	toolResult?: string;
 	toolResultExtras?: DatabaseMessageExtra[];
+	wasInterrupted?: boolean;
 }

 /**
@@ -51,7 +52,8 @@ function deriveSingleTurnSections(
 		const isPending = isStreaming && !hasContentAfterReasoning;
 		sections.push({
 			type: isPending ? AgenticSectionType.REASONING_PENDING : AgenticSectionType.REASONING,
-			content: message.reasoningContent
+			content: message.reasoningContent,
+			wasInterrupted: !isStreaming && !hasContentAfterReasoning
 		});
 	}

--- a/tools/ui/src/lib/utils/formatters.ts
+++ b/tools/ui/src/lib/utils/formatters.ts
@@ -3,7 +3,11 @@ import {
 	SECONDS_PER_MINUTE,
 	SECONDS_PER_HOUR,
 	SHORT_DURATION_THRESHOLD,
-	MEDIUM_DURATION_THRESHOLD
+	MEDIUM_DURATION_THRESHOLD,
+	MAX_PREVIEW_LENGTH,
+	STRIP_MARKDOWN_INLINE_REGEX,
+	STRIP_MARKDOWN_CAPTURE_PATTERNS,
+	NEWLINE_SEPARATOR
 } from '$lib/constants';

 /**
@@ -151,3 +155,33 @@ export function formatAttachmentText(
 	const header = extra ? `${name} (${extra})` : name;
 	return `\n\n--- ${label}: ${header} ---\n${content}`;
 }
+
+export function formatReasoningPreview(content: string): { preview: string; overflow: number } {
+	if (!content) return { preview: '', overflow: 0 };
+
+	const lines = content.split(NEWLINE_SEPARATOR);
+	let lastLine = '';
+
+	for (let i = lines.length - 1; i >= 0; i--) {
+		let cleaned = lines[i].trim();
+		if (!cleaned) continue;
+
+		cleaned = cleaned.replace(STRIP_MARKDOWN_INLINE_REGEX, '');
+		for (const [pattern, replacement] of STRIP_MARKDOWN_CAPTURE_PATTERNS) {
+			cleaned = cleaned.replace(pattern, replacement);
+		}
+
+		if (cleaned.length > 0) {
+			lastLine = cleaned;
+			break;
+		}
+	}
+
+	const fullLength = lastLine.length;
+	const overflow = Math.max(0, fullLength - MAX_PREVIEW_LENGTH);
+	if (fullLength > MAX_PREVIEW_LENGTH) {
+		lastLine = lastLine.slice(0, MAX_PREVIEW_LENGTH) + '...';
+	}
+
+	return { preview: lastLine, overflow };
+}
--- a/tools/ui/src/lib/utils/index.ts
+++ b/tools/ui/src/lib/utils/index.ts
@@ -76,7 +76,8 @@ export {
 	formatJsonPretty,
 	formatTime,
 	formatPerformanceTime,
-	formatAttachmentText
+	formatAttachmentText,
+	formatReasoningPreview
 } from './formatters';

 // IME utilities
--- a/tools/ui/tests/stories/SidebarNavigation.stories.svelte
+++ b/tools/ui/tests/stories/SidebarNavigation.stories.svelte
@@ -58,10 +58,12 @@
 	name="Default"
 	play={async () => {
 		const { conversationsStore } = await import('$lib/stores/conversations.svelte');
-		
-		waitFor(() => setTimeout(() => {
-			conversationsStore.conversations = mockConversations;
-		}, 0));
+
+		waitFor(() =>
+			setTimeout(() => {
+				conversationsStore.conversations = mockConversations;
+			}, 0)
+		);
 	}}
 >
 	<Sidebar.Provider bind:open={sidebarOpen}>
@@ -76,11 +78,13 @@
 	name="SearchActive"
 	play={async ({ userEvent }) => {
 		const { conversationsStore } = await import('$lib/stores/conversations.svelte');
-		
-		waitFor(() => setTimeout(() => {
-			conversationsStore.conversations = mockConversations;
-		}, 0));
-		
+
+		waitFor(() =>
+			setTimeout(() => {
+				conversationsStore.conversations = mockConversations;
+			}, 0)
+		);
+
 		const searchTrigger = screen.getByText('Search');
 		userEvent.click(searchTrigger);
 	}}
--- a/tools/ui/tests/stories/a11y/ActionIcon.a11y.stories.svelte
+++ b/tools/ui/tests/stories/a11y/ActionIcon.a11y.stories.svelte
@@ -0,0 +1,34 @@
+<script module lang="ts">
+	import { defineMeta } from '@storybook/addon-svelte-csf';
+	import { Copy } from '@lucide/svelte';
+	import ActionIcon from '$lib/components/app/actions/ActionIcon.svelte';
+	import { expect } from 'storybook/test';
+
+	const { Story } = defineMeta({
+		title: 'Components/ActionIcon/Accessibility',
+		component: ActionIcon,
+		parameters: {
+			layout: 'centered'
+		},
+		tags: ['!dev']
+	});
+</script>
+
+<Story
+	asChild
+	name="SingleTabStop"
+	play={async ({ canvas, userEvent }) => {
+		const before = await canvas.findByRole('button', { name: 'before' });
+		const target = await canvas.findByRole('button', { name: 'Copy' });
+
+		before.focus();
+		await userEvent.tab();
+
+		await expect(target).toHaveFocus();
+	}}
+>
+	<div>
+		<button type="button">before</button>
+		<ActionIcon icon={Copy} tooltip="Copy" onclick={() => {}} />
+	</div>
+</Story>
--- a/tools/ui/tests/stories/a11y/ChatMessageStatistics.a11y.stories.svelte
+++ b/tools/ui/tests/stories/a11y/ChatMessageStatistics.a11y.stories.svelte
@@ -0,0 +1,50 @@
+<script module lang="ts">
+	import { defineMeta } from '@storybook/addon-svelte-csf';
+	import ChatMessageStatistics from '$lib/components/app/chat/ChatMessages/ChatMessageStatistics/ChatMessageStatistics.svelte';
+	import { expect } from 'storybook/test';
+
+	const { Story } = defineMeta({
+		title: 'Components/ChatMessageStatistics/Accessibility',
+		component: ChatMessageStatistics,
+		parameters: {
+			layout: 'centered'
+		},
+		tags: ['!dev']
+	});
+</script>
+
+<Story
+	name="ViewButtonsSingleTabStop"
+	args={{
+		promptTokens: 100,
+		promptMs: 500,
+		predictedTokens: 200,
+		predictedMs: 1000,
+		agenticTimings: {
+			turns: 1,
+			toolCallsCount: 1,
+			toolsMs: 500,
+			llm: { predicted_n: 200, predicted_ms: 1000, prompt_n: 100, prompt_ms: 500 }
+		},
+		hideSummary: false,
+		isLive: false
+	}}
+	play={async ({ canvas, userEvent }) => {
+		const reading = await canvas.findByRole('button', { name: 'Reading' });
+		const generation = await canvas.findByRole('button', { name: 'Generation' });
+		const tools = await canvas.findByRole('button', { name: 'Tools' });
+		const summary = await canvas.findByRole('button', { name: 'Summary' });
+
+		reading.focus();
+		await expect(reading).toHaveFocus();
+
+		await userEvent.tab();
+		await expect(generation).toHaveFocus();
+
+		await userEvent.tab();
+		await expect(tools).toHaveFocus();
+
+		await userEvent.tab();
+		await expect(summary).toHaveFocus();
+	}}
+/>
--- a/tools/ui/tests/stories/a11y/ChatScreenForm.a11y.stories.svelte
+++ b/tools/ui/tests/stories/a11y/ChatScreenForm.a11y.stories.svelte
--- a/tools/ui/tests/stories/a11y/HorizontalScrollCarousel.a11y.stories.svelte
+++ b/tools/ui/tests/stories/a11y/HorizontalScrollCarousel.a11y.stories.svelte
@@ -0,0 +1,69 @@
+<script module lang="ts">
+	import { defineMeta } from '@storybook/addon-svelte-csf';
+	import HorizontalScrollCarousel from '$lib/components/app/misc/HorizontalScrollCarousel.svelte';
+	import { expect, waitFor } from 'storybook/test';
+
+	const { Story } = defineMeta({
+		title: 'Components/HorizontalScrollCarousel/Accessibility',
+		component: HorizontalScrollCarousel,
+		parameters: {
+			layout: 'centered'
+		},
+		tags: ['!dev']
+	});
+</script>
+
+<Story
+	asChild
+	name="ArrowsNotInTabOrderWhenNotScrollable"
+	play={async ({ canvas, userEvent }) => {
+		const before = await canvas.findByRole('button', { name: 'before' });
+		const after = await canvas.findByRole('button', { name: 'after' });
+		const leftArrow = await canvas.findByRole('button', { name: 'Scroll left' });
+
+		await waitFor(() => {
+			expect(leftArrow).toBeDisabled();
+		});
+
+		before.focus();
+		await userEvent.tab();
+
+		await expect(after).toHaveFocus();
+	}}
+>
+	<div>
+		<button type="button">before</button>
+		<HorizontalScrollCarousel class="w-96">
+			<div class="h-12 w-12 shrink-0 bg-muted"></div>
+			<div class="h-12 w-12 shrink-0 bg-muted"></div>
+		</HorizontalScrollCarousel>
+		<button type="button">after</button>
+	</div>
+</Story>
+
+<Story
+	asChild
+	name="ArrowsInTabOrderWhenScrollable"
+	play={async ({ canvas, userEvent }) => {
+		const before = await canvas.findByRole('button', { name: 'before' });
+		const rightArrow = await canvas.findByRole('button', { name: 'Scroll right' });
+
+		await waitFor(() => {
+			expect(rightArrow).not.toBeDisabled();
+		});
+
+		before.focus();
+		await userEvent.tab();
+
+		await expect(rightArrow).toHaveFocus();
+	}}
+>
+	<div>
+		<button type="button">before</button>
+		<HorizontalScrollCarousel class="w-48">
+			{#each [...Array(20).keys()] as i (i)}
+				<div class="h-12 w-24 shrink-0 bg-muted">{i}</div>
+			{/each}
+		</HorizontalScrollCarousel>
+	</div>
+</Story>
--- a/tools/ui/tests/stories/a11y/SidebarNavigationConversationItem.a11y.stories.svelte
+++ b/tools/ui/tests/stories/a11y/SidebarNavigationConversationItem.a11y.stories.svelte
@@ -0,0 +1,36 @@
+<script module lang="ts">
+	import { defineMeta } from '@storybook/addon-svelte-csf';
+	import SidebarNavigationConversationItem from '$lib/components/app/navigation/SidebarNavigation/SidebarNavigationConversationItem.svelte';
+	import { expect } from 'storybook/test';
+
+	const mockForkedConversation: DatabaseConversation = {
+		id: 'conv-2',
+		name: 'Forked Conversation',
+		lastModified: Date.now(),
+		currNode: 'msg-2',
+		forkedFromConversationId: 'conv-1'
+	};
+
+	const { Story } = defineMeta({
+		title: 'Components/SidebarNavigationConversationItem/Accessibility',
+		component: SidebarNavigationConversationItem,
+		parameters: {
+			layout: 'centered'
+		},
+		tags: ['!dev']
+	});
+</script>
+
+<Story
+	name="ForkIconSingleTabStop"
+	args={{ conversation: mockForkedConversation, depth: 1 }}
+	play={async ({ canvas, userEvent }) => {
+		const row = await canvas.findByRole('button', { name: /Forked Conversation/ });
+		const forkIcon = await canvas.findByRole('link');
+
+		row.focus();
+		await userEvent.tab();
+
+		await expect(forkIcon).toHaveFocus();
+	}}
+/>
--- a/tools/ui/vite.config.ts
+++ b/tools/ui/vite.config.ts
@@ -7,11 +7,23 @@ import { defineConfig, searchForWorkspaceRoot } from 'vite';
 import devtoolsJson from 'vite-plugin-devtools-json';
 import { storybookTest } from '@storybook/addon-vitest/vitest-plugin';
 import { llamaCppBuildPlugin } from './scripts/vite-plugin-llama-cpp-build';
+import { playwright } from '@vitest/browser-playwright';

 const __dirname = dirname(fileURLToPath(import.meta.url));

 const SERVER_ORIGIN = import.meta.env?.VITE_PUBLIC_SERVER_ORIGIN || 'http://localhost:8080';

+// eslint-disable-next-line @typescript-eslint/no-explicit-any
+const browserBaseConfig: any = {
+	enabled: true,
+	provider: playwright({
+		launchOptions: {
+			args: ['--no-sandbox']
+		}
+	}),
+	instances: [{ browser: 'chromium' }]
+};
+
 export default defineConfig({
 	resolve: {
 		alias: {
@@ -33,12 +45,7 @@ export default defineConfig({
 				extends: './vite.config.ts',
 				test: {
 					name: 'client',
-					environment: 'browser',
-					browser: {
-						enabled: true,
-						provider: 'playwright',
-						instances: [{ browser: 'chromium' }]
-					},
+					browser: browserBaseConfig,
 					include: ['tests/client/**/*.svelte.{test,spec}.{js,ts}'],
 					setupFiles: ['./vitest-setup-client.ts']
 				}
@@ -57,13 +64,7 @@ export default defineConfig({
 				extends: './vite.config.ts',
 				test: {
 					name: 'ui',
-					environment: 'browser',
-					browser: {
-						enabled: true,
-						provider: 'playwright',
-						instances: [{ browser: 'chromium', headless: true }]
-					},
-					include: ['tests/stories/**/*.stories.{js,ts,svelte}'],
+					browser: { ...browserBaseConfig, instances: [{ browser: 'chromium', headless: true }] },
 					setupFiles: ['./.storybook/vitest.setup.ts']
 				},
 				plugins: [
Author	SHA1	Message	Date
Oliver Simons	2154a0fdcf	CUDA: enroll mul_mat_vec_q_moe into pdl (#24087 ) * Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW Data collected on a B4500: Before ``` (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8 code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8 explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4 summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6 qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1 translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5 creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2 stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2 long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9 ``` After ``` (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9 code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6 explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8 summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2 qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5 translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4 creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8 stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7 long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7 ``` Server launched with: ``` ➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \ -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \ --spec-type draft-mtp \ --spec-draft-n-max 2 \ -ngl all \ -fa on \ --host 0.0.0.0 \ --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" ``` * LC to overlap with following kernels	2026-06-05 08:37:34 +02:00
Daniel Bevenius	46fa662b1f	ci : build-msys job slimming [no ci] (#24157 ) This PR attempts to slim down the dependencies for build-msys jobs making the same changes that we applied in whisper.cpp to reduce the size of the github actions cache, and should also improve the run time due to fewer dependencies that need to be installed. I realize this is a scheduled job but I think it would still make sense to apply these changes. Refs: https://github.com/ggml-org/whisper.cpp/pull/3858	2026-06-05 07:57:36 +02:00
Mason Milburn	7fe2ae45ab	sycl : port multi-column MMVQ from CUDA backend (#21845 ) mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures. ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.	2026-06-05 08:10:31 +03:00
Georgi Gerganov	7c158fbb4a	server : disable on-device spec checkpoints (#24108 )	2026-06-04 19:30:59 +03:00
Xuan-Son Nguyen	260862b8ca	arg: fix double mtp downloads (#24128 )	2026-06-04 19:23:48 +03:00
viggy	42b2d60e57	webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (#23132) * use child snippets for landing and chat message elements * make ... icon visible in conversation history menu * conversation history forward tab fix * add snippet fix for fork icon in conversation history * focus/keyboard fix for attachment x icon and scroll left/right * formatting * fix scroll down issue * simplify Statistics and pointer events in scrolldown * create storybook tests and move to folder * improve tests to actually assert on element	2026-06-04 17:59:00 +02:00
Bartowski	e7bcf1c3a8	Move duplicated imatrix code into single common imatrix-loader.cpp (#22445 ) * Deduplicate imatrix loading code * Add back LLAMA_TRACE, early exit on quantize missing metadata	2026-06-04 17:45:40 +02:00
Aleksander Grygier	21444c822e	ui: Fixed packages (#24119 ) * chore(ui): pin package versions to currently installed - Update all dependencies and devDependencies to match exactly what's in package-lock.json - This ensures reproducible builds by locking to specific versions rather than semver ranges * chore: Update packages * chore: Move remaining dependencies to devDependencies * fix: Add missing `mermaid` package * chore: Update `cookie` package to `v1.1.1` * chore: Formatting * test: Update test configs	2026-06-04 16:23:08 +02:00
MagicExists	526977068f	ui: added single line reasoning preview (#23601) * webui: added single line reasoning preview. * patch: reduce width slightly for the previewing section * refactor: move formatter constants to the right file * feat: reimplement reasoning preview with throttled dynamic per-line rendering * chore: fix spacing Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: refactor to requested changes * refactor: grouped by capture pattern instead of block-level + inline * ui: fix interrupt state only trigger for 1st reasoning message * chore: make reasoning preview respect showThoughtInProgress setting * chore: newline at EOF Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: thread rawContent so collapsible content can handle compute preview * patch: showThoughtInProgress accidentally blocks rawContent being passed * chore: fix lint * chore: change smoke test --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-06-04 16:09:43 +02:00
forforever73	0dbfa66a1f	return filter to save memory (#24125 ) Co-authored-by: lvyichen <lvyichen@stepfun.com>	2026-06-04 15:56:33 +02:00
Pedro Cuenca	e8023568d0	convert: Fix Gemma 4 Unified conversion (#24118 ) * Fix Gemma 4 Unified conversion * Set audio hidden size to audio_embed_dim	2026-06-04 15:21:38 +02:00