server-parallel : add "--reverse-prompt" + compiler warning fixes

fix json format README
remove trail whitespace
2026-04-23 03:19:41 +02:00 · 2023-10-06 14:32:19 +03:00 · 2023-10-05 15:23:58 -04:00 · 2023-10-05 15:18:47 -04:00 · 2023-10-05 15:12:39 -04:00
29 changed files with 1694 additions and 1394 deletions
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -10,10 +10,10 @@ on:
  push:
    branches:
      - master
-    paths: ['.github/workflows/**', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m']
+    paths: ['.github/workflows/**', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu']
  pull_request:
    types: [opened, synchronize, reopened]
-    paths: ['**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m']
+    paths: ['**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu']

 env:
  BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
@@ -258,7 +258,7 @@ jobs:

    strategy:
      matrix:
-        destination: ['generic/platform=macOS', 'generic/platform=iOS', 'generic/platform=tvOS']
+        destination: ['platform=macOS,name=Any Mac', 'platform=iOS,name=Any iOS Device', 'platform=tvOS,name=Any tvOS Device']

    steps:
      - name: Clone
--- a/.github/workflows/gguf-publish.yml
+++ b/.github/workflows/gguf-publish.yml
@@ -36,9 +36,8 @@ jobs:
        poetry install

    - name: Build package
-      run: cd gguf-py && poetry build
+      run: poetry build
    - name: Publish package
      uses: pypa/gh-action-pypi-publish@release/v1
      with:
        password: ${{ secrets.PYPI_API_TOKEN }}
-        packages-dir: gguf-py/dist
--- a/.gitignore
+++ b/.gitignore
@@ -10,7 +10,6 @@
 *.gcno
 *.gcda
 *.dot
-*.metallib
 .DS_Store
 .build/
 .cache/
--- a/Package.swift
+++ b/Package.swift
@@ -10,18 +10,15 @@ let platforms: [SupportedPlatform]? = [
    .tvOS(.v14)
 ]
 let exclude: [String] = []
-let resources: [Resource] = [
-    .process("ggml-metal.metal")
-]
-let additionalSources: [String] = ["ggml-metal.m"]
+let additionalSources: [String] = ["ggml-metal.m", "ggml-metal.metal"]
 let additionalSettings: [CSetting] = [
    .unsafeFlags(["-fno-objc-arc"]),
+    .define("GGML_SWIFT"),
    .define("GGML_USE_METAL")
 ]
 #else
 let platforms: [SupportedPlatform]? = nil
 let exclude: [String] = ["ggml-metal.metal"]
-let resources: [Resource] = []
 let additionalSources: [String] = []
 let additionalSettings: [CSetting] = []
 #endif
@@ -43,7 +40,6 @@ let package = Package(
                "ggml-alloc.c",
                "k_quants.c",
            ] + additionalSources,
-            resources: resources,
            publicHeadersPath: "spm-headers",
            cSettings: [
                .unsafeFlags(["-Wno-shorten-64-to-32"]),
--- a/README.md
+++ b/README.md
@@ -95,7 +95,6 @@ as the main playground for developing new features for the [ggml](https://github
 - [X] [Aquila-7B](https://huggingface.co/BAAI/Aquila-7B) / [AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
 - [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
 - [X] [Mistral AI v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)

 **Bindings:**

@@ -378,7 +377,7 @@ Building the program with BLAS support may lead to some performance improvements

 - #### cuBLAS

-  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
+  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
  - Using `make`:
    ```bash
    make LLAMA_CUBLAS=1
@@ -614,18 +613,6 @@ For more information, see [https://huggingface.co/docs/transformers/perplexity](
 The perplexity measurements in table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with context length of 512.
 The time per token is measured on a MacBook M1 Pro 32GB RAM using 4 and 8 threads.

-#### How to run
-
-1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
-2. Run `./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
-3. Output:
-```
-perplexity : calculating perplexity over 655 chunks
-24.43 seconds per pass - ETA 4.45 hours
-[1]4.5970,[2]5.1807,[3]6.0382,...
-```
-And after 4.45 hours, you will have the final perplexity.
-
 ### Interactive mode

 If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
@@ -788,6 +775,18 @@ If your issue is with model generation quality, then please at least scan the fo
    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

+#### How to run
+
+1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
+2. Run `./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
+3. Output:
+```
+perplexity : calculating perplexity over 655 chunks
+24.43 seconds per pass - ETA 4.45 hours
+[1]4.5970,[2]5.1807,[3]6.0382,...
+```
+And after 4.45 hours, you will have the final perplexity.
+
 ### Android

 #### Building the Project using Android NDK
--- a/build.zig
+++ b/build.zig
@@ -111,14 +111,12 @@ pub fn build(b: *std.build.Builder) !void {
    const common = make.obj("common", "common/common.cpp");
    const console = make.obj("common", "common/console.cpp");
    const grammar_parser = make.obj("grammar-parser", "common/grammar-parser.cpp");
-    const train = make.obj("train", "common/train.cpp");

    _ = make.exe("main", "examples/main/main.cpp", &.{ ggml, ggml_alloc, llama, common, console, grammar_parser });
    _ = make.exe("quantize", "examples/quantize/quantize.cpp", &.{ ggml, ggml_alloc, llama, common });
    _ = make.exe("perplexity", "examples/perplexity/perplexity.cpp", &.{ ggml, ggml_alloc, llama, common });
    _ = make.exe("embedding", "examples/embedding/embedding.cpp", &.{ ggml, ggml_alloc, llama, common });
-    _ = make.exe("finetune", "examples/finetune/finetune.cpp", &.{ ggml, ggml_alloc, llama, common, train });
-    _ = make.exe("train-text-from-scratch", "examples/train-text-from-scratch/train-text-from-scratch.cpp", &.{ ggml, ggml_alloc, llama, common, train });
+    _ = make.exe("train-text-from-scratch", "examples/train-text-from-scratch/train-text-from-scratch.cpp", &.{ ggml, ggml_alloc, llama, common });

    const server = make.exe("server", "examples/server/server.cpp", &.{ ggml, ggml_alloc, llama, common, grammar_parser });
    if (server.target.isWindows()) {
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -167,10 +167,8 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
                invalid_param = true;
                break;
            }
-            // store the external file name in params
-            params.prompt_file = argv[i];
            std::copy(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>(), back_inserter(params.prompt));
-            if (!params.prompt.empty() && params.prompt.back() == '\n') {
+            if (params.prompt.back() == '\n') {
                params.prompt.pop_back();
            }
        } else if (arg == "-n" || arg == "--n-predict") {
@@ -295,7 +293,7 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
                break;
            }
            std::copy(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>(), back_inserter(params.cfg_negative_prompt));
-            if (!params.cfg_negative_prompt.empty() && params.cfg_negative_prompt.back() == '\n') {
+            if (params.cfg_negative_prompt.back() == '\n') {
                params.cfg_negative_prompt.pop_back();
            }
        } else if (arg == "--cfg-scale") {
@@ -1022,11 +1020,10 @@ llama_token llama_sample_token(
            id = llama_sample_token_mirostat_v2(ctx, &cur_p, mirostat_tau, mirostat_eta, &mirostat_mu);
        } else {
            // Temperature sampling
-            size_t min_keep = std::max(1, params.n_probs);
-            llama_sample_top_k      (ctx, &cur_p, top_k, min_keep);
-            llama_sample_tail_free  (ctx, &cur_p, tfs_z, min_keep);
-            llama_sample_typical    (ctx, &cur_p, typical_p, min_keep);
-            llama_sample_top_p      (ctx, &cur_p, top_p, min_keep);
+            llama_sample_top_k      (ctx, &cur_p, top_k, 1);
+            llama_sample_tail_free  (ctx, &cur_p, tfs_z, 1);
+            llama_sample_typical    (ctx, &cur_p, typical_p, 1);
+            llama_sample_top_p      (ctx, &cur_p, top_p, 1);
            llama_sample_temp(ctx, &cur_p, temp);

            {
--- a/common/common.h
+++ b/common/common.h
@@ -79,7 +79,6 @@ struct gpt_params {
    std::string model_draft       = "";                              // draft model for speculative decoding
    std::string model_alias       = "unknown"; // model alias
    std::string prompt            = "";
-    std::string prompt_file       = "";  // store the external prompt file name
    std::string path_prompt_cache = "";  // path to file for saving/loading prompt eval state
    std::string input_prefix      = "";  // string to prefix user inputs with
    std::string input_suffix      = "";  // string to suffix user inputs with
--- a/convert-persimmon-to-gguf.py
+++ b/convert-persimmon-to-gguf.py
@@ -1,130 +0,0 @@
-import torch
-import os
-from pprint import pprint
-import sys
-import argparse
-from pathlib import Path
-from sentencepiece import SentencePieceProcessor
-if 'NO_LOCAL_GGUF' not in os.environ:
-    sys.path.insert(1, str(Path(__file__).parent / 'gguf-py' / 'gguf'))
-import gguf
-
-def _flatten_dict(dct, tensors, prefix=None):
-    assert isinstance(dct, dict)
-    for key in dct.keys():
-        new_prefix = prefix + '.' + key if prefix is not None else key
-        if isinstance(dct[key], torch.Tensor):
-            tensors[new_prefix] = dct[key]
-        elif isinstance(dct[key], dict):
-            _flatten_dict(dct[key], tensors, new_prefix)
-        else:
-            raise ValueError(type(dct[key]))
-    return None
-
-def _get_sentencepiece_tokenizer_info(dir_model: Path):
-    tokenizer_path = dir_model / 'adept_vocab.model'
-    print('gguf: getting sentencepiece tokenizer from', tokenizer_path)
-    tokenizer = SentencePieceProcessor(str(tokenizer_path))
-    print('gguf: adding tokens')
-    tokens: list[bytes] = []
-    scores: list[float] = []
-    toktypes: list[int] = []
-
-    for i in range(tokenizer.vocab_size()):
-        text: bytes
-        score: float
-
-        piece = tokenizer.id_to_piece(i)
-        text = piece.encode("utf-8")
-        score = tokenizer.get_score(i)
-
-        toktype = 1
-        if tokenizer.is_unknown(i):
-            toktype = 2
-        if tokenizer.is_control(i):
-            toktype = 3
-        if tokenizer.is_unused(i):
-            toktype = 5
-        if tokenizer.is_byte(i):
-            toktype = 6
-
-        tokens.append(text)
-        scores.append(score)
-        toktypes.append(toktype)
-        pass
-    return tokens, scores, toktypes
-
-def main():
-    parser = argparse.ArgumentParser(description="Convert a Persimmon model from Adept (e.g. Persimmon 8b chat) to a GGML compatible file")
-    parser.add_argument("--outfile",             type=Path, help="path to write to; default: based on input")
-    parser.add_argument("--ckpt-path",           type=Path, help="path to persimmon checkpoint .pt file")
-    parser.add_argument("--model-dir",           type=Path, help="directory containing model e.g. 8b_chat_model_release")
-    parser.add_argument("--adept-inference-dir", type=str, help="path to adept-inference code directory")
-    args = parser.parse_args()
-    sys.path.append(str(args.adept_inference_dir))
-    persimmon_model = torch.load(args.ckpt_path)
-    hparams = persimmon_model['args']
-    pprint(hparams)
-    tensors = {}
-    _flatten_dict(persimmon_model['model'], tensors, None)
-
-    arch = gguf.MODEL_ARCH.PERSIMMON
-    gguf_writer = gguf.GGUFWriter(args.outfile, gguf.MODEL_ARCH_NAMES[arch])
-
-    block_count = hparams.num_layers
-    head_count = hparams.num_attention_heads
-    head_count_kv = head_count
-    ctx_length = hparams.seq_length
-    hidden_size = hparams.hidden_size
-
-    gguf_writer.add_name('persimmon-8b-chat')
-    gguf_writer.add_context_length(ctx_length)
-    gguf_writer.add_embedding_length(hidden_size)
-    gguf_writer.add_block_count(block_count)
-    gguf_writer.add_feed_forward_length(hparams.ffn_hidden_size)
-    gguf_writer.add_rope_dimension_count(hidden_size // head_count)
-    gguf_writer.add_head_count(head_count)
-    gguf_writer.add_head_count_kv(head_count_kv)
-    gguf_writer.add_rope_freq_base(hparams.rotary_emb_base)
-    gguf_writer.add_layer_norm_eps(hparams.layernorm_epsilon)
-
-    tokens, scores, toktypes = _get_sentencepiece_tokenizer_info(args.model_dir)
-    gguf_writer.add_tokenizer_model('llama')
-    gguf_writer.add_token_list(tokens)
-    gguf_writer.add_token_scores(scores)
-    gguf_writer.add_token_types(toktypes)
-    gguf_writer.add_bos_token_id(71013)
-    gguf_writer.add_eos_token_id(71013)
-
-    tensor_map = gguf.get_tensor_name_map(arch, block_count)
-    print(tensor_map)
-    for name in tensors.keys():
-        data = tensors[name]
-        if name.endswith(".self_attention.rotary_emb.inv_freq"):
-            continue
-        old_dtype = data.dtype
-        # TODO: FP16 conversion produces garbage outputs. (Q8_0 does not, so..?)
-        data = data.to(torch.float32).squeeze().numpy()
-        new_name = tensor_map.get_name(name, try_suffixes = (".weight", ".bias"))
-        if new_name is None:
-            print("Can not map tensor '" + name + "'")
-            sys.exit()
-        n_dims = len(data.shape)
-        print(new_name + ", n_dims = " + str(n_dims) + ", " + str(old_dtype) + " --> " + str(data.dtype))
-        gguf_writer.add_tensor(new_name, data)
-    print("gguf: write header")
-    gguf_writer.write_header_to_file()
-    print("gguf: write metadata")
-    gguf_writer.write_kv_data_to_file()
-    print("gguf: write tensors")
-    gguf_writer.write_tensors_to_file()
-
-    gguf_writer.close()
-
-    print(f"gguf: model successfully exported to '{args.outfile}'")
-    print("")
-
-
-
-if __name__ == '__main__':
-    main()
--- a/convert-refact-hf-to-gguf.py
+++ b/convert-refact-hf-to-gguf.py
@@ -17,6 +17,33 @@ if "NO_LOCAL_GGUF" not in os.environ:
    sys.path.insert(1, str(Path(__file__).parent / "gguf-py" / "gguf"))
 import gguf

+
+def bytes_to_unicode():
+    # ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py
+    """
+    Returns list of utf-8 byte and a corresponding list of unicode strings.
+    The reversible bpe codes work on unicode strings.
+    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
+    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
+    This is a significant percentage of your normal, say, 32K bpe vocab.
+    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
+    And avoids mapping to whitespace/control characters the bpe code barfs on.
+    """
+    bs = (
+        list(range(ord("!"), ord("~") + 1))
+        + list(range(ord("¡"), ord("¬") + 1))
+        + list(range(ord("®"), ord("ÿ") + 1))
+    )
+    cs = bs[:]
+    n = 0
+    for b in range(2**8):
+        if b not in bs:
+            bs.append(b)
+            cs.append(2**8 + n)
+            n += 1
+    return dict(zip(bs, (chr(n) for n in cs)))
+
+
 def count_model_parts(dir_model: Path) -> int:
    num_parts = 0
    for filename in os.listdir(dir_model):
@@ -126,25 +153,53 @@ tokens: list[bytearray] = []
 scores: list[float] = []
 toktypes: list[int] = []

+tokenizer_json_file = dir_model / "tokenizer.json"
+if not tokenizer_json_file.is_file():
+    print(f"Error: Missing {tokenizer_json_file}", file=sys.stderr)
+    sys.exit(1)
+
 # gpt2 tokenizer
 gguf_writer.add_tokenizer_model("gpt2")

-print("gguf: get gpt2 tokenizer vocab")
+with open(tokenizer_json_file, "r", encoding="utf-8") as f:
+    tokenizer_json = json.load(f)

-# ref: https://github.com/cmp-nct/ggllm.cpp/blob/master/falcon_convert.py
-tokenizer = AutoTokenizer.from_pretrained(dir_model)
+print("gguf: get gpt2 tokenizer vocab")

 # The number of tokens in tokenizer.json can differ from the expected vocab size.
 # This causes downstream issues with mismatched tensor sizes when running the inference
-vocab_size = hparams.get("vocab_size", len(tokenizer.vocab))
-assert max(tokenizer.vocab.values()) < vocab_size
+vocab_size = (
+    hparams["vocab_size"]
+    if "vocab_size" in hparams
+    else len(tokenizer_json["model"]["vocab"])
+)
+
+tokenizer = AutoTokenizer.from_pretrained(dir_model, trust_remote_code=True)

 reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}
+byte_encoder = bytes_to_unicode()
+byte_decoder = {v: k for k, v in byte_encoder.items()}

 for i in range(vocab_size):
-    tokens.append(reverse_vocab[i] if i in reverse_vocab else f"[PAD{i}]")
-    scores.append(0.0) # dummy
-    toktypes.append(gguf.TokenType.NORMAL)
+    if i in reverse_vocab:
+        text = reverse_vocab[i]
+        try:
+            text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
+        except KeyError:
+            text = bytearray()
+            for c in reverse_vocab[i]:
+                if ord(c) < 256:  # single byte character
+                    text.append(byte_decoder[ord(c)])
+                else:  # multibyte special token character
+                    text.extend(c.encode("utf-8"))
+    else:
+        print(f"Key {i} not in tokenizer vocabulary. Padding with an arbitrary token.")
+        pad_token = f"[PAD{i}]".encode("utf8")
+        text = bytearray(pad_token)
+
+    tokens.append(text)
+    scores.append(0.0)  # dymmy
+    toktypes.append(gguf.TokenType.NORMAL)  # dummy

 gguf_writer.add_token_list(tokens)
 gguf_writer.add_token_scores(scores)
--- a/examples/CMakeLists.txt
+++ b/examples/CMakeLists.txt
@@ -35,6 +35,7 @@ else()
    endif()
    if (LLAMA_BUILD_SERVER)
        add_subdirectory(server)
+        add_subdirectory(server-parallel)
    endif()
    add_subdirectory(export-lora)
 endif()
--- a/examples/jeopardy/README.md
+++ b/examples/jeopardy/README.md
@@ -2,7 +2,7 @@

 This is pretty much just a straight port of aigoopy/llm-jeopardy/ with an added graph viewer.

-The jeopardy test can be used to compare the fact knowledge of different models and compare them to each other. This is in contrast to some other tests, which test logical deduction, creativity, writing skills, etc.
+The jeopardy test can be used to compare the fact knowledge of different models and compare them to eachother. This is in contrast to some other tests, which test logical deduction, creativity, writing skills, etc.


 Step 1: Open jeopardy.sh and modify the following:
--- a/examples/parallel/parallel.cpp
+++ b/examples/parallel/parallel.cpp
@@ -10,7 +10,6 @@
 #include <cstdio>
 #include <string>
 #include <vector>
-#include <ctime>

 // trim whitespace from the beginning and end of a string
 static std::string trim(const std::string & str) {
@@ -71,26 +70,6 @@ struct client {
    std::vector<llama_token> tokens_prev;
 };

-static void print_date_time() {
-    std::time_t current_time = std::time(nullptr);
-    std::tm* local_time = std::localtime(&current_time);
-    char buffer[80];
-    strftime(buffer, sizeof(buffer), "%Y-%m-%d %H:%M:%S", local_time);
-
-    printf("\n\033[35mrun parameters as at %s\033[0m\n", buffer);
-}
-
-// Define a split string function to ...
-static std::vector<std::string> split_string(const std::string& input, char delimiter) {
-    std::vector<std::string> tokens;
-    std::istringstream stream(input);
-    std::string token;
-    while (std::getline(stream, token, delimiter)) {
-        tokens.push_back(token);
-    }
-    return tokens;
-}
-
 int main(int argc, char ** argv) {
    srand(1234);

@@ -125,23 +104,6 @@ int main(int argc, char ** argv) {
    params.logits_all = true;
    std::tie(model, ctx) = llama_init_from_gpt_params(params);

-    // load the prompts from an external file if there are any
-    if (params.prompt.empty()) {
-        printf("\n\033[32mNo new questions so proceed with build-in defaults.\033[0m\n");
-    } else {
-        // Output each line of the input params.prompts vector and copy to k_prompts
-        int index = 0;
-        printf("\n\033[32mNow printing the external prompt file %s\033[0m\n\n", params.prompt_file.c_str());
-
-        std::vector<std::string> prompts = split_string(params.prompt, '\n');
-        for (const auto& prompt : prompts) {
-            k_prompts.resize(index + 1);
-            k_prompts[index] = prompt;
-            index++;
-            printf("%3d prompt: %s\n", index, prompt.c_str());
-        }
-    }
-
    fprintf(stderr, "\n\n");
    fflush(stderr);

@@ -167,7 +129,7 @@ int main(int argc, char ** argv) {

    // the max batch size is as large as the context to handle cases where we get very long input prompt from multiple
    // users. regardless of the size, the main loop will chunk the batch into a maximum of params.n_batch tokens at a time
-    llama_batch batch = llama_batch_init(n_ctx, 0);
+    llama_batch batch = llama_batch_init(params.n_ctx, 0);

    int32_t n_total_prompt = 0;
    int32_t n_total_gen    = 0;
@@ -271,7 +233,7 @@ int main(int argc, char ** argv) {
                    client.n_decoded = 0;
                    client.i_batch   = batch.n_tokens - 1;

-                    LOG_TEE("\033[31mClient %3d, seq %4d, started decoding ...\033[0m\n", client.id, client.seq_id);
+                    LOG_TEE("\033[1mClient %3d, seq %4d, started decoding ...\033[0m\n", client.id, client.seq_id);

                    g_seq_id += 1;

@@ -374,8 +336,8 @@ int main(int argc, char ** argv) {

                    const auto t_main_end = ggml_time_us();

-                    LOG_TEE("\033[31mClient %3d, seq %3d/%3d, prompt %4d t, response %4d t, time %5.2f s, speed %5.2f t/s, cache miss %d \033[0m \nInput:    %s\n\033[35mResponse: %s\033[0m\n\n",
-                            client.id, client.seq_id, n_seq, client.n_prompt, client.n_decoded,
+                    LOG_TEE("\033[1mClient %3d, seq %4d, prompt %4d t, response %4d t, time %5.2f s, speed %5.2f t/s, cache miss %d \033[0m \n\nInput:    %s\nResponse: %s\n\n",
+                            client.id, client.seq_id, client.n_prompt, client.n_decoded,
                            (t_main_end - client.t_start_prompt) / 1e6,
                            (double) (client.n_prompt + client.n_decoded) / (t_main_end - client.t_start_prompt) * 1e6,
                            n_cache_miss,
@@ -395,21 +357,13 @@ int main(int argc, char ** argv) {

    const auto t_main_end = ggml_time_us();

-    print_date_time();
-
-    LOG_TEE("\n%s: n_parallel = %d, n_sequences = %d, cont_batching = %d, system tokens = %d\n", __func__, n_clients, n_seq, cont_batching, n_tokens_system);
-    if (params.prompt_file.empty()) {
-        params.prompt_file = "used built-in defaults";
-    }
-    LOG_TEE("External prompt file: \033[32m%s\033[0m\n", params.prompt_file.c_str());
-    LOG_TEE("Model and path used:  \033[32m%s\033[0m\n\n", params.model.c_str());
-
+    LOG_TEE("\n\n");
    LOG_TEE("Total prompt tokens: %6d, speed: %5.2f t/s\n", n_total_prompt, (double) (n_total_prompt              ) / (t_main_end - t_main_start) * 1e6);
    LOG_TEE("Total gen tokens:    %6d, speed: %5.2f t/s\n", n_total_gen,    (double) (n_total_gen                 ) / (t_main_end - t_main_start) * 1e6);
    LOG_TEE("Total speed (AVG):   %6s  speed: %5.2f t/s\n", "",             (double) (n_total_prompt + n_total_gen) / (t_main_end - t_main_start) * 1e6);
    LOG_TEE("Cache misses:        %6d\n", n_cache_miss);

-    LOG_TEE("\n");
+    LOG_TEE("\n\n");

    llama_print_timings(ctx);

--- a/examples/server-parallel/CMakeLists.txt
+++ b/examples/server-parallel/CMakeLists.txt
@@ -0,0 +1,15 @@
+set(TARGET server-parallel)
+include_directories(${CMAKE_CURRENT_SOURCE_DIR})
+add_executable(${TARGET} server.cpp ../server/json.hpp ../server/httplib.h)
+install(TARGETS ${TARGET} RUNTIME)
+target_compile_definitions(${TARGET} PRIVATE
+    SERVER_VERBOSE=$<BOOL:${LLAMA_SERVER_VERBOSE}>
+)
+target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
+if (WIN32)
+    TARGET_LINK_LIBRARIES(${TARGET} PRIVATE ws2_32)
+endif()
+target_compile_features(${TARGET} PRIVATE cxx_std_11)
+if(TARGET BUILD_INFO)
+  add_dependencies(${TARGET} BUILD_INFO)
+endif()
--- a/examples/server-parallel/README.md
+++ b/examples/server-parallel/README.md
@@ -0,0 +1,66 @@
+# llama.cpp/example/server-parallel
+
+This example demonstrates a PoC HTTP API server that handles simulataneus requests. Long prompts are not supported.
+
+## Quick Start
+
+To get started right away, run the following command, making sure to use the correct path for the model you have:
+
+### Unix-based systems (Linux, macOS, etc.):
+
+```bash
+./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching
+```
+
+### Windows:
+
+```powershell
+server-parallel.exe -m models\7B\ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching
+```
+The above command will start a server that by default listens on `127.0.0.1:8080`.
+
+## API Endpoints
+
+-   **GET** `/props`: Return the user and assistant name for generate the prompt.
+
+*Response:*
+```json
+{
+    "user_name": "User:",
+    "assistant_name": "Assistant:"
+}
+```
+
+-   **POST** `/completion`: Given a prompt, it returns the predicted completion, just streaming mode.
+
+    *Options:*
+
+    `temperature`: Adjust the randomness of the generated text (default: 0.1).
+
+    `prompt`: Provide a prompt as a string, It should be a coherent continuation of the system prompt.
+
+    `system_prompt`: Provide a system prompt as a string.
+
+    `anti_prompt`: Provide the name of the user coherent with the system prompt.
+
+    `assistant_name`: Provide the name of the assistant coherent with the system prompt.
+
+*Example request:*
+```json
+{
+    "system_prompt": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nHuman: Hello\nAssistant: Hi, how may I help you?\nHuman:",
+    "anti_prompt": "Human:",
+    "assistant_name": "Assistant:",
+    "prompt": "When is the day of independency of US?",
+    "temperature": 0.2
+}
+```
+
+*Response:*
+```json
+{
+    "content": "<token_str>"
+}
+```
+
+# This example is a Proof of Concept, have some bugs and unexpected behaivors, this not supports long prompts.
--- a/examples/server-parallel/frontend.h
+++ b/examples/server-parallel/frontend.h
@@ -0,0 +1,263 @@
+const char* system_prompt_default =
+R"(Transcript of a never ending dialog, where the User interacts with an Assistant.
+The Assistant is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
+User: Recommend a nice restaurant in the area.
+Assistant: I recommend the restaurant "The Golden Duck". It is a 5 star restaurant with a great view of the city. The food is delicious and the service is excellent. The prices are reasonable and the portions are generous. The restaurant is located at 123 Main Street, New York, NY 10001. The phone number is (212) 555-1234. The hours are Monday through Friday from 11:00 am to 10:00 pm. The restaurant is closed on Saturdays and Sundays.
+User: Who is Richard Feynman?
+Assistant: Richard Feynman was an American physicist who is best known for his work in quantum mechanics and particle physics. He was awarded the Nobel Prize in Physics in 1965 for his contributions to the development of quantum electrodynamics. He was a popular lecturer and author, and he wrote several books, including "Surely You're Joking, Mr. Feynman!" and "What Do You Care What Other People Think?".
+User:)";
+
+const char* index_html_ = R"(
+<!DOCTYPE html>
+<html>
+<head>
+  <title>llama.cpp - server parallel PoC</title>
+  <script src="index.js"></script>
+</head>
+<body>
+  <div style="width: 90%;margin: auto;">
+    <h2>Server parallel - PoC</h2>
+    <form id="myForm">
+      <input type="checkbox" id="system_promt_cb" name="myCheckbox" onchange="toggleSP() ">
+      <label for="system_promt_cb">Use custom system prompt</label>
+      <br>
+      <div id="system_prompt_view" style="display: none;">
+        <textarea id="sp_text" name="systemPrompt" style="width: 100%;height: 4rem;" placeholder="System Prompt"></textarea>
+        <label for="user_name">User name</label>
+        <input type="text" id="user_name" value=""  placeholder="Anti prompt" required>
+        <label for="assistant_name">Assistant name</label>
+        <input type="text" id="assistant_name" value="" placeholder="Assistant:" required>
+        <button type="button" id="btn_reset" onclick="clearSP() " >Clear all</button>
+      </div>
+      <br>
+      <label for="slot_id">Slot ID (-1 load in a idle slot)</label>
+      <input type="number" id="slot_id" value="-1" required>
+      <br>
+      <label for="temperature">Temperature</label>
+      <input type="number" id="temperature" value="0.1" required>
+      <br>
+      <label for="message">Message</label>
+      <input id="message" style="width: 80%;" required>
+      <br><br>
+      <button type="button" id="btn_send" onclick="perform() " >Send</button>
+      <br>
+      <br>
+      <button type="button" id="btn_reset" onclick="resetBtn() " >Reset</button>
+    </form>
+    <div id="conversation_view">
+    </div>
+  </div>
+</body>
+</html>
+)";
+
+const char* index_js_ = R"(
+ let conversation = [];
+ let current_message = -1;
+
+const questions = [
+  "Who is Elon Musk?",
+  "Who is Jeff Bezos?",
+  "How to get a job at google?",
+  "What are you?",
+  "When was born Abraham Lincoln?",
+];
+
+let user_name = "";
+let assistant_name = "";
+
+function toggleSP() {
+    if(document.getElementById("system_promt_cb").checked) {
+        document.getElementById("system_prompt_view").style.display = "block";
+    } else {
+        document.getElementById("system_prompt_view").style.display = "none";
+    }
+}
+
+function clearSP() {
+    document.getElementById("sp_text").value = "";
+    document.getElementById("anti_prompt").value = "";
+    document.getElementById("assistant_name").value = "";
+}
+
+docReady(async () => {
+  document.getElementById("message").value =
+    questions[Math.floor(Math.random() * questions.length)];
+
+    // to keep the same prompt format in all clients
+    const response = await fetch("/props");
+    if (!response.ok) {
+      alert(`HTTP error! Status: ${response.status}`);
+    }
+    const data = await response.json();
+    user_name = data.user_name;
+    assistant_name = data.assistant_name;
+});
+
+function docReady(fn) {
+  // see if DOM is already available
+  if (
+    document.readyState === "complete" ||
+    document.readyState === "interactive"
+  ) {
+    // call on next available tick
+    setTimeout(fn, 1);
+  } else {
+    document.addEventListener("DOMContentLoaded", fn);
+  }
+}
+
+function updateView() {
+  let conv_view = document.getElementById("conversation_view");
+  // build view
+  conv_view.innerHTML = "";
+  for (let index in conversation) {
+    conversation[index].assistant = conversation[index].assistant.replace(
+        user_name,
+      ""
+    );
+    conv_view.innerHTML += `
+          <p><span style="font-weight: bold">User:</span> ${conversation[index].user}<p>
+          <p style="white-space: pre-line;"><span style="font-weight: bold">Assistant:</span> ${conversation[index].assistant}<p>`;
+  }
+}
+
+async function call_llama(options) {
+  const response = await fetch("/completion", {
+    method: "POST",
+    body: JSON.stringify(options),
+    headers: {
+      Connection: "keep-alive",
+      "Content-Type": "application/json",
+      Accept: "text/event-stream",
+    },
+  });
+
+  const reader = response.body.getReader();
+  let cont = true;
+  const decoder = new TextDecoder();
+  let leftover = ""; // Buffer for partially read lines
+
+  try {
+    let cont = true;
+
+    while (cont) {
+      const result = await reader.read();
+      if (result.done) {
+        document.getElementById("btn_send").disabled = false;
+        break;
+      }
+
+      // Add any leftover data to the current chunk of data
+      const text = leftover + decoder.decode(result.value);
+
+      // Check if the last character is a line break
+      const endsWithLineBreak = text.endsWith("\n");
+
+      // Split the text into lines
+      let lines = text.split("\n");
+
+      // If the text doesn't end with a line break, then the last line is incomplete
+      // Store it in leftover to be added to the next chunk of data
+      if (!endsWithLineBreak) {
+        leftover = lines.pop();
+      } else {
+        leftover = ""; // Reset leftover if we have a line break at the end
+      }
+
+      // Parse all sse events and add them to result
+      const regex = /^(\S+):\s(.*)$/gm;
+      for (const line of lines) {
+        const match = regex.exec(line);
+        if (match) {
+          result[match[1]] = match[2];
+          // since we know this is llama.cpp, let's just decode the json in data
+          if (result.data) {
+            result.data = JSON.parse(result.data);
+            conversation[current_message].assistant += result.data.content;
+            updateView();
+          }
+        }
+      }
+    }
+  } catch (e) {
+    if (e.name !== "AbortError") {
+      console.error("llama error: ", e);
+    }
+    throw e;
+  }
+}
+
+function generatePrompt() {
+  // generate a good prompt to have coherence
+  let prompt = "";
+  for (let index in conversation) {
+    if (index == 0) {
+      prompt += conversation[index].user + "\n";
+    } else {
+      prompt += user_name + conversation[index].user + "\n";
+    }
+    if (index == current_message) {
+      prompt += assistant_name;
+    } else {
+      prompt += assistant_name + conversation[index].assistant;
+    }
+  }
+  return prompt;
+}
+
+function resetBtn() {
+  document.getElementById("slot_id").value = "-1";
+  document.getElementById("temperature").value = "0.1";
+  document.getElementById("message").value =
+    questions[Math.floor(Math.random() * questions.length)];
+  document.getElementById("conversation_view").innerHTML = "";
+  conversation = [];
+  current_message = -1;
+}
+
+async function perform() {
+  var slot_id = parseInt(document.getElementById("slot_id").value);
+  var temperature = parseFloat(document.getElementById("temperature").value);
+  var prompt = " " + document.getElementById("message").value;
+  if (!isNaN(slot_id) && !isNaN(temperature) && prompt.length > 0) {
+    let options = {
+        slot_id,
+        temperature
+    };
+    if(document.getElementById("system_promt_cb").checked) {
+        let system_prompt = document.getElementById("sp_text").value;
+        let anti_prompt = document.getElementById("user_name").value;
+        let assistant_name_ = document.getElementById("assistant_name").value;
+        if(!system_prompt || !anti_prompt || !assistant_name_) {
+          document.getElementById("conversation_view").innerText =
+                "please, insert valid props.";
+          return;
+        }
+        conversation = [];
+        current_message = -1;
+        document.getElementById("system_promt_cb").checked = false;
+        document.getElementById("system_promt_cb").dispatchEvent(new Event("change"));
+        options.system_prompt = system_prompt;
+        options.anti_prompt = anti_prompt;
+        options.assistant_name = assistant_name_;
+        user_name = anti_prompt;
+        assistant_name = assistant_name_;
+    }
+    current_message++;
+    conversation.push({
+      user: prompt,
+      assistant: "",
+    });
+    updateView();
+    document.getElementById("message").value = "";
+    document.getElementById("btn_send").disabled = true;
+    options.prompt = generatePrompt();
+    await call_llama(options);
+  } else {
+    document.getElementById("conversation_view").innerText =
+      "please, insert valid props.";
+  }
+}
+
+)";
--- a/examples/server-parallel/server.cpp
+++ b/examples/server-parallel/server.cpp
@@ -0,0 +1,884 @@
+#include "frontend.h"
+#include "common.h"
+#include "llama.h"
+
+#include "../server/httplib.h"
+#include "../server/json.hpp"
+
+#include <iostream>
+#include <sstream>
+#include <thread>
+#include <vector>
+#include <chrono>
+
+using namespace httplib;
+using namespace std;
+using namespace nlohmann;
+
+struct server_params
+{
+    std::string hostname = "127.0.0.1";
+    std::string public_path = "examples/server/public";
+    int32_t port = 8080;
+    int32_t read_timeout = 600;
+    int32_t write_timeout = 600;
+};
+
+// utils functions taken of examples/server
+
+static bool ends_with(const std::string &str, const std::string &suffix)
+{
+    return str.size() >= suffix.size() &&
+           0 == str.compare(str.size() - suffix.size(), suffix.size(), suffix);
+}
+
+static size_t find_partial_stop_string(const std::string &stop,
+                                       const std::string &text)
+{
+    if (!text.empty() && !stop.empty())
+    {
+        const char text_last_char = text.back();
+        for (int64_t char_index = stop.size() - 1; char_index >= 0; char_index--)
+        {
+            if (stop[char_index] == text_last_char)
+            {
+                const std::string current_partial = stop.substr(0, char_index + 1);
+                if (ends_with(text, current_partial))
+                {
+                    return text.size() - char_index - 1;
+                }
+            }
+        }
+    }
+    return std::string::npos;
+}
+
+enum stop_type
+{
+    STOP_FULL,
+    STOP_PARTIAL,
+};
+
+enum slot_state
+{
+    IDLE,
+    PROCESSING
+};
+
+enum slot_command {
+    NONE,
+    LOAD_PROMPT,
+    RELEASE
+};
+
+
+struct llama_client_slot
+{
+    int id;
+    int32_t n_prompt  = 0;
+    int32_t n_decoded = 0;
+    int32_t i_batch   = -1;
+    string prompt = "";
+    string sampled_token_str;
+    string generated_text = "";
+    llama_token sampled;
+    std::vector<llama_token> tokens_prev;
+    slot_state state = IDLE;
+    slot_command command = NONE;
+    bool newToken = false;
+    float temperature = 0.1f;
+
+    void start(string prompt_, float temp_) {
+        prompt = prompt_;
+        command = LOAD_PROMPT;
+        temperature = temp_;
+        newToken = false;
+    }
+
+    bool hasNewToken() {
+        if(newToken) {
+            newToken = false;
+            return true;
+        }
+        return false;
+    }
+
+    bool available() {
+        return state == IDLE && command == NONE;
+    }
+
+    void nofity() {
+        newToken = !newToken;
+    }
+
+    void release() {
+        if(state == PROCESSING) {
+            command = RELEASE;
+        }
+    }
+};
+
+struct server_parallel_context {
+    // example props
+    vector<llama_client_slot> slots;
+    std::string system_prompt = "";
+    bool update_system_prompt = true;
+
+    // broadcast to all clients to keep the same prompt format
+    std::string user_name = ""; // this should be the anti prompt
+    std::string assistant_name = ""; // this is for generate the prompt
+
+    // llama native props
+    gpt_params params;
+    llama_model *model = NULL;
+    llama_context *ctx = NULL;
+    int n_ctx;
+    int n_vocab;
+    std::vector<llama_token_data> candidates;
+    std::vector<llama_token> tokens_system;
+    int32_t n_tokens_system = 0;
+    llama_batch batch;
+
+    bool loadModel(gpt_params params_) {
+        params = params_;
+        std::tie(model, ctx) = llama_init_from_gpt_params(params);
+        if (model == nullptr)
+        {
+            LOG_TEE("unable to load model: %s", params.model.c_str());
+            return false;
+        }
+        n_ctx = llama_n_ctx(ctx);
+        n_vocab = llama_n_vocab(model);
+        candidates.reserve(n_vocab);
+        return true;
+    }
+
+    void initialize() {
+        // create slots
+        LOG_TEE("Available slots:\n");
+        for (int i = 0; i < params.n_parallel; i++)
+        {
+            llama_client_slot slot;
+            slot.id = i;
+            slot.prompt = "default";
+            slot.state = IDLE;
+            slot.tokens_prev.resize(std::max(256, params.n_predict));
+            std::fill(slot.tokens_prev.begin(), slot.tokens_prev.end(), 0);
+            LOG_TEE(" - slot %i\n", slot.id);
+            slots.push_back(slot);
+        }
+        batch = llama_batch_init(params.n_ctx, 0);
+
+        // always assign a default system prompt
+        system_prompt = system_prompt_default;
+        user_name = "User:";
+        assistant_name = "Assistant:";
+        params.antiprompt.push_back(user_name);
+    }
+
+    void updateSystemPrompt() {
+        tokens_system = ::llama_tokenize(ctx, system_prompt, true);
+        n_tokens_system = tokens_system.size();
+
+        batch.n_tokens = n_tokens_system;
+
+        // clear the entire KV cache
+        for (int i = 0; i < params.n_parallel; ++i)
+        {
+            llama_kv_cache_seq_rm(ctx, i, 0, -1);
+        }
+
+        for (int32_t i = 0; i < batch.n_tokens; ++i)
+        {
+            batch.token[i] = tokens_system[i];
+            batch.pos[i] = i;
+            batch.seq_id[i] = 0;
+            batch.logits[i] = false;
+        }
+
+        if (llama_decode(ctx, batch) != 0)
+        {
+            LOG_TEE("%s: llama_decode() failed\n", __func__);
+            return;
+        }
+
+        // assign the system KV cache to all parallel sequences
+        for (int32_t i = 1; i < params.n_parallel; ++i)
+        {
+            llama_kv_cache_seq_cp(ctx, 0, i, 0, n_tokens_system);
+        }
+
+        LOG_TEE("system prompt updated\n");
+        update_system_prompt = false;
+    }
+
+    void notifySystemPromptChanged() {
+        // release all slots
+        for (llama_client_slot &slot : slots)
+        {
+            slot.release();
+        }
+        waitAllAreIdle();
+        // wait until system prompt load
+        update_system_prompt = true;
+        while(update_system_prompt) {
+            this_thread::sleep_for(chrono::milliseconds(5));
+        }
+        // system prompt loaded, continue
+    }
+
+    llama_client_slot* requestCompletion(json data) {
+        if(data.contains("system_prompt") &&
+            data.contains("anti_prompt") &&
+            data.contains("assistant_name")) {
+            system_prompt = data.value("system_prompt", "");
+            user_name = data.value("anti_prompt", "");
+            assistant_name = data.value("assistant_name", "");
+            params.antiprompt.clear();
+            params.antiprompt.push_back(user_name);
+            notifySystemPromptChanged();
+        }
+        int slot_id = data.value("slot_id", -1);
+        float temperature = data.value("temperature", 0.1f);
+        string prompt = data.value("prompt", "");
+        for (llama_client_slot & slot : slots)
+        {
+            if ((slot_id == -1 && slot.available()) || slot.id == slot_id)
+            {
+                slot.start(prompt, temperature);
+                LOG_TEE("slot %i is processing\n", slot.id);
+                return &slot; // return a pointer to slot (thread safe?)
+            }
+        }
+        return nullptr;
+    }
+
+    size_t findStoppingStrings(const std::string &text, const size_t last_token_size,
+                               const stop_type type)
+    {
+        size_t stop_pos = std::string::npos;
+        for (const std::string &word : params.antiprompt)
+        {
+            size_t pos;
+            if (type == STOP_FULL)
+            {
+                const size_t tmp = word.size() + last_token_size;
+                const size_t from_pos = text.size() > tmp ? text.size() - tmp : 0;
+                pos = text.find(word, from_pos);
+            }
+            else
+            {
+                pos = find_partial_stop_string(word, text);
+            }
+            if (pos != std::string::npos &&
+                (stop_pos == std::string::npos || pos < stop_pos))
+            {
+                stop_pos = pos;
+            }
+        }
+        return stop_pos;
+    }
+
+    void waitAllAreIdle() {
+        bool wait = true;
+        while(wait) {
+            wait = false;
+            for (auto &slot : slots)
+            {
+                if (!slot.available())
+                {
+                    wait = true;
+                    break;
+                }
+            }
+        }
+    }
+
+    bool updateSlots() {
+        // update the system prompt wait until all slots are idle state
+        if(update_system_prompt) {
+            updateSystemPrompt();
+        }
+
+        batch.n_tokens = 0;
+
+        // decode any currently ongoing sequences
+        for (auto & slot : slots) {
+            if (slot.state == PROCESSING && slot.command == RELEASE)
+            {
+                LOG_TEE("slot %i released\n", slot.id);
+                llama_kv_cache_seq_rm(ctx, slot.id, n_tokens_system, n_ctx);
+                slot.state = IDLE;
+                slot.command = NONE;
+                continue;
+            }
+
+            // no decode wait until the token had been send to client
+            // improves performance and avoid decoherence?
+
+            if (slot.state == IDLE || slot.newToken) {
+                continue;
+            }
+
+            batch.token [batch.n_tokens] = slot.sampled;
+            batch.pos   [batch.n_tokens] = n_tokens_system + slot.n_prompt + slot.n_decoded;
+            batch.seq_id[batch.n_tokens] = slot.id;
+            batch.logits[batch.n_tokens] = true;
+
+            slot.n_decoded += 1;
+            slot.i_batch = batch.n_tokens;
+
+            batch.n_tokens += 1;
+        }
+
+        // assign workload to the slots
+        if (params.cont_batching || batch.n_tokens == 0) {
+            for (llama_client_slot & slot : slots) {
+                // need process the prompt
+                if (slot.state == IDLE && slot.command == LOAD_PROMPT) {
+                    slot.state = PROCESSING;
+                    slot.command = NONE;
+                    //LOG_TEE("slot %i process prompt:\n%s%s'------------------------------\n", slot.id, system_prompt.c_str(), slot.prompt.c_str());
+                    std::fill(slot.tokens_prev.begin(), slot.tokens_prev.end(), 0);
+
+                    // do not prepend BOS because we have a system prompt!
+                    std::vector<llama_token> tokens_prompt;
+                    tokens_prompt = ::llama_tokenize(ctx, slot.prompt, false);
+
+                    for (size_t i = 0; i < tokens_prompt.size(); ++i) {
+                        batch.token [batch.n_tokens] = tokens_prompt[i];
+                        batch.pos   [batch.n_tokens] = i + n_tokens_system;
+                        batch.seq_id[batch.n_tokens] = slot.id;
+                        batch.logits[batch.n_tokens] = false;
+                        batch.n_tokens += 1;
+                    }
+
+                    // extract the logits only for the last token
+                    if (batch.n_tokens > 0) {
+                        batch.logits[batch.n_tokens - 1] = true;
+                    }
+
+                    slot.n_prompt  = tokens_prompt.size();
+                    slot.n_decoded = 0;
+                    slot.i_batch   = batch.n_tokens - 1;
+
+                    // insert new requests one-by-one
+                    //if (cont_batching) {
+                    //    break;
+                    //}
+                }
+            }
+        }
+
+        if (batch.n_tokens == 0) {
+            return true;
+        }
+
+        // process in chunks of params.n_batch
+        int32_t n_batch = params.n_batch;
+
+        for (int32_t i = 0; i < (int32_t) batch.n_tokens; i += n_batch) {
+            // experiment: process in powers of 2
+            //if (i + n_batch > (int32_t) batch.n_tokens && n_batch > 32) {
+            //    n_batch /= 2;
+            //    i -= n_batch;
+            //    continue;
+            //}
+
+            const int32_t n_tokens = std::min(n_batch, (int32_t) (batch.n_tokens - i));
+
+            llama_batch batch_view = {
+                n_tokens,
+                batch.token  + i,
+                nullptr,
+                batch.pos    + i,
+                batch.seq_id + i,
+                batch.logits + i,
+                0, 0, 0, // unused
+            };
+
+            const int ret = llama_decode(ctx, batch_view);
+            if (ret != 0) {
+                if (n_batch == 1 || ret < 0) {
+                    // if you get here, it means the KV cache is full - try increasing it via the context size
+                    LOG_TEE("%s : failed to decode the batch, n_batch = %d, ret = %d\n", __func__, n_batch, ret);
+                    return false;
+                }
+
+                LOG("%s : failed to decode the batch, retrying with n_batch = %d\n", __func__, n_batch / 2);
+
+                // retry with half the batch size to try to find a free slot in the KV cache
+                n_batch /= 2;
+                i -= n_batch;
+                continue;
+            }
+
+            for (auto & slot : slots) {
+                if (slot.i_batch < (int) i || slot.i_batch >= (int) (i + n_tokens)) {
+                    continue;
+                }
+
+                params.temp = slot.temperature;
+                const llama_token id = llama_sample_token(ctx, NULL, NULL, params, slot.tokens_prev, candidates, slot.i_batch - i);
+
+                // remember which tokens were sampled - used for repetition penalties during sampling
+                slot.tokens_prev.erase(slot.tokens_prev.begin());
+                slot.tokens_prev.push_back(id);
+
+                const std::string token_str = llama_token_to_piece(ctx, id);
+                slot.generated_text += token_str;
+                slot.sampled = id;
+
+                size_t stop_pos =
+                        findStoppingStrings(slot.generated_text, token_str.size(), STOP_FULL);
+
+                slot.sampled_token_str = token_str;
+                // notify new token
+                slot.nofity();
+
+                if (slot.n_decoded > 2 &&
+                        (id == llama_token_eos(ctx) ||
+                         (params.n_predict > 0 &&
+                        slot.n_decoded + slot.n_prompt >=
+                        params.n_predict) ||
+                         stop_pos != std::string::npos)) {
+                    //LOG_TEE("slot %i generated text:\n%s'------------------------------\n", slot.id, slot.generated_text.c_str());
+                    slot.generated_text.clear();
+                    slot.release();
+                }
+                slot.i_batch = -1;
+            }
+        }
+        return true;
+    }
+};
+
+static void server_print_usage(const char *argv0, const gpt_params &params,
+                               const server_params &sparams)
+{
+    printf("usage: %s [options]\n", argv0);
+    printf("\n");
+    printf("options:\n");
+    printf("  -h, --help            show this help message and exit\n");
+    printf("  -t N, --threads N     number of threads to use during computation (default: %d)\n", params.n_threads);
+    printf("  -c N, --ctx-size N    size of the prompt context (default: %d)\n", params.n_ctx);
+    printf("  --rope-freq-base N    RoPE base frequency (default: loaded from model)\n");
+    printf("  --rope-freq-scale N   RoPE frequency scaling factor (default: loaded from model)\n");
+    printf("  -b N, --batch-size N  batch size for prompt processing (default: %d)\n", params.n_batch);
+    printf("  --memory-f32          use f32 instead of f16 for memory key+value (default: disabled)\n");
+    printf("                        not recommended: doubles context memory required and no measurable increase in quality\n");
+    if (llama_mlock_supported())
+    {
+        printf("  --mlock               force system to keep model in RAM rather than swapping or compressing\n");
+    }
+    if (llama_mmap_supported())
+    {
+        printf("  --no-mmap             do not memory-map model (slower load but may reduce pageouts if not using mlock)\n");
+    }
+    printf("  --numa                attempt optimizations that help on some NUMA systems\n");
+#ifdef LLAMA_SUPPORTS_GPU_OFFLOAD
+    printf("  -ngl N, --n-gpu-layers N\n");
+    printf("                        number of layers to store in VRAM\n");
+    printf("  -ts SPLIT --tensor-split SPLIT\n");
+    printf("                        how to split tensors across multiple GPUs, comma-separated list of proportions, e.g. 3,1\n");
+    printf("  -mg i, --main-gpu i   the GPU to use for scratch and small tensors\n");
+    printf("  -nommq, --no-mul-mat-q\n");
+    printf("                        use cuBLAS instead of custom mul_mat_q CUDA kernels.\n");
+    printf("                        Not recommended since this is both slower and uses more VRAM.\n");
+#endif
+    printf("  -m FNAME, --model FNAME\n");
+    printf("                        model path (default: %s)\n", params.model.c_str());
+    printf("  -a ALIAS, --alias ALIAS\n");
+    printf("                        set an alias for the model, will be added as `model` field in completion response\n");
+    printf("  --lora FNAME          apply LoRA adapter (implies --no-mmap)\n");
+    printf("  --lora-base FNAME     optional model to use as a base for the layers modified by the LoRA adapter\n");
+    printf("  --host                ip address to listen (default  (default: %s)\n", sparams.hostname.c_str());
+    printf("  --port PORT           port to listen (default  (default: %d)\n", sparams.port);
+    printf("  --path PUBLIC_PATH    path from which to serve static files (default %s)\n", sparams.public_path.c_str());
+    printf("  -to N, --timeout N    server read/write timeout in seconds (default: %d)\n", sparams.read_timeout);
+
+    // new arguments
+    printf("  -np N, --parallel N   number of parallel sequences to decode (default: %d)\n", params.n_parallel);
+    printf("  -cb, --cont-batching  enable continuous batching (a.k.a dynamic batching) (default: disabled)\n");
+    printf("  -f FNAME, --file FNAME\n");
+    printf("                        load a system prompt from a file.\n");
+    printf("\n");
+}
+
+static void server_params_parse(int argc, char **argv, server_params &sparams,
+                                gpt_params &params)
+{
+    gpt_params default_params;
+    server_params default_sparams;
+    std::string arg;
+    bool invalid_param = false;
+
+    for (int i = 1; i < argc; i++)
+    {
+        arg = argv[i];
+        if (arg == "--port")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            sparams.port = std::stoi(argv[i]);
+        }
+        else if (arg == "--host")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            sparams.hostname = argv[i];
+        }
+        else if (arg == "--path")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            sparams.public_path = argv[i];
+        }
+        else if (arg == "--timeout" || arg == "-to")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            sparams.read_timeout = std::stoi(argv[i]);
+            sparams.write_timeout = std::stoi(argv[i]);
+        }
+        else if (arg == "-m" || arg == "--model")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.model = argv[i];
+        }
+        else if (arg == "-a" || arg == "--alias")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.model_alias = argv[i];
+        }
+        else if (arg == "-h" || arg == "--help")
+        {
+            server_print_usage(argv[0], default_params, default_sparams);
+            exit(0);
+        }
+        else if (arg == "-c" || arg == "--ctx-size" || arg == "--ctx_size")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.n_ctx = std::stoi(argv[i]);
+        }
+        else if (arg == "--rope-freq-base")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.rope_freq_base = std::stof(argv[i]);
+        }
+        else if (arg == "--rope-freq-scale")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.rope_freq_scale = std::stof(argv[i]);
+        }
+        else if (arg == "--memory-f32" || arg == "--memory_f32")
+        {
+            params.memory_f16 = false;
+        }
+        else if (arg == "--threads" || arg == "-t")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.n_threads = std::stoi(argv[i]);
+        }
+        else if (arg == "-b" || arg == "--batch-size")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.n_batch = std::stoi(argv[i]);
+            params.n_batch = std::min(512, params.n_batch);
+        }
+        else if (arg == "--gpu-layers" || arg == "-ngl" || arg == "--n-gpu-layers")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+#ifdef LLAMA_SUPPORTS_GPU_OFFLOAD
+            params.n_gpu_layers = std::stoi(argv[i]);
+#else
+            LOG_TEE("Not compiled with GPU offload support, --n-gpu-layers option will be ignored. "
+                        "See main README.md for information on enabling GPU BLAS support\n");
+#endif
+        }
+        else if (arg == "--tensor-split" || arg == "-ts")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+#ifdef GGML_USE_CUBLAS
+            std::string arg_next = argv[i];
+
+            // split string by , and /
+            const std::regex regex{R"([,/]+)"};
+            std::sregex_token_iterator it{arg_next.begin(), arg_next.end(), regex, -1};
+            std::vector<std::string> split_arg{it, {}};
+            GGML_ASSERT(split_arg.size() <= LLAMA_MAX_DEVICES);
+
+            for (size_t i_device = 0; i_device < LLAMA_MAX_DEVICES; ++i_device)
+            {
+                if (i_device < split_arg.size())
+                {
+                    params.tensor_split[i_device] = std::stof(split_arg[i_device]);
+                }
+                else
+                {
+                    params.tensor_split[i_device] = 0.0f;
+                }
+            }
+#else
+            LOG_TEE("llama.cpp was compiled without cuBLAS. It is not possible to set a tensor split.\n");
+#endif // GGML_USE_CUBLAS
+        }
+        else if (arg == "--no-mul-mat-q" || arg == "-nommq")
+        {
+#ifdef GGML_USE_CUBLAS
+            params.mul_mat_q = false;
+#else
+            LOG_TEE("warning: llama.cpp was compiled without cuBLAS. Disabling mul_mat_q kernels has no effect.\n");
+#endif // GGML_USE_CUBLAS
+        }
+        else if (arg == "--main-gpu" || arg == "-mg")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+#ifdef GGML_USE_CUBLAS
+            params.main_gpu = std::stoi(argv[i]);
+#else
+            LOG_TEE("llama.cpp was compiled without cuBLAS. It is not possible to set a main GPU.");
+#endif
+        }
+        else if (arg == "--lora")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.lora_adapter.push_back({argv[i], 1.0f});
+            params.use_mmap = false;
+        }
+        else if (arg == "--lora-scaled")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            const char * lora_adapter = argv[i];
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.lora_adapter.push_back(make_tuple(lora_adapter, std::stof(argv[i])));
+            params.use_mmap = false;
+        }
+        else if (arg == "--lora-base")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.lora_base = argv[i];
+        }
+        else if (arg == "--mlock")
+        {
+            params.use_mlock = true;
+        }
+        else if (arg == "--no-mmap")
+        {
+            params.use_mmap = false;
+        }
+        else if (arg == "--numa")
+        {
+            params.numa = true;
+        } else if (arg == "-cb" || arg == "--cont-batching")
+        {
+            params.cont_batching = true;
+        }
+        else if (arg == "-np" || arg == "--parallel")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.n_parallel = std::stoi(argv[i]);
+        } else if (arg == "-n" || arg == "--n-predict")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.n_predict = std::stoi(argv[i]);
+        } else if (arg == "-r" || arg == "--reverse-prompt")
+        {
+            if (++i >= argc)
+            {
+                invalid_param = true;
+                break;
+            }
+            params.antiprompt.push_back(argv[i]);
+        }
+        else
+        {
+            fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
+            server_print_usage(argv[0], default_params, default_sparams);
+            exit(1);
+        }
+    }
+
+    if (invalid_param)
+    {
+        fprintf(stderr, "error: invalid parameter for argument: %s\n", arg.c_str());
+        server_print_usage(argv[0], default_params, default_sparams);
+        exit(1);
+    }
+}
+
+int main(int argc, char **argv)
+{
+    gpt_params params;
+
+    server_params sparams;
+
+    server_params_parse(argc, argv, sparams, params);
+
+#ifndef LOG_DISABLE_LOGS
+    log_set_target(log_filename_generator("server-parallel", "log"));
+    LOG_TEE("Log start\n");
+    log_dump_cmdline(argc, argv);
+#endif // LOG_DISABLE_LOGS
+
+    llama_backend_init(params.numa);
+
+    // load the target model
+    params.logits_all = true;
+    server_parallel_context llama;
+
+    if(!llama.loadModel(params)) {
+        return 1;
+    }
+
+    llama.initialize();
+
+    Server svr;
+
+    svr.Get("/", [&](const Request & /*req*/, Response &res)
+            { res.set_content(index_html_, "text/html"); });
+
+    svr.Get("/index.js", [&](const Request & /*req*/, Response &res)
+            { res.set_content(index_js_, "text/html"); });
+
+    svr.Get("/props", [&llama](const Request & /*req*/, Response &res)
+            {
+                json data = {
+                    { "user_name", llama.user_name.c_str() },
+                    { "assistant_name", llama.assistant_name.c_str() }
+                };
+                res.set_content(data.dump(), "application/json"); });
+
+    svr.Post("/completion", [&llama](const Request &req, Response &res)
+             {
+        llama_client_slot* slot = llama.requestCompletion(json::parse(req.body));
+        // Verify if the slot exist
+        if (slot) {
+            res.set_chunked_content_provider("text/event-stream",
+                [slot](size_t /*offset*/, DataSink &sink) {
+                    if(slot->available()) { // slot has been released
+                        sink.done();
+                        return false;
+                    }
+                    if(slot->hasNewToken()) { // new token notification
+                        stringstream ss;
+                        json res_d = {{ "content", slot->sampled_token_str }};
+                        ss << "data: " << res_d.dump() << "\n\n";
+                        string result = ss.str();
+                        if(!sink.write(result.c_str(), result.size())) {
+                            slot->release();
+                            return false;
+                        }
+                    }
+                    return true;
+                });
+        } else {
+            LOG_TEE("slot unavailable\n");
+            res.status = 404;
+            res.set_content("slot_error", "text/plain");
+        } });
+
+    thread t([&llama]()
+             {
+            bool running = true;
+            while (running)
+            {
+                running = llama.updateSlots();
+            } });
+
+    svr.set_read_timeout(sparams.read_timeout);
+    svr.set_write_timeout(sparams.write_timeout);
+
+    if (!svr.bind_to_port(sparams.hostname, sparams.port))
+    {
+        fprintf(stderr, "\ncouldn't bind to server socket: hostname=%s port=%d\n\n", sparams.hostname.c_str(), sparams.port);
+        return 1;
+    }
+
+    // Set the base directory for serving static files
+    svr.set_base_dir(sparams.public_path);
+
+    // to make it ctrl+clickable:
+    printf("\nllama server listening at http://%s:%d\n\n", sparams.hostname.c_str(), sparams.port);
+
+    if (!svr.listen_after_bind())
+    {
+        return 1;
+    }
+}
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -114,9 +114,9 @@ node index.js

    `top_k`: Limit the next token selection to the K most probable tokens (default: 40).

-    `top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).
+    `top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).

-    `n_predict`: Set the number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, -1 = infinity).
+    `n_predict`: Set the number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: 128, -1 = infinity).

    `n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context.
    By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.
@@ -156,8 +156,6 @@ node index.js

    `logit_bias`: Modify the likelihood of a token appearing in the generated text completion. For example, use `"logit_bias": [[15043,1.0]]` to increase the likelihood of the token 'Hello', or `"logit_bias": [[15043,-1.0]]` to decrease its likelihood. Setting the value to false, `"logit_bias": [[15043,false]]` ensures that the token `Hello` is never produced (default: []).

-    `n_probs`: If greater than 0, the response also contains the probabilities of top N tokens for each generated token (default: 0)
-
 -   **POST** `/tokenize`: Tokenize a given text.

    *Options:*
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -534,20 +534,98 @@ struct llama_server_context
            return result;
        }

+        // out of user input, sample next token
+        const float temp = params.temp;
+        const int32_t top_k = params.top_k <= 0 ? llama_n_vocab(model) : params.top_k;
+        const float top_p = params.top_p;
+        const float tfs_z = params.tfs_z;
+        const float typical_p = params.typical_p;
+        const int32_t repeat_last_n = params.repeat_last_n < 0 ? n_ctx : params.repeat_last_n;
+        const float repeat_penalty = params.repeat_penalty;
+        const float alpha_presence = params.presence_penalty;
+        const float alpha_frequency = params.frequency_penalty;
+        const int mirostat = params.mirostat;
+        const float mirostat_tau = params.mirostat_tau;
+        const float mirostat_eta = params.mirostat_eta;
+        const bool penalize_nl = params.penalize_nl;
+        const int32_t n_probs = params.n_probs;
+
        {
-            // out of user input, sample next token
-            std::vector<llama_token_data> candidates;
-            candidates.reserve(llama_n_vocab(model));
+            auto *logits = llama_get_logits(ctx);
+            auto n_vocab = llama_n_vocab(model);

-            result.tok = llama_sample_token(ctx, NULL, grammar, params, last_n_tokens, candidates);
-
-            llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
-
-            const int32_t n_probs = params.n_probs;
-            if (params.temp <= 0 && n_probs > 0)
+            // Apply params.logit_bias map
+            for (const auto &it : params.logit_bias)
            {
-                // For llama_sample_token_greedy we need to sort candidates
-                llama_sample_softmax(ctx, &candidates_p);
+                logits[it.first] += it.second;
+            }
+
+            std::vector<llama_token_data> candidates;
+            candidates.reserve(n_vocab);
+            for (llama_token token_id = 0; token_id < n_vocab; token_id++)
+            {
+                candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
+            }
+
+            llama_token_data_array candidates_p = {candidates.data(), candidates.size(), false};
+
+            // Apply penalties
+            float nl_logit = logits[llama_token_nl(ctx)];
+            auto last_n_repeat = std::min(std::min((int)last_n_tokens.size(), repeat_last_n), n_ctx);
+            llama_sample_repetition_penalty(ctx, &candidates_p,
+                                            last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
+                                            last_n_repeat, repeat_penalty);
+            llama_sample_frequency_and_presence_penalties(ctx, &candidates_p,
+                                                          last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
+                                                          last_n_repeat, alpha_frequency, alpha_presence);
+            if (!penalize_nl)
+            {
+                logits[llama_token_nl(ctx)] = nl_logit;
+            }
+
+            if (grammar != nullptr) {
+                llama_sample_grammar(ctx, &candidates_p, grammar);
+            }
+
+            if (temp <= 0)
+            {
+                // Greedy sampling
+                result.tok = llama_sample_token_greedy(ctx, &candidates_p);
+                if (n_probs > 0)
+                {
+                    llama_sample_softmax(ctx, &candidates_p);
+                }
+            }
+            else
+            {
+                if (mirostat == 1)
+                {
+                    static float mirostat_mu = 2.0f * mirostat_tau;
+                    const int mirostat_m = 100;
+                    llama_sample_temp(ctx, &candidates_p, temp);
+                    result.tok = llama_sample_token_mirostat(ctx, &candidates_p, mirostat_tau, mirostat_eta, mirostat_m, &mirostat_mu);
+                }
+                else if (mirostat == 2)
+                {
+                    static float mirostat_mu = 2.0f * mirostat_tau;
+                    llama_sample_temp(ctx, &candidates_p, temp);
+                    result.tok = llama_sample_token_mirostat_v2(ctx, &candidates_p, mirostat_tau, mirostat_eta, &mirostat_mu);
+                }
+                else
+                {
+                    // Temperature sampling
+                    size_t min_keep = std::max(1, n_probs);
+                    llama_sample_top_k(ctx, &candidates_p, top_k, min_keep);
+                    llama_sample_tail_free(ctx, &candidates_p, tfs_z, min_keep);
+                    llama_sample_typical(ctx, &candidates_p, typical_p, min_keep);
+                    llama_sample_top_p(ctx, &candidates_p, top_p, min_keep);
+                    llama_sample_temp(ctx, &candidates_p, temp);
+                    result.tok = llama_sample_token(ctx, &candidates_p);
+                }
+            }
+
+            if (grammar != nullptr) {
+                llama_grammar_accept_token(ctx, grammar, result.tok);
            }

            for (size_t i = 0; i < std::min(candidates_p.size, (size_t)n_probs); ++i)
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -81,18 +81,18 @@ struct ggml_metal_context {
    GGML_METAL_DECL_KERNEL(get_rows_q6_K);
    GGML_METAL_DECL_KERNEL(rms_norm);
    GGML_METAL_DECL_KERNEL(norm);
-    GGML_METAL_DECL_KERNEL(mul_mv_f32_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_f16_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_f16_f32_1row);
-    GGML_METAL_DECL_KERNEL(mul_mv_f16_f32_l4);
-    GGML_METAL_DECL_KERNEL(mul_mv_q4_0_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_q4_1_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_q8_0_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_q2_K_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_q3_K_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_q4_K_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_q5_K_f32);
-    GGML_METAL_DECL_KERNEL(mul_mv_q6_K_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_f32_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_f16_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_f16_f32_1row);
+    GGML_METAL_DECL_KERNEL(mul_mat_f16_f32_l4);
+    GGML_METAL_DECL_KERNEL(mul_mat_q4_0_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q4_1_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q8_0_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q2_K_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q3_K_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q4_K_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q5_K_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q6_K_f32);
    GGML_METAL_DECL_KERNEL(mul_mm_f32_f32);
    GGML_METAL_DECL_KERNEL(mul_mm_f16_f32);
    GGML_METAL_DECL_KERNEL(mul_mm_q4_0_f32);
@@ -109,8 +109,6 @@ struct ggml_metal_context {
    GGML_METAL_DECL_KERNEL(cpy_f32_f16);
    GGML_METAL_DECL_KERNEL(cpy_f32_f32);
    GGML_METAL_DECL_KERNEL(cpy_f16_f16);
-    GGML_METAL_DECL_KERNEL(concat);
-    GGML_METAL_DECL_KERNEL(sqr);

 #undef GGML_METAL_DECL_KERNEL
 };
@@ -185,44 +183,56 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {

    ctx->d_queue = dispatch_queue_create("ggml-metal", DISPATCH_QUEUE_CONCURRENT);

-    // load library
+#ifdef GGML_SWIFT
+    // load the default.metallib file
    {
-        NSBundle * bundle = nil;
-#ifdef SWIFT_PACKAGE
-        bundle = SWIFTPM_MODULE_BUNDLE;
-#else
-        bundle = [NSBundle bundleForClass:[GGMLMetalClass class]];
-#endif
        NSError * error = nil;
-        NSString * libPath = [bundle pathForResource:@"default" ofType:@"metallib"];
-        if (libPath != nil) {
-            NSURL * libURL = [NSURL fileURLWithPath:libPath];
-            GGML_METAL_LOG_INFO("%s: loading '%s'\n", __func__, [libPath UTF8String]);
-            ctx->library = [ctx->device newLibraryWithURL:libURL error:&error];
-        } else {
-            GGML_METAL_LOG_INFO("%s: default.metallib not found, loading from source\n", __func__);

-            NSString * sourcePath = [bundle pathForResource:@"ggml-metal" ofType:@"metal"];
-            GGML_METAL_LOG_INFO("%s: loading '%s'\n", __func__, [sourcePath UTF8String]);
-            NSString * src = [NSString stringWithContentsOfFile:sourcePath encoding:NSUTF8StringEncoding error:&error];
-            if (error) {
-                GGML_METAL_LOG_ERROR("%s: error: %s\n", __func__, [[error description] UTF8String]);
-                return NULL;
-            }
+        NSBundle * bundle = [NSBundle bundleForClass:[GGMLMetalClass class]];
+        NSString * llamaBundlePath = [bundle pathForResource:@"llama_llama" ofType:@"bundle"];
+        NSBundle * llamaBundle = [NSBundle bundleWithPath:llamaBundlePath];
+        NSString * libPath = [llamaBundle pathForResource:@"default" ofType:@"metallib"];
+        NSURL * libURL = [NSURL fileURLWithPath:libPath];

-            MTLCompileOptions* options = nil;
-#ifdef GGML_QKK_64
-            options = [MTLCompileOptions new];
-            options.preprocessorMacros = @{ @"QK_K" : @(64) };
-#endif
-            ctx->library = [ctx->device newLibraryWithSource:src options:options error:&error];
-        }
+        // Load the metallib file into a Metal library
+        ctx->library = [ctx->device newLibraryWithURL:libURL error:&error];

        if (error) {
            GGML_METAL_LOG_ERROR("%s: error: %s\n", __func__, [[error description] UTF8String]);
            return NULL;
        }
    }
+#else
+    UNUSED(msl_library_source);
+
+    // read the source from "ggml-metal.metal" into a string and use newLibraryWithSource
+    {
+        NSError * error = nil;
+
+        //NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
+        NSBundle * bundle = [NSBundle bundleForClass:[GGMLMetalClass class]];
+        NSString * path   = [bundle pathForResource:@"ggml-metal" ofType:@"metal"];
+        GGML_METAL_LOG_INFO("%s: loading '%s'\n", __func__, [path UTF8String]);
+
+        NSString * src  = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
+        if (error) {
+            GGML_METAL_LOG_ERROR("%s: error: %s\n", __func__, [[error description] UTF8String]);
+            return NULL;
+        }
+
+#ifdef GGML_QKK_64
+        MTLCompileOptions* options = [MTLCompileOptions new];
+        options.preprocessorMacros = @{ @"QK_K" : @(64) };
+        ctx->library = [ctx->device newLibraryWithSource:src options:options error:&error];
+#else
+        ctx->library = [ctx->device newLibraryWithSource:src options:nil error:&error];
+#endif
+        if (error) {
+            GGML_METAL_LOG_ERROR("%s: error: %s\n", __func__, [[error description] UTF8String]);
+            return NULL;
+        }
+    }
+#endif

    // load kernels
    {
@@ -262,57 +272,40 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {
        GGML_METAL_ADD_KERNEL(get_rows_q6_K);
        GGML_METAL_ADD_KERNEL(rms_norm);
        GGML_METAL_ADD_KERNEL(norm);
-        GGML_METAL_ADD_KERNEL(mul_mv_f32_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_f16_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_f16_f32_1row);
-        GGML_METAL_ADD_KERNEL(mul_mv_f16_f32_l4);
-        GGML_METAL_ADD_KERNEL(mul_mv_q4_0_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_q4_1_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_q8_0_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_q2_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_q3_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_q4_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_q5_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mv_q6_K_f32);
-        if ([ctx->device supportsFamily:MTLGPUFamilyApple7]) {
-            GGML_METAL_ADD_KERNEL(mul_mm_f32_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
-            GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);
-        }
+        GGML_METAL_ADD_KERNEL(mul_mat_f32_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_f16_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_f16_f32_1row);
+        GGML_METAL_ADD_KERNEL(mul_mat_f16_f32_l4);
+        GGML_METAL_ADD_KERNEL(mul_mat_q4_0_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q4_1_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q8_0_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q2_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q3_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q4_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q5_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q6_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_f32_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
+        GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);
        GGML_METAL_ADD_KERNEL(rope_f32);
        GGML_METAL_ADD_KERNEL(rope_f16);
        GGML_METAL_ADD_KERNEL(alibi_f32);
        GGML_METAL_ADD_KERNEL(cpy_f32_f16);
        GGML_METAL_ADD_KERNEL(cpy_f32_f32);
        GGML_METAL_ADD_KERNEL(cpy_f16_f16);
-        GGML_METAL_ADD_KERNEL(concat);
-        GGML_METAL_ADD_KERNEL(sqr);

 #undef GGML_METAL_ADD_KERNEL
    }

-#if TARGET_OS_OSX
-    // print MTL GPU family:
-    GGML_METAL_LOG_INFO("%s: GPU name:   %s\n", __func__, [[ctx->device name] UTF8String]);
-
-    // determine max supported GPU family
-    // https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
-    // https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
-    for (int i = MTLGPUFamilyApple1 + 20; i >= MTLGPUFamilyApple1; --i) {
-        if ([ctx->device supportsFamily:i]) {
-            GGML_METAL_LOG_INFO("%s: GPU family: MTLGPUFamilyApple%d (%d)\n", __func__, i - MTLGPUFamilyApple1 + 1, i);
-            break;
-        }
-    }
-
    GGML_METAL_LOG_INFO("%s: hasUnifiedMemory              = %s\n",       __func__, ctx->device.hasUnifiedMemory ? "true" : "false");
+#if TARGET_OS_OSX
    GGML_METAL_LOG_INFO("%s: recommendedMaxWorkingSetSize  = %8.2f MB\n", __func__, ctx->device.recommendedMaxWorkingSetSize / 1024.0 / 1024.0);
    if (ctx->device.maxTransferRate != 0) {
        GGML_METAL_LOG_INFO("%s: maxTransferRate               = %8.2f MB/s\n", __func__, ctx->device.maxTransferRate / 1024.0 / 1024.0);
@@ -354,38 +347,34 @@ void ggml_metal_free(struct ggml_metal_context * ctx) {
    GGML_METAL_DEL_KERNEL(get_rows_q6_K);
    GGML_METAL_DEL_KERNEL(rms_norm);
    GGML_METAL_DEL_KERNEL(norm);
-    GGML_METAL_DEL_KERNEL(mul_mv_f32_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_f16_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_f16_f32_1row);
-    GGML_METAL_DEL_KERNEL(mul_mv_f16_f32_l4);
-    GGML_METAL_DEL_KERNEL(mul_mv_q4_0_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_q4_1_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_q8_0_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_q2_K_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_q3_K_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_q4_K_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_q5_K_f32);
-    GGML_METAL_DEL_KERNEL(mul_mv_q6_K_f32);
-    if ([ctx->device supportsFamily:MTLGPUFamilyApple7]) {
-        GGML_METAL_DEL_KERNEL(mul_mm_f32_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_f16_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_q4_0_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_q8_0_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_q4_1_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_q2_K_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_q3_K_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_q4_K_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_q5_K_f32);
-        GGML_METAL_DEL_KERNEL(mul_mm_q6_K_f32);
-    }
+    GGML_METAL_DEL_KERNEL(mul_mat_f32_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_f16_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_f16_f32_1row);
+    GGML_METAL_DEL_KERNEL(mul_mat_f16_f32_l4);
+    GGML_METAL_DEL_KERNEL(mul_mat_q4_0_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_q4_1_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_q8_0_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_q2_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_q3_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_q4_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_q5_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mat_q6_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_f32_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_f16_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_q4_0_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_q8_0_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_q4_1_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_q2_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_q3_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_q4_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_q5_K_f32);
+    GGML_METAL_DEL_KERNEL(mul_mm_q6_K_f32);
    GGML_METAL_DEL_KERNEL(rope_f32);
    GGML_METAL_DEL_KERNEL(rope_f16);
    GGML_METAL_DEL_KERNEL(alibi_f32);
    GGML_METAL_DEL_KERNEL(cpy_f32_f16);
    GGML_METAL_DEL_KERNEL(cpy_f32_f32);
    GGML_METAL_DEL_KERNEL(cpy_f16_f16);
-    GGML_METAL_DEL_KERNEL(concat);
-    GGML_METAL_DEL_KERNEL(sqr);

 #undef GGML_METAL_DEL_KERNEL

@@ -442,7 +431,7 @@ static id<MTLBuffer> ggml_metal_get_buffer(struct ggml_metal_context * ctx, stru
    for (int i = 0; i < ctx->n_buffers; ++i) {
        const int64_t ioffs = (int64_t) t->data - (int64_t) ctx->buffers[i].data;

-        //GGML_METAL_LOG_INFO("ioffs = %10ld, tsize = %10ld, sum = %10ld, ctx->buffers[%d].size = %10ld, name = %s\n", ioffs, tsize, ioffs + tsize, i, ctx->buffers[i].size, ctx->buffers[i].name);
+        //metal_printf("ioffs = %10ld, tsize = %10ld, sum = %10ld, ctx->buffers[%d].size = %10ld, name = %s\n", ioffs, tsize, ioffs + tsize, i, ctx->buffers[i].size, ctx->buffers[i].name);
        if (ioffs >= 0 && ioffs + tsize <= (int64_t) ctx->buffers[i].size) {
            *offs = (size_t) ioffs;

@@ -777,44 +766,6 @@ void ggml_metal_graph_compute(
                        {
                            // noop
                        } break;
-                    case GGML_OP_CONCAT:
-                        {
-                            const int64_t nb = ne00;
-
-                            [encoder setComputePipelineState:ctx->pipeline_concat];
-                            [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
-                            [encoder setBuffer:id_src1 offset:offs_src1 atIndex:1];
-                            [encoder setBuffer:id_dst  offset:offs_dst  atIndex:2];
-                            [encoder setBytes:&ne00 length:sizeof(ne00) atIndex:3];
-                            [encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
-                            [encoder setBytes:&ne02 length:sizeof(ne02) atIndex:5];
-                            [encoder setBytes:&ne03 length:sizeof(ne03) atIndex:6];
-                            [encoder setBytes:&nb00 length:sizeof(nb00) atIndex:7];
-                            [encoder setBytes:&nb01 length:sizeof(nb01) atIndex:8];
-                            [encoder setBytes:&nb02 length:sizeof(nb02) atIndex:9];
-                            [encoder setBytes:&nb03 length:sizeof(nb03) atIndex:10];
-                            [encoder setBytes:&ne10 length:sizeof(ne10) atIndex:11];
-                            [encoder setBytes:&ne11 length:sizeof(ne11) atIndex:12];
-                            [encoder setBytes:&ne12 length:sizeof(ne12) atIndex:13];
-                            [encoder setBytes:&ne13 length:sizeof(ne13) atIndex:14];
-                            [encoder setBytes:&nb10 length:sizeof(nb10) atIndex:15];
-                            [encoder setBytes:&nb11 length:sizeof(nb11) atIndex:16];
-                            [encoder setBytes:&nb12 length:sizeof(nb12) atIndex:17];
-                            [encoder setBytes:&nb13 length:sizeof(nb13) atIndex:18];
-                            [encoder setBytes:&ne0  length:sizeof(ne0)  atIndex:19];
-                            [encoder setBytes:&ne1  length:sizeof(ne1)  atIndex:20];
-                            [encoder setBytes:&ne2  length:sizeof(ne2)  atIndex:21];
-                            [encoder setBytes:&ne3  length:sizeof(ne3)  atIndex:22];
-                            [encoder setBytes:&nb0  length:sizeof(nb0)  atIndex:23];
-                            [encoder setBytes:&nb1  length:sizeof(nb1)  atIndex:24];
-                            [encoder setBytes:&nb2  length:sizeof(nb2)  atIndex:25];
-                            [encoder setBytes:&nb3  length:sizeof(nb3)  atIndex:26];
-                            [encoder setBytes:&nb   length:sizeof(nb)   atIndex:27];
-
-                            const int nth = MIN(1024, ne0);
-
-                            [encoder dispatchThreadgroups:MTLSizeMake(ne1, ne2, ne3) threadsPerThreadgroup:MTLSizeMake(nth, 1, 1)];
-                        } break;
                    case GGML_OP_ADD:
                        {
                            GGML_ASSERT(ggml_is_contiguous(src0));
@@ -910,10 +861,9 @@ void ggml_metal_graph_compute(
                            [encoder setBuffer:id_dst  offset:offs_dst  atIndex:1];
                            [encoder setBytes:&scale length:sizeof(scale) atIndex:2];

-                            const int64_t n = ggml_nelements(dst);
-                            GGML_ASSERT(n % 4 == 0);
+                            const int64_t n = ggml_nelements(dst)/4;

-                            [encoder dispatchThreadgroups:MTLSizeMake(n/4, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
+                            [encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
                        } break;
                    case GGML_OP_UNARY:
                        switch (ggml_get_unary_op(gf->nodes[i])) {
@@ -923,10 +873,9 @@ void ggml_metal_graph_compute(
                                    [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
                                    [encoder setBuffer:id_dst  offset:offs_dst  atIndex:1];

-                                    const int64_t n = ggml_nelements(dst);
-                                    GGML_ASSERT(n % 4 == 0);
+                                    const int64_t n = ggml_nelements(dst)/4;

-                                    [encoder dispatchThreadgroups:MTLSizeMake(n/4, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
+                                    [encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
                                } break;
                            case GGML_UNARY_OP_RELU:
                                {
@@ -944,10 +893,9 @@ void ggml_metal_graph_compute(
                                    [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
                                    [encoder setBuffer:id_dst  offset:offs_dst  atIndex:1];

-                                    const int64_t n = ggml_nelements(dst);
-                                    GGML_ASSERT(n % 4 == 0);
+                                    const int64_t n = ggml_nelements(dst)/4;

-                                    [encoder dispatchThreadgroups:MTLSizeMake(n/4, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
+                                    [encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
                                } break;
                            default:
                                {
@@ -955,17 +903,6 @@ void ggml_metal_graph_compute(
                                    GGML_ASSERT(false);
                                }
                        } break;
-                    case GGML_OP_SQR:
-                        {
-                            GGML_ASSERT(ggml_is_contiguous(src0));
-
-                            [encoder setComputePipelineState:ctx->pipeline_sqr];
-                            [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
-                            [encoder setBuffer:id_dst  offset:offs_dst atIndex:1];
-
-                            const int64_t n = ggml_nelements(dst);
-                            [encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
-                        } break;
                    case GGML_OP_SOFT_MAX:
                        {
                            const int nth = MIN(32, ne00);
@@ -1007,46 +944,21 @@ void ggml_metal_graph_compute(
                        } break;
                    case GGML_OP_MUL_MAT:
                        {
+                            // TODO: needs to be updated after PR: https://github.com/ggerganov/ggml/pull/224
+
                            GGML_ASSERT(ne00 == ne10);
+                            // GGML_ASSERT(ne02 == ne12); // Should be checked on individual data types until broadcast is implemented everywhere
+                            uint gqa = ne12/ne02;
                            GGML_ASSERT(ne03 == ne13);

-                            const uint gqa = ne12/ne02;
-
-                            // find the break-even point where the matrix-matrix kernel becomes more efficient compared
-                            // to the matrix-vector kernel
-                            int ne11_mm_min = 1;
-
-#if 0
-                            // the numbers below are measured on M2 Ultra for 7B and 13B models
-                            // these numbers do not translate to other devices or model sizes
-                            // TODO: need to find a better approach
-                            if ([ctx->device.name isEqualToString:@"Apple M2 Ultra"]) {
-                                switch (src0t) {
-                                    case GGML_TYPE_F16:  ne11_mm_min = 2;  break;
-                                    case GGML_TYPE_Q8_0: ne11_mm_min = 7;  break;
-                                    case GGML_TYPE_Q2_K: ne11_mm_min = 15; break;
-                                    case GGML_TYPE_Q3_K: ne11_mm_min = 7;  break;
-                                    case GGML_TYPE_Q4_0:
-                                    case GGML_TYPE_Q4_1: ne11_mm_min = 15; break;
-                                    case GGML_TYPE_Q4_K: ne11_mm_min = 11; break;
-                                    case GGML_TYPE_Q5_0:                          // not tested yet
-                                    case GGML_TYPE_Q5_1: ne11_mm_min = 13; break; // not tested yet
-                                    case GGML_TYPE_Q5_K: ne11_mm_min = 7;  break;
-                                    case GGML_TYPE_Q6_K: ne11_mm_min = 7;  break;
-                                    default:             ne11_mm_min = 1;  break;
-                                }
-                            }
-#endif
-
                            // for now the matrix-matrix multiplication kernel only works on A14+/M1+ SoCs
                            // AMD GPU and older A-chips will reuse matrix-vector multiplication kernel
-                            if ([ctx->device supportsFamily:MTLGPUFamilyApple7] &&
-                                !ggml_is_transposed(src0) &&
+                            if (!ggml_is_transposed(src0) &&
                                !ggml_is_transposed(src1) &&
                                src1t == GGML_TYPE_F32 &&
-                                ne00 % 32 == 0 &&
-                                ne11 > ne11_mm_min) {
-                                //printf("matrix: ne00 = %6d, ne01 = %6d, ne02 = %6d, ne11 = %6d, ne12 = %6d\n", ne00, ne01, ne02, ne11, ne12);
+                                [ctx->device supportsFamily:MTLGPUFamilyApple7] &&
+                                ne00%32 == 0 &&
+                                ne11 > 2) {
                                switch (src0->type) {
                                    case GGML_TYPE_F32:  [encoder setComputePipelineState:ctx->pipeline_mul_mm_f32_f32];  break;
                                    case GGML_TYPE_F16:  [encoder setComputePipelineState:ctx->pipeline_mul_mm_f16_f32];  break;
@@ -1075,18 +987,17 @@ void ggml_metal_graph_compute(
                                [encoder setBytes:&ne1     length:sizeof(ne1)  atIndex:12];
                                [encoder setBytes:&gqa     length:sizeof(gqa)  atIndex:13];
                                [encoder setThreadgroupMemoryLength:8192 atIndex:0];
-                                [encoder dispatchThreadgroups:MTLSizeMake( (ne11 + 31)/32, (ne01 + 63)/64, ne12) threadsPerThreadgroup:MTLSizeMake(128, 1, 1)];
+                                [encoder dispatchThreadgroups:MTLSizeMake( (ne11+31)/32, (ne01+63) / 64, ne12) threadsPerThreadgroup:MTLSizeMake(128, 1, 1)];
                            } else {
                                int nth0 = 32;
                                int nth1 = 1;
                                int nrows = 1;
-                                //printf("vector: ne00 = %6d, ne01 = %6d, ne02 = %6d, ne11 = %6d, ne12 = %6d\n", ne00, ne01, ne02, ne11, ne12);

                                // use custom matrix x vector kernel
                                switch (src0t) {
                                    case GGML_TYPE_F32:
                                        {
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_f32_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_f32_f32];
                                            nrows = 4;
                                        } break;
                                    case GGML_TYPE_F16:
@@ -1094,12 +1005,12 @@ void ggml_metal_graph_compute(
                                            nth0 = 32;
                                            nth1 = 1;
                                            if (ne11 * ne12 < 4) {
-                                                [encoder setComputePipelineState:ctx->pipeline_mul_mv_f16_f32_1row];
+                                                [encoder setComputePipelineState:ctx->pipeline_mul_mat_f16_f32_1row];
                                            } else if (ne00 >= 128 && ne01 >= 8 && ne00%4 == 0) {
-                                                [encoder setComputePipelineState:ctx->pipeline_mul_mv_f16_f32_l4];
+                                                [encoder setComputePipelineState:ctx->pipeline_mul_mat_f16_f32_l4];
                                                nrows = ne11;
                                            } else {
-                                                [encoder setComputePipelineState:ctx->pipeline_mul_mv_f16_f32];
+                                                [encoder setComputePipelineState:ctx->pipeline_mul_mat_f16_f32];
                                                nrows = 4;
                                            }
                                        } break;
@@ -1110,7 +1021,7 @@ void ggml_metal_graph_compute(

                                            nth0 = 8;
                                            nth1 = 8;
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_q4_0_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_0_f32];
                                        } break;
                                    case GGML_TYPE_Q4_1:
                                        {
@@ -1119,7 +1030,7 @@ void ggml_metal_graph_compute(

                                            nth0 = 8;
                                            nth1 = 8;
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_q4_1_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_1_f32];
                                        } break;
                                    case GGML_TYPE_Q8_0:
                                        {
@@ -1128,7 +1039,7 @@ void ggml_metal_graph_compute(

                                            nth0 = 8;
                                            nth1 = 8;
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_q8_0_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q8_0_f32];
                                        } break;
                                    case GGML_TYPE_Q2_K:
                                        {
@@ -1137,7 +1048,7 @@ void ggml_metal_graph_compute(

                                            nth0 = 2;
                                            nth1 = 32;
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_q2_K_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q2_K_f32];
                                        } break;
                                    case GGML_TYPE_Q3_K:
                                        {
@@ -1146,7 +1057,7 @@ void ggml_metal_graph_compute(

                                            nth0 = 2;
                                            nth1 = 32;
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_q3_K_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q3_K_f32];
                                        } break;
                                    case GGML_TYPE_Q4_K:
                                        {
@@ -1155,7 +1066,7 @@ void ggml_metal_graph_compute(

                                            nth0 = 4; //1;
                                            nth1 = 8; //32;
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_q4_K_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_K_f32];
                                        } break;
                                    case GGML_TYPE_Q5_K:
                                        {
@@ -1164,7 +1075,7 @@ void ggml_metal_graph_compute(

                                            nth0 = 2;
                                            nth1 = 32;
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_q5_K_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q5_K_f32];
                                        } break;
                                    case GGML_TYPE_Q6_K:
                                        {
@@ -1173,7 +1084,7 @@ void ggml_metal_graph_compute(

                                            nth0 = 2;
                                            nth1 = 32;
-                                            [encoder setComputePipelineState:ctx->pipeline_mul_mv_q6_K_f32];
+                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q6_K_f32];
                                        } break;
                                    default:
                                        {
@@ -1202,7 +1113,7 @@ void ggml_metal_graph_compute(
                                [encoder setBytes:&gqa  length:sizeof(gqa)  atIndex:17];

                                if (src0t == GGML_TYPE_Q4_0 || src0t == GGML_TYPE_Q4_1 || src0t == GGML_TYPE_Q8_0 ||
-                                    src0t == GGML_TYPE_Q2_K) { // || src0t == GGML_TYPE_Q4_K) {
+                                    src0t == GGML_TYPE_Q2_K) {// || src0t == GGML_TYPE_Q4_K) {
                                    [encoder dispatchThreadgroups:MTLSizeMake((ne01 + 7)/8, ne11, ne12) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                                }
                                else if (src0t == GGML_TYPE_Q4_K) {
@@ -1255,8 +1166,6 @@ void ggml_metal_graph_compute(
                        } break;
                    case GGML_OP_RMS_NORM:
                        {
-                            GGML_ASSERT(ne00 % 4 == 0);
-
                            float eps;
                            memcpy(&eps, dst->op_params, sizeof(float));

--- a/ggml-metal.metal
+++ b/ggml-metal.metal
@@ -13,8 +13,8 @@ typedef struct {

 #define QK4_1 32
 typedef struct {
-    half d;                 // delta
-    half m;                 // min
+    half d;          // delta
+    half m;          // min
    uint8_t qs[QK4_1 / 2];  // nibbles / quants
 } block_q4_1;

@@ -132,13 +132,6 @@ kernel void kernel_relu(
    dst[tpig] = max(0.0f, src0[tpig]);
 }

-kernel void kernel_sqr(
-        device const float * src0,
-        device       float * dst,
-        uint tpig[[thread_position_in_grid]]) {
-    dst[tpig] = src0[tpig] * src0[tpig];
-}
-
 constant float GELU_COEF_A    = 0.044715f;
 constant float SQRT_2_OVER_PI = 0.79788456080286535587989211986876f;

@@ -345,11 +338,10 @@ kernel void kernel_rms_norm(
        uint sgitg[[simdgroup_index_in_threadgroup]],
        uint tiisg[[thread_index_in_simdgroup]],
        uint   ntg[[threads_per_threadgroup]]) {
-    device const float4 * x        = (device const float4 *) ((device const char *) src0 + tgpig*nb01);
-    device const float  * x_scalar = (device const float  *) x;
-
-    float4 sumf = 0;
-    float all_sum = 0;
+    device const float4 * x = (device const float4 *) ((device const char *) src0 + tgpig*nb01);
+    device const float * x_scalar = (device const float *) x;
+    float4 sumf=0;
+    float all_sum=0;

    // parallel sum
    for (int i00 = tpitg; i00 < ne00/4; i00 += ntg) {
@@ -362,7 +354,6 @@ kernel void kernel_rms_norm(
    }

    threadgroup_barrier(mem_flags::mem_threadgroup);
-
    // broadcast, simd group number is ntg / 32
    for (uint i = ntg / 32 / 2; i > 0; i /= 2) {
       if (tpitg < i) {
@@ -370,9 +361,7 @@ kernel void kernel_rms_norm(
       }
    }
    if (tpitg == 0) {
-        for (int i = 4 * (ne00 / 4); i < ne00; i++) {
-            sum[0] += x_scalar[i];
-        }
+        for (int i = 4 * (ne00 / 4); i < ne00; i++) {sum[0] += x_scalar[i];}
        sum[0] /= ne00;
    }

@@ -387,9 +376,7 @@ kernel void kernel_rms_norm(
        y[i00] = x[i00] * scale;
    }
    if (tpitg == 0) {
-        for (int i00 = 4 * (ne00 / 4); i00 < ne00; i00++) {
-            y_scalar[i00] = x_scalar[i00] * scale;
-        }
+        for (int i00 = 4 * (ne00 / 4); i00 < ne00; i00++) {y_scalar[i00] = x_scalar[i00] * scale;}
    }
 }

@@ -429,8 +416,8 @@ inline float block_q_n_dot_y(device const block_q4_1 * qb_curr, float sumy, thre
 }

 // putting them in the kernel cause a significant performance penalty
-#define N_DST 4        // each SIMD group works on 4 rows
-#define N_SIMDGROUP 2  // number of SIMD groups in a thread group
+#define N_DST 4 // each SIMD group works on 4 rows
+#define N_SIMDGROUP 2 // number of SIMD groups in a thread group
 #define N_SIMDWIDTH 32 // assuming SIMD group size is 32
 //Note: This is a template, but strictly speaking it only applies to
 //      quantizations where the block size is 32. It also does not
@@ -441,23 +428,18 @@ void mul_vec_q_n_f32(device const void * src0, device const float * src1, device
                    int64_t ne00, int64_t ne01, int64_t ne02, int64_t ne10, int64_t ne12, int64_t ne0, int64_t ne1, uint gqa,
                    uint3 tgpig, uint tiisg, uint sgitg) {
    const int nb = ne00/QK4_0;
-
    const int r0 = tgpig.x;
    const int r1 = tgpig.y;
    const int im = tgpig.z;
-
    const int first_row = (r0 * nsg + sgitg) * nr;
-
    const uint offset0 = first_row * nb + im/gqa*(nb*ne0);
-
    device const block_q_type * x = (device const block_q_type *) src0 + offset0;
    device const float        * y = (device const float        *) src1 + r1*ne10 + im*ne00*ne1;
+    float yl[16];       // src1 vector cache
+    float sumf[nr]={0.f};

-    float yl[16]; // src1 vector cache
-    float sumf[nr] = {0.f};
-
-    const int ix = (tiisg/2);
-    const int il = (tiisg%2)*8;
+    const int ix = tiisg/2;
+    const int il = 8*(tiisg%2);

    device const float * yb = y + ix * QK4_0 + il;

@@ -468,7 +450,6 @@ void mul_vec_q_n_f32(device const void * src0, device const float * src1, device
            sumy += yb[i] + yb[i+1];
            yl[i+0] = yb[i+ 0];
            yl[i+1] = yb[i+ 1]/256.f;
-
            sumy += yb[i+16] + yb[i+17];
            yl[i+8] = yb[i+16]/16.f;
            yl[i+9] = yb[i+17]/4096.f;
@@ -484,12 +465,12 @@ void mul_vec_q_n_f32(device const void * src0, device const float * src1, device
    for (int row = 0; row < nr; ++row) {
        const float tot = simd_sum(sumf[row]);
        if (tiisg == 0 && first_row + row < ne01) {
-            dst[im*ne0*ne1 + r1*ne0 + first_row + row] = tot;
+            dst[r1*ne0 + im*ne0*ne1 + first_row + row] = tot;
        }
    }
 }

-kernel void kernel_mul_mv_q4_0_f32(
+kernel void kernel_mul_mat_q4_0_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -502,12 +483,12 @@ kernel void kernel_mul_mv_q4_0_f32(
        constant   int64_t & ne1[[buffer(16)]],
        constant   uint    & gqa[[buffer(17)]],
        uint3 tgpig[[threadgroup_position_in_grid]],
-        uint  tiisg[[thread_index_in_simdgroup]],
-        uint  sgitg[[simdgroup_index_in_threadgroup]]) {
+        uint tiisg[[thread_index_in_simdgroup]],
+        uint sgitg[[simdgroup_index_in_threadgroup]]) {
    mul_vec_q_n_f32<block_q4_0, N_DST, N_SIMDGROUP, N_SIMDWIDTH>(src0,src1,dst,ne00,ne01,ne02,ne10,ne12,ne0,ne1,gqa,tgpig,tiisg,sgitg);
 }

-kernel void kernel_mul_mv_q4_1_f32(
+kernel void kernel_mul_mat_q4_1_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -527,7 +508,7 @@ kernel void kernel_mul_mv_q4_1_f32(

 #define NB_Q8_0 8

-kernel void kernel_mul_mv_q8_0_f32(
+kernel void kernel_mul_mat_q8_0_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -591,7 +572,7 @@ kernel void kernel_mul_mv_q8_0_f32(

 #define N_F32_F32 4

-kernel void kernel_mul_mv_f32_f32(
+kernel void kernel_mul_mat_f32_f32(
        device const  char * src0,
        device const  char * src1,
        device       float * dst,
@@ -662,7 +643,7 @@ kernel void kernel_mul_mv_f32_f32(
    }
 }

-kernel void kernel_mul_mv_f16_f32_1row(
+kernel void kernel_mul_mat_f16_f32_1row(
        device const  char * src0,
        device const  char * src1,
        device       float * dst,
@@ -681,7 +662,7 @@ kernel void kernel_mul_mv_f16_f32_1row(
        constant   int64_t & ne0,
        constant   int64_t & ne1,
        uint3 tgpig[[threadgroup_position_in_grid]],
-        uint  tiisg[[thread_index_in_simdgroup]]) {
+        uint tiisg[[thread_index_in_simdgroup]]) {

    const int64_t r0 = tgpig.x;
    const int64_t r1 = tgpig.y;
@@ -716,7 +697,7 @@ kernel void kernel_mul_mv_f16_f32_1row(

 #define N_F16_F32 4

-kernel void kernel_mul_mv_f16_f32(
+kernel void kernel_mul_mat_f16_f32(
        device const  char * src0,
        device const  char * src1,
        device       float * dst,
@@ -788,7 +769,7 @@ kernel void kernel_mul_mv_f16_f32(
 }

 // Assumes row size (ne00) is a multiple of 4
-kernel void kernel_mul_mv_f16_f32_l4(
+kernel void kernel_mul_mat_f16_f32_l4(
        device const  char * src0,
        device const  char * src1,
        device       float * dst,
@@ -1117,62 +1098,6 @@ kernel void kernel_cpy_f32_f32(
    }
 }

-kernel void kernel_concat(
-    device const char * src0,
-    device const char * src1,
-    device       char * dst,
-    constant   int64_t & ne00,
-    constant   int64_t & ne01,
-    constant   int64_t & ne02,
-    constant   int64_t & ne03,
-    constant  uint64_t & nb00,
-    constant  uint64_t & nb01,
-    constant  uint64_t & nb02,
-    constant  uint64_t & nb03,
-    constant   int64_t & ne10,
-    constant   int64_t & ne11,
-    constant   int64_t & ne12,
-    constant   int64_t & ne13,
-    constant  uint64_t & nb10,
-    constant  uint64_t & nb11,
-    constant  uint64_t & nb12,
-    constant  uint64_t & nb13,
-    constant   int64_t & ne0,
-    constant   int64_t & ne1,
-    constant   int64_t & ne2,
-    constant   int64_t & ne3,
-    constant  uint64_t & nb0,
-    constant  uint64_t & nb1,
-    constant  uint64_t & nb2,
-    constant  uint64_t & nb3,
-    uint3 tgpig[[threadgroup_position_in_grid]],
-    uint3 tpitg[[thread_position_in_threadgroup]],
-    uint3   ntg[[threads_per_threadgroup]]) {
-
-    const int64_t i03 = tgpig.z;
-    const int64_t i02 = tgpig.y;
-    const int64_t i01 = tgpig.x;
-
-    const int64_t i13 = i03 % ne13;
-    const int64_t i12 = i02 % ne12;
-    const int64_t i11 = i01 % ne11;
-
-    device const char * src0_ptr = src0 + i03 * nb03 + i02 * nb02 + i01 * nb01 + tpitg.x*nb00;
-    device const char * src1_ptr = src1 + i13*nb13 + i12*nb12 + i11*nb11 + tpitg.x*nb10;
-    device       char * dst_ptr  = dst  + i03*nb3  + i02*nb2  + i01*nb1  + tpitg.x*nb0;
-
-    for (int i0 = tpitg.x; i0 < ne0; i0 += ntg.x) {
-        if (i02 < ne02) {
-            ((device float *)dst_ptr)[0] = ((device float *)src0_ptr)[0];
-            src0_ptr += ntg.x*nb00;
-        } else {
-            ((device float *)dst_ptr)[0] = ((device float *)src1_ptr)[0];
-            src1_ptr += ntg.x*nb10;
-        }
-        dst_ptr += ntg.x*nb0;
-    }
-}
-
 //============================================ k-quants ======================================================

 #ifndef QK_K
@@ -1265,7 +1190,7 @@ static inline uchar4 get_scale_min_k4(int j, device const uint8_t * q) {

 //====================================== dot products =========================

-kernel void kernel_mul_mv_q2_K_f32(
+kernel void kernel_mul_mat_q2_K_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -1409,7 +1334,7 @@ kernel void kernel_mul_mv_q2_K_f32(
 }

 #if QK_K == 256
-kernel void kernel_mul_mv_q3_K_f32(
+kernel void kernel_mul_mat_q3_K_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -1561,7 +1486,7 @@ kernel void kernel_mul_mv_q3_K_f32(
    }
 }
 #else
-kernel void kernel_mul_mv_q3_K_f32(
+kernel void kernel_mul_mat_q3_K_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -1632,7 +1557,7 @@ kernel void kernel_mul_mv_q3_K_f32(
 #endif

 #if QK_K == 256
-kernel void kernel_mul_mv_q4_K_f32(
+kernel void kernel_mul_mat_q4_K_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -1738,7 +1663,7 @@ kernel void kernel_mul_mv_q4_K_f32(
    }
 }
 #else
-kernel void kernel_mul_mv_q4_K_f32(
+kernel void kernel_mul_mat_q4_K_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -1827,7 +1752,7 @@ kernel void kernel_mul_mv_q4_K_f32(
 }
 #endif

-kernel void kernel_mul_mv_q5_K_f32(
+kernel void kernel_mul_mat_q5_K_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -2000,7 +1925,7 @@ kernel void kernel_mul_mv_q5_K_f32(

 }

-kernel void kernel_mul_mv_q6_K_f32(
+kernel void kernel_mul_mat_q6_K_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
@@ -2338,7 +2263,7 @@ kernel void kernel_get_rows(
 }

 #define BLOCK_SIZE_M 64 // 8 simdgroup matrices from matrix A
-#define BLOCK_SIZE_N 32 // 4 simdgroup matrices from matrix B
+#define BLOCK_SIZE_N 32 // 4 simdgroup matrices from matrix A
 #define BLOCK_SIZE_K 32
 #define THREAD_MAT_M 4 // each thread take 4 simdgroup matrices from matrix A
 #define THREAD_MAT_N 2 // each thread take 2 simdgroup matrices from matrix B
@@ -2375,11 +2300,9 @@ kernel void kernel_mul_mm(device const  uchar * src0,
    const uint r0 = tgpig.y;
    const uint r1 = tgpig.x;
    const uint im = tgpig.z;
-
    // if this block is of 64x32 shape or smaller
    short n_rows = (ne0 - r0 * BLOCK_SIZE_M < BLOCK_SIZE_M) ? (ne0 - r0 * BLOCK_SIZE_M) : BLOCK_SIZE_M;
    short n_cols = (ne1 - r1 * BLOCK_SIZE_N < BLOCK_SIZE_N) ? (ne1 - r1 * BLOCK_SIZE_N) : BLOCK_SIZE_N;
-
    // a thread shouldn't load data outside of the matrix
    short thread_row = ((short)tiitg/THREAD_PER_ROW) < n_rows ? ((short)tiitg/THREAD_PER_ROW) : n_rows - 1;
    short thread_col = ((short)tiitg/THREAD_PER_COL) < n_cols ? ((short)tiitg/THREAD_PER_COL) : n_cols - 1;
@@ -2403,30 +2326,26 @@ kernel void kernel_mul_mm(device const  uchar * src0,
        + nb10 * (BLOCK_SIZE_K / THREAD_PER_COL * (tiitg % THREAD_PER_COL)));

    for (int loop_k = 0; loop_k < ne00; loop_k += BLOCK_SIZE_K) {
-        // load data and store to threadgroup memory
+        //load data and store to threadgroup memory
        half4x4 temp_a;
        dequantize_func(x, il, temp_a);
        threadgroup_barrier(mem_flags::mem_threadgroup);
-
        #pragma unroll(16)
        for (int i = 0; i < 16; i++) {
            *(sa + SG_MAT_SIZE * ((tiitg / THREAD_PER_ROW / 8) \
-            +                     (tiitg % THREAD_PER_ROW) * 16 + (i / 8) * 8) \
-            +                     (tiitg / THREAD_PER_ROW) % 8  + (i & 7) * 8) = temp_a[i/4][i%4];
+            + 16 * (tiitg % THREAD_PER_ROW) + 8 * (i / 8)) \
+            + (tiitg / THREAD_PER_ROW) % 8 + (i & 7) * 8) = temp_a[i/4][i%4];
        }
-
-        *(threadgroup float2x4 *)(sb + (tiitg % THREAD_PER_COL) * 8 * 32 + 8 * (tiitg / THREAD_PER_COL)) = *((device float2x4 *)y);
-
+        *(threadgroup float2x4 *)(sb + (tiitg % THREAD_PER_COL) * 8 * 32 + 8 * (tiitg / THREAD_PER_COL)) \
+                = *((device float2x4 *)y);
        il = (il + 2 < nl) ? il + 2 : il % 2;
        x  = (il < 2) ? x + (2+nl-1)/nl : x;
        y += BLOCK_SIZE_K;

        threadgroup_barrier(mem_flags::mem_threadgroup);
-
-        // load matrices from threadgroup memory and conduct outer products
+        //load matrices from threadgroup memory and conduct outer products
        threadgroup half  * lsma = (sa + THREAD_MAT_M * SG_MAT_SIZE * (sgitg % 2));
        threadgroup float * lsmb = (sb + THREAD_MAT_N * SG_MAT_SIZE * (sgitg / 2));
-
        #pragma unroll(4)
        for (int ik = 0; ik < BLOCK_SIZE_K / 8; ik++) {
            #pragma unroll(4)
@@ -2441,7 +2360,6 @@ kernel void kernel_mul_mm(device const  uchar * src0,

            lsma += BLOCK_SIZE_M / SG_MAT_ROW * SG_MAT_SIZE;
            lsmb += BLOCK_SIZE_N / SG_MAT_ROW * SG_MAT_SIZE;
-
            #pragma unroll(8)
            for (int i = 0; i < 8; i++){
                simdgroup_multiply_accumulate(c_res[i], mb[i/4], ma[i%4], c_res[i]);
@@ -2450,26 +2368,25 @@ kernel void kernel_mul_mm(device const  uchar * src0,
    }

    if ((r0 + 1) * BLOCK_SIZE_M <= ne0 && (r1 + 1) * BLOCK_SIZE_N <= ne1) {
-        device float * C = dst + (BLOCK_SIZE_M * r0 + 32 * (sgitg &  1)) \
-                               + (BLOCK_SIZE_N * r1 + 16 * (sgitg >> 1)) * ne0 + im*ne1*ne0;
+        device float *C = dst + BLOCK_SIZE_M * r0 + 32 * (sgitg&1) \
+                          + (BLOCK_SIZE_N * r1 + 16 * (sgitg>>1)) * ne0 + im*ne1*ne0;
        for (int i = 0; i < 8; i++) {
            simdgroup_store(c_res[i], C + 8 * (i%4) + 8 * ne0 * (i/4), ne0);
        }
    } else {
        // block is smaller than 64x32, we should avoid writing data outside of the matrix
        threadgroup_barrier(mem_flags::mem_threadgroup);
-        threadgroup float * temp_str = ((threadgroup float *)shared_memory) \
+        threadgroup float *temp_str = ((threadgroup float *)shared_memory) \
                                      + 32 * (sgitg&1) + (16 * (sgitg>>1)) * BLOCK_SIZE_M;
        for (int i = 0; i < 8; i++) {
            simdgroup_store(c_res[i], temp_str + 8 * (i%4) + 8 * BLOCK_SIZE_M * (i/4), BLOCK_SIZE_M);
        }

        threadgroup_barrier(mem_flags::mem_threadgroup);
-
-        device float * C = dst + (BLOCK_SIZE_M * r0) + (BLOCK_SIZE_N * r1) * ne0 + im*ne1*ne0;
-        if (sgitg == 0) {
+        device float *C = dst + BLOCK_SIZE_M * r0 + (BLOCK_SIZE_N * r1) * ne0 + im*ne1*ne0;
+        if (sgitg==0) {
            for (int i = 0; i < n_rows; i++) {
-                for (int j = tiitg; j < n_cols; j += BLOCK_SIZE_N) {
+                for (int j = tiitg; j< n_cols; j += BLOCK_SIZE_N) {
                    *(C + i + j * ne0) = *(temp_str + i + j * BLOCK_SIZE_M);
                }
            }
--- a/ggml.c
+++ b/ggml.c
@@ -11256,7 +11256,7 @@ static void ggml_compute_forward_silu_f32(

 #ifndef NDEBUG
        for (int k = 0; k < nc; k++) {
-            const float x = ((float *) ((char *) dst->data + i1*(dst->nb[1])))[k];
+            const float x = ((float *) ((char *) dst->data + i1*( dst->nb[1])))[k];
            UNUSED(x);
            assert(!isnan(x));
            assert(!isinf(x));
@@ -13089,17 +13089,17 @@ static void ggml_compute_forward_alibi_f32(

    assert(n_past >= 0);

-    const int64_t ne0 = src0->ne[0]; // all_seq_len = n_past + ne1
-    const int64_t ne1 = src0->ne[1]; // seq_len_without_past
-    const int64_t ne2 = src0->ne[2]; // n_head -> this is k
-    //const int64_t ne3 = src0->ne[3]; // 1 -> bsz
+    const int ne0 = src0->ne[0]; // all_seq_len = n_past + ne1
+    const int ne1 = src0->ne[1]; // seq_len_without_past
+    const int ne2 = src0->ne[2]; // n_head -> this is k
+    //const int ne3 = src0->ne[3]; // 1 -> bsz

-    const int64_t n  = ggml_nrows(src0);
-    const int64_t ne2_ne3 = n/ne1; // ne2*ne3
+    const int n  = ggml_nrows(src0);
+    const int ne2_ne3 = n/ne1; // ne2*ne3

-    const size_t nb0 = src0->nb[0];
-    const size_t nb1 = src0->nb[1];
-    const size_t nb2 = src0->nb[2];
+    const int nb0 = src0->nb[0];
+    const int nb1 = src0->nb[1];
+    const int nb2 = src0->nb[2];
    //const int nb3 = src0->nb[3];

    GGML_ASSERT(nb0 == sizeof(float));
@@ -13111,9 +13111,9 @@ static void ggml_compute_forward_alibi_f32(
    const float m0 = powf(2.0f, -(max_bias) / n_heads_log2_floor);
    const float m1 = powf(2.0f, -(max_bias / 2.0f) / n_heads_log2_floor);

-    for (int64_t i = 0; i < ne0; i++) {
-        for (int64_t j = 0; j < ne1; j++) {
-            for (int64_t k = 0; k < ne2_ne3; k++) {
+    for (int i = 0; i < ne0; i++) {
+        for (int j = 0; j < ne1; j++) {
+            for (int k = 0; k < ne2_ne3; k++) {
                float * const src = (float *)((char *) src0->data + i*nb0 + j*nb1 + k*nb2);
                float *      pdst = (float *)((char *)  dst->data + i*nb0 + j*nb1 + k*nb2);

@@ -13128,6 +13128,7 @@ static void ggml_compute_forward_alibi_f32(
                }

                pdst[0] = i * m_k + src[0];
+
            }
        }
    }
--- a/gguf-py/README.md
+++ b/gguf-py/README.md
@@ -69,3 +69,4 @@ python -m twine upload dist/*
 ## TODO
 - [ ] Add tests
 - [ ] Include conversion scripts as command line entry points in this package.
+- Add CI workflow for releasing the package.
--- a/gguf-py/gguf/gguf.py
+++ b/gguf-py/gguf/gguf.py
@@ -85,7 +85,6 @@ class MODEL_ARCH(IntEnum):
    GPTNEOX       : int = auto()
    MPT           : int = auto()
    STARCODER     : int = auto()
-    PERSIMMON     : int = auto()
    REFACT        : int = auto()
    BERT          : int = auto()

@@ -109,8 +108,6 @@ class MODEL_TENSOR(IntEnum):
    FFN_DOWN     : int = auto()
    FFN_UP       : int = auto()
    FFN_NORM     : int = auto()
-    ATTN_Q_NORM  : int = auto()
-    ATTN_K_NORM  : int = auto()


 MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
@@ -122,7 +119,6 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
    MODEL_ARCH.GPTNEOX:        "gptneox",
    MODEL_ARCH.MPT:            "mpt",
    MODEL_ARCH.STARCODER:      "starcoder",
-    MODEL_ARCH.PERSIMMON:      "persimmon",
    MODEL_ARCH.REFACT:         "refact",
    MODEL_ARCH.BERT:           "bert",
 }
@@ -134,6 +130,7 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
    MODEL_TENSOR.OUTPUT_NORM:   "output_norm",
    MODEL_TENSOR.OUTPUT:        "output",
    MODEL_TENSOR.ROPE_FREQS:    "rope_freqs",
+
    MODEL_TENSOR.ATTN_NORM:     "blk.{bid}.attn_norm",
    MODEL_TENSOR.ATTN_NORM_2:   "blk.{bid}.attn_norm_2",
    MODEL_TENSOR.ATTN_QKV:      "blk.{bid}.attn_qkv",
@@ -142,8 +139,6 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
    MODEL_TENSOR.ATTN_V:        "blk.{bid}.attn_v",
    MODEL_TENSOR.ATTN_OUT:      "blk.{bid}.attn_output",
    MODEL_TENSOR.ATTN_ROT_EMBD: "blk.{bid}.attn_rot_embd",
-    MODEL_TENSOR.ATTN_Q_NORM:   "blk.{bid}.attn_q_norm",
-    MODEL_TENSOR.ATTN_K_NORM:   "blk.{bid}.attn_k_norm",
    MODEL_TENSOR.FFN_NORM:      "blk.{bid}.ffn_norm",
    MODEL_TENSOR.FFN_GATE:      "blk.{bid}.ffn_gate",
    MODEL_TENSOR.FFN_DOWN:      "blk.{bid}.ffn_down",
@@ -254,20 +249,6 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
        MODEL_TENSOR.FFN_DOWN,
        MODEL_TENSOR.FFN_UP,
    ],
-    MODEL_ARCH.PERSIMMON: [
-        MODEL_TENSOR.TOKEN_EMBD,
-        MODEL_TENSOR.OUTPUT,
-        MODEL_TENSOR.OUTPUT_NORM,
-        MODEL_TENSOR.ATTN_NORM,
-        MODEL_TENSOR.ATTN_QKV,
-        MODEL_TENSOR.ATTN_OUT,
-        MODEL_TENSOR.FFN_NORM,
-        MODEL_TENSOR.FFN_DOWN,
-        MODEL_TENSOR.FFN_UP,
-        MODEL_TENSOR.ATTN_Q_NORM,
-        MODEL_TENSOR.ATTN_K_NORM,
-        MODEL_TENSOR.ATTN_ROT_EMBD,
-    ],
    MODEL_ARCH.REFACT: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
@@ -298,9 +279,6 @@ MODEL_TENSOR_SKIP: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
        MODEL_TENSOR.ROPE_FREQS,
        MODEL_TENSOR.ATTN_ROT_EMBD,
    ],
-    MODEL_ARCH.PERSIMMON: [
-        MODEL_TENSOR.ROPE_FREQS,
-    ]
 }


@@ -308,13 +286,12 @@ class TensorNameMap:
    mappings_cfg: dict[MODEL_TENSOR, tuple[str, ...]] = {
        # Token embeddings
        MODEL_TENSOR.TOKEN_EMBD: (
-            "gpt_neox.embed_in",                        # gptneox
-            "transformer.wte",                          # gpt2 gpt-j mpt refact
-            "transformer.word_embeddings",              # falcon
-            "model.embed_tokens",                       # llama-hf
-            "tok_embeddings",                           # llama-pth
-            "embeddings.word_embeddings",               # bert
-            "language_model.embedding.word_embeddings", # persimmon
+            "gpt_neox.embed_in",            # gptneox
+            "transformer.wte",              # gpt2 gpt-j mpt refact
+            "transformer.word_embeddings",  # falcon
+            "model.embed_tokens",           # llama-hf
+            "tok_embeddings",               # llama-pth
+            "embeddings.word_embeddings",   # bert
        ),

        # Token type embeddings
@@ -330,22 +307,20 @@ class TensorNameMap:

        # Output
        MODEL_TENSOR.OUTPUT: (
-            "embed_out",                # gptneox
-            "lm_head",                  # gpt2 mpt falcon llama-hf baichuan
-            "output",                   # llama-pth
-            "word_embeddings_for_head", # persimmon
+            "embed_out",  # gptneox
+            "lm_head",    # gpt2 gpt-j mpt falcon llama-hf baichuan
+            "output",     # llama-pth
        ),

        # Output norm
        MODEL_TENSOR.OUTPUT_NORM: (
-            "gpt_neox.final_layer_norm",              # gptneox
-            "transformer.ln_f",                       # gpt2 gpt-j falcon
-            "model.norm",                             # llama-hf baichuan
-            "norm",                                   # llama-pth
-            "embeddings.LayerNorm",                   # bert
-            "transformer.norm_f",                     # mpt
-            "ln_f",                                   # refact
-            "language_model.encoder.final_layernorm", # persimmon
+            "gpt_neox.final_layer_norm",  # gptneox
+            "transformer.ln_f",           # gpt2 gpt-j falcon
+            "model.norm",                 # llama-hf baichuan
+            "norm",                       # llama-pth
+            "embeddings.LayerNorm",       # bert
+            "transformer.norm_f",         # mpt
+            "ln_f",                       # refact
        ),

        # Rope frequencies
@@ -357,15 +332,14 @@ class TensorNameMap:
    block_mappings_cfg: dict[MODEL_TENSOR, tuple[str, ...]] = {
        # Attention norm
        MODEL_TENSOR.ATTN_NORM: (
-            "gpt_neox.layers.{bid}.input_layernorm",               # gptneox
-            "transformer.h.{bid}.ln_1",                            # gpt2 gpt-j refact
-            "transformer.blocks.{bid}.norm_1",                     # mpt
-            "transformer.h.{bid}.input_layernorm",                 # falcon7b
-            "transformer.h.{bid}.ln_mlp",                          # falcon40b
-            "model.layers.{bid}.input_layernorm",                  # llama-hf
-            "layers.{bid}.attention_norm",                         # llama-pth
-            "encoder.layer.{bid}.attention.output.LayerNorm",      # bert
-            "language_model.encoder.layers.{bid}.input_layernorm", # persimmon
+            "gpt_neox.layers.{bid}.input_layernorm",           # gptneox
+            "transformer.h.{bid}.ln_1",                        # gpt2 gpt-j refact
+            "transformer.blocks.{bid}.norm_1",                 # mpt
+            "transformer.h.{bid}.input_layernorm",             # falcon7b
+            "transformer.h.{bid}.ln_mlp",                      # falcon40b
+            "model.layers.{bid}.input_layernorm",              # llama-hf
+            "layers.{bid}.attention_norm",                     # llama-pth
+            "encoder.layer.{bid}.attention.output.LayerNorm",  # bert
        ),

        # Attention norm 2
@@ -375,11 +349,10 @@ class TensorNameMap:

        # Attention query-key-value
        MODEL_TENSOR.ATTN_QKV: (
-            "gpt_neox.layers.{bid}.attention.query_key_value",                    # gptneox
-            "transformer.h.{bid}.attn.c_attn",                                    # gpt2
-            "transformer.blocks.{bid}.attn.Wqkv",                                 # mpt
-            "transformer.h.{bid}.self_attention.query_key_value",                 # falcon
-            "language_model.encoder.layers.{bid}.self_attention.query_key_value", # persimmon
+            "gpt_neox.layers.{bid}.attention.query_key_value",     # gptneox
+            "transformer.h.{bid}.attn.c_attn",                     # gpt2
+            "transformer.blocks.{bid}.attn.Wqkv",                  # mpt
+            "transformer.h.{bid}.self_attention.query_key_value",  # falcon
        ),

        # Attention query
@@ -408,15 +381,14 @@ class TensorNameMap:

        # Attention output
        MODEL_TENSOR.ATTN_OUT: (
-            "gpt_neox.layers.{bid}.attention.dense",                   # gptneox
-            "transformer.h.{bid}.attn.c_proj",                         # gpt2 refact
-            "transformer.blocks.{bid}.attn.out_proj",                  # mpt
-            "transformer.h.{bid}.self_attention.dense",                # falcon
-            "model.layers.{bid}.self_attn.o_proj",                     # llama-hf
-            "layers.{bid}.attention.wo",                               # llama-pth
-            "encoder.layer.{bid}.attention.output.dense",              # bert
-            "transformer.h.{bid}.attn.out_proj",                       # gpt-j
-            "language_model.encoder.layers.{bid}.self_attention.dense" # persimmon
+            "gpt_neox.layers.{bid}.attention.dense",       # gptneox
+            "transformer.h.{bid}.attn.c_proj",             # gpt2 refact
+            "transformer.blocks.{bid}.attn.out_proj",      # mpt
+            "transformer.h.{bid}.self_attention.dense",    # falcon
+            "model.layers.{bid}.self_attn.o_proj",         # llama-hf
+            "layers.{bid}.attention.wo",                   # llama-pth
+            "encoder.layer.{bid}.attention.output.dense",  # bert
+            "transformer.h.{bid}.attn.out_proj",           # gpt-j
        ),

        # Rotary embeddings
@@ -427,26 +399,24 @@ class TensorNameMap:

        # Feed-forward norm
        MODEL_TENSOR.FFN_NORM: (
-            "gpt_neox.layers.{bid}.post_attention_layernorm",               # gptneox
-            "transformer.h.{bid}.ln_2",                                     # gpt2 refact
-            "transformer.blocks.{bid}.norm_2",                              # mpt
-            "model.layers.{bid}.post_attention_layernorm",                  # llama-hf
-            "layers.{bid}.ffn_norm",                                        # llama-pth
-            "encoder.layer.{bid}.output.LayerNorm",                         # bert
-            "language_model.encoder.layers.{bid}.post_attention_layernorm", # persimmon
+            "gpt_neox.layers.{bid}.post_attention_layernorm",  # gptneox
+            "transformer.h.{bid}.ln_2",                        # gpt2 refact
+            "transformer.blocks.{bid}.norm_2",                 # mpt
+            "model.layers.{bid}.post_attention_layernorm",     # llama-hf
+            "layers.{bid}.ffn_norm",                           # llama-pth
+            "encoder.layer.{bid}.output.LayerNorm",            # bert
        ),

        # Feed-forward up
        MODEL_TENSOR.FFN_UP: (
-            "gpt_neox.layers.{bid}.mlp.dense_h_to_4h",               # gptneox
-            "transformer.h.{bid}.mlp.c_fc",                          # gpt2
-            "transformer.blocks.{bid}.ffn.up_proj",                  # mpt
-            "transformer.h.{bid}.mlp.dense_h_to_4h",                 # falcon
-            "model.layers.{bid}.mlp.up_proj",                        # llama-hf refact
-            "layers.{bid}.feed_forward.w3",                          # llama-pth
-            "encoder.layer.{bid}.intermediate.dense",                # bert
-            "transformer.h.{bid}.mlp.fc_in",                         # gpt-j
-            "language_model.encoder.layers.{bid}.mlp.dense_h_to_4h", # persimmon
+            "gpt_neox.layers.{bid}.mlp.dense_h_to_4h",  # gptneox
+            "transformer.h.{bid}.mlp.c_fc",             # gpt2
+            "transformer.blocks.{bid}.ffn.up_proj",     # mpt
+            "transformer.h.{bid}.mlp.dense_h_to_4h",    # falcon
+            "model.layers.{bid}.mlp.up_proj",           # llama-hf refact
+            "layers.{bid}.feed_forward.w3",             # llama-pth
+            "encoder.layer.{bid}.intermediate.dense",   # bert
+            "transformer.h.{bid}.mlp.fc_in",            # gpt-j
        ),

        # Feed-forward gate
@@ -457,28 +427,15 @@ class TensorNameMap:

        # Feed-forward down
        MODEL_TENSOR.FFN_DOWN: (
-            "gpt_neox.layers.{bid}.mlp.dense_4h_to_h",               # gptneox
-            "transformer.h.{bid}.mlp.c_proj",                        # gpt2 refact
-            "transformer.blocks.{bid}.ffn.down_proj",                # mpt
-            "transformer.h.{bid}.mlp.dense_4h_to_h",                 # falcon
-            "model.layers.{bid}.mlp.down_proj",                      # llama-hf
-            "layers.{bid}.feed_forward.w2",                          # llama-pth
-            "encoder.layer.{bid}.output.dense",                      # bert
-            "transformer.h.{bid}.mlp.fc_out",                        # gpt-j
-            "language_model.encoder.layers.{bid}.mlp.dense_4h_to_h", # persimmon
+            "gpt_neox.layers.{bid}.mlp.dense_4h_to_h",  # gptneox
+            "transformer.h.{bid}.mlp.c_proj",           # gpt2 refact
+            "transformer.blocks.{bid}.ffn.down_proj",   # mpt
+            "transformer.h.{bid}.mlp.dense_4h_to_h",    # falcon
+            "model.layers.{bid}.mlp.down_proj",         # llama-hf
+            "layers.{bid}.feed_forward.w2",             # llama-pth
+            "encoder.layer.{bid}.output.dense",         # bert
+            "transformer.h.{bid}.mlp.fc_out",           # gpt-j
        ),
-
-        MODEL_TENSOR.ATTN_Q_NORM: (
-            "language_model.encoder.layers.{bid}.self_attention.q_layernorm",
-        ),
-
-        MODEL_TENSOR.ATTN_K_NORM: (
-            "language_model.encoder.layers.{bid}.self_attention.k_layernorm",
-        ),
-
-        MODEL_TENSOR.ROPE_FREQS: (
-            "language_model.encoder.layers.{bid}.self_attention.rotary_emb.inv_freq", # persimmon
-        )
    }

    mapping: dict[str, tuple[MODEL_TENSOR, str]]
--- a/gguf-py/pyproject.toml
+++ b/gguf-py/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "gguf"
-version = "0.4.4"
+version = "0.4.0"
 description = "Write ML models in GGUF for GGML"
 authors = ["GGML <ggml@ggml.ai>"]
 packages = [
--- a/llama.cpp
+++ b/llama.cpp
@@ -125,27 +125,6 @@ static void replace_all(std::string & s, const std::string & search, const std::
    }
    s = std::move(result);
 }
-
-static bool is_float_close(float a, float b, float abs_tol) {
-    // Check for non-negative tolerance
-    if (abs_tol < 0.0) {
-        throw std::invalid_argument("Tolerance must be non-negative");
-    }
-
-    // Exact equality check
-    if (a == b) {
-        return true;
-    }
-
-    // Check for infinities
-    if (std::isinf(a) || std::isinf(b)) {
-        return false;
-    }
-
-    // Regular comparison using the provided absolute tolerance
-    return std::fabs(b - a) <= abs_tol;
-}
-
 #ifdef GGML_USE_CPU_HBM
 #include <hbwmalloc.h>
 #endif
@@ -186,7 +165,6 @@ enum llm_arch {
    LLM_ARCH_GPTNEOX,
    LLM_ARCH_MPT,
    LLM_ARCH_STARCODER,
-    LLM_ARCH_PERSIMMON,
    LLM_ARCH_REFACT,
    LLM_ARCH_UNKNOWN,
 };
@@ -200,7 +178,6 @@ static std::map<llm_arch, std::string> LLM_ARCH_NAMES = {
    { LLM_ARCH_MPT,             "mpt"       },
    { LLM_ARCH_BAICHUAN,        "baichuan"  },
    { LLM_ARCH_STARCODER,       "starcoder" },
-    { LLM_ARCH_PERSIMMON,       "persimmon" },
    { LLM_ARCH_REFACT,          "refact" },
 };

@@ -320,8 +297,6 @@ enum llm_tensor {
    LLM_TENSOR_FFN_DOWN,
    LLM_TENSOR_FFN_UP,
    LLM_TENSOR_FFN_NORM,
-    LLM_TENSOR_ATTN_Q_NORM,
-    LLM_TENSOR_ATTN_K_NORM,
 };

 static std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES = {
@@ -403,23 +378,6 @@ static std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES =
            { LLM_TENSOR_FFN_UP,          "blk.%d.ffn_up" },
        },
    },
-    {
-        LLM_ARCH_PERSIMMON,
-        {
-            { LLM_TENSOR_TOKEN_EMBD,      "token_embd"},
-            { LLM_TENSOR_OUTPUT_NORM,     "output_norm"},
-            { LLM_TENSOR_OUTPUT,          "output"},
-            { LLM_TENSOR_ATTN_NORM,       "blk.%d.attn_norm"},
-            { LLM_TENSOR_ATTN_QKV,        "blk.%d.attn_qkv"},
-            { LLM_TENSOR_ATTN_OUT,        "blk.%d.attn_output"},
-            { LLM_TENSOR_ATTN_Q_NORM,     "blk.%d.attn_q_norm"},
-            { LLM_TENSOR_ATTN_K_NORM,     "blk.%d.attn_k_norm"},
-            { LLM_TENSOR_FFN_NORM,        "blk.%d.ffn_norm"},
-            { LLM_TENSOR_FFN_DOWN,        "blk.%d.ffn_down"},
-            { LLM_TENSOR_FFN_UP,          "blk.%d.ffn_up"},
-            { LLM_TENSOR_ATTN_ROT_EMBD,   "blk.%d.attn_rot_embd"},
-        },
-    },
    {
        LLM_ARCH_MPT,
        {
@@ -980,7 +938,6 @@ enum e_model {
    MODEL_1B,
    MODEL_3B,
    MODEL_7B,
-    MODEL_8B,
    MODEL_13B,
    MODEL_15B,
    MODEL_30B,
@@ -1012,24 +969,7 @@ struct llama_hparams {
    float rope_freq_scale_train;

    bool operator!=(const llama_hparams & other) const {
-        if (this->vocab_only != other.vocab_only) return true;
-        if (this->n_vocab != other.n_vocab) return true;
-        if (this->n_ctx_train != other.n_ctx_train) return true;
-        if (this->n_embd != other.n_embd) return true;
-        if (this->n_head != other.n_head) return true;
-        if (this->n_head_kv != other.n_head_kv) return true;
-        if (this->n_layer != other.n_layer) return true;
-        if (this->n_rot != other.n_rot) return true;
-        if (this->n_ff != other.n_ff) return true;
-
-        const float EPSILON = 1e-9;
-
-        if (!is_float_close(this->f_norm_eps, other.f_norm_eps, EPSILON)) return true;
-        if (!is_float_close(this->f_norm_rms_eps, other.f_norm_rms_eps, EPSILON)) return true;
-        if (!is_float_close(this->rope_freq_base_train, other.rope_freq_base_train, EPSILON)) return true;
-        if (!is_float_close(this->rope_freq_scale_train, other.rope_freq_scale_train, EPSILON)) return true;
-
-        return false;
+        return static_cast<bool>(memcmp(this, &other, sizeof(llama_hparams))); // NOLINT
    }

    uint32_t n_gqa() const {
@@ -1063,10 +1003,6 @@ struct llama_layer {
    struct ggml_tensor * attn_norm_b;
    struct ggml_tensor * attn_norm_2;
    struct ggml_tensor * attn_norm_2_b;
-    struct ggml_tensor * attn_q_norm;
-    struct ggml_tensor * attn_q_norm_b;
-    struct ggml_tensor * attn_k_norm;
-    struct ggml_tensor * attn_k_norm_b;

    // attention
    struct ggml_tensor * wq;
@@ -1108,9 +1044,6 @@ struct llama_kv_cell {
 struct llama_kv_cache {
    bool has_shift = false;

-    // Note: The value of head isn't only used to optimize searching
-    // for a free KV slot. llama_decode_internal also uses it, so it
-    // cannot be freely changed after a slot has been allocated.
    uint32_t head = 0;
    uint32_t size = 0;

@@ -1325,11 +1258,7 @@ static bool llama_kv_cache_init(
    cache.cells.clear();
    cache.cells.resize(n_ctx);

-    // TODO: this should be:
-    //       cache.buf.resize(2u*n_elements*ggml_type_size(wtype) + 2u*ggml_tensor_overhead());
-    //       change it and test that it works
    cache.buf.resize(2u*n_elements*ggml_type_size(wtype) + 2u*MB);
-    memset(cache.buf.data, 0, cache.buf.size);

    struct ggml_init_params params;
    params.mem_size   = cache.buf.size;
@@ -1372,8 +1301,6 @@ static bool llama_kv_cache_init(

 // find an empty slot of size "n_tokens" in the cache
 // updates the cache head
-// Note: On success, it's important that cache.head points
-// to the first cell of the slot.
 static bool llama_kv_cache_find_slot(
           struct llama_kv_cache & cache,
        const struct llama_batch & batch) {
@@ -1389,8 +1316,8 @@ static bool llama_kv_cache_find_slot(

    while (true) {
        if (cache.head + n_tokens > n_ctx) {
-            n_tested += n_ctx - cache.head;
            cache.head = 0;
+            n_tested   += n_ctx - cache.head;
            continue;
        }

@@ -1441,9 +1368,6 @@ static void llama_kv_cache_tokens_rm(struct llama_kv_cache & cache, int32_t c0,
        cache.cells[i].pos = -1;
        cache.cells[i].seq_id.clear();
    }
-
-    // Searching for a free slot can start here since we know it will be empty.
-    cache.head = uint32_t(c0);
 }

 static void llama_kv_cache_seq_rm(
@@ -1451,8 +1375,6 @@ static void llama_kv_cache_seq_rm(
                 llama_seq_id   seq_id,
                    llama_pos   p0,
                    llama_pos   p1) {
-    uint32_t new_head = cache.size;
-
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<llama_pos>::max();

@@ -1461,13 +1383,9 @@ static void llama_kv_cache_seq_rm(
            cache.cells[i].seq_id.erase(seq_id);
            if (cache.cells[i].seq_id.empty()) {
                cache.cells[i].pos = -1;
-                if (new_head == cache.size) new_head = i;
            }
        }
    }
-
-    // If we freed up a slot, set head to it so searching can start there.
-    if (new_head != cache.size) cache.head = new_head;
 }

 static void llama_kv_cache_seq_cp(
@@ -1479,8 +1397,6 @@ static void llama_kv_cache_seq_cp(
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<llama_pos>::max();

-    cache.head = 0;
-
    for (uint32_t i = 0; i < cache.size; ++i) {
        if (cache.cells[i].has_seq_id(seq_id_src) && cache.cells[i].pos >= p0 && cache.cells[i].pos < p1) {
            cache.cells[i].seq_id.insert(seq_id_dst);
@@ -1489,18 +1405,12 @@ static void llama_kv_cache_seq_cp(
 }

 static void llama_kv_cache_seq_keep(struct llama_kv_cache & cache, llama_seq_id seq_id) {
-    uint32_t new_head = cache.size;
-
    for (uint32_t i = 0; i < cache.size; ++i) {
        if (!cache.cells[i].has_seq_id(seq_id)) {
            cache.cells[i].pos = -1;
            cache.cells[i].seq_id.clear();
-            if (new_head == cache.size) new_head = i;
        }
    }
-
-    // If we freed up a slot, set head to it so searching can start there.
-    if (new_head != cache.size) cache.head = new_head;
 }

 static void llama_kv_cache_seq_shift(
@@ -1509,8 +1419,6 @@ static void llama_kv_cache_seq_shift(
                    llama_pos   p0,
                    llama_pos   p1,
                    llama_pos   delta) {
-    uint32_t new_head = cache.size;
-
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<llama_pos>::max();

@@ -1520,17 +1428,12 @@ static void llama_kv_cache_seq_shift(
            if (cache.cells[i].pos < 0) {
                cache.cells[i].pos = -1;
                cache.cells[i].seq_id.clear();
-                if (new_head == cache.size) new_head = i;
            } else {
                cache.has_shift = true;
                cache.cells[i].delta = delta;
            }
        }
    }
-
-    // If we freed up a slot, set head to it so searching can start there.
-    // Otherwise we just start the next search from the beginning.
-    cache.head = new_head != cache.size ? new_head : 0;
 }

 //
@@ -1931,7 +1834,6 @@ static const char * llama_model_type_name(e_model type) {
        case MODEL_1B:  return "1B";
        case MODEL_3B:  return "3B";
        case MODEL_7B:  return "7B";
-        case MODEL_8B:  return "8B";
        case MODEL_13B: return "13B";
        case MODEL_15B: return "15B";
        case MODEL_30B: return "30B";
@@ -2044,14 +1946,6 @@ static void llm_load_hparams(
                    default: model.type = e_model::MODEL_UNKNOWN;
                }
            } break;
-        case LLM_ARCH_PERSIMMON:
-        {
-            GGUF_GET_KEY(ctx, hparams.f_norm_eps, gguf_get_val_f32, GGUF_TYPE_FLOAT32, true, kv(LLM_KV_ATTENTION_LAYERNORM_EPS));
-            switch (hparams.n_layer) {
-                case 36: model.type = e_model::MODEL_8B; break;
-                default: model.type = e_model::MODEL_UNKNOWN;
-            }
-        } break;
        case LLM_ARCH_REFACT:
            {
                GGUF_GET_KEY(ctx, hparams.f_norm_rms_eps, gguf_get_val_f32, GGUF_TYPE_FLOAT32, true, kv(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS));
@@ -2588,67 +2482,6 @@ static void llm_load_tensors(
                        }
                    }
                } break;
-            case LLM_ARCH_PERSIMMON:
-                {
-                    model.tok_embeddings = ml.create_tensor(ctx, tn(LLM_TENSOR_TOKEN_EMBD, "weight"),  {n_embd, n_vocab}, GGML_BACKEND_CPU);
-
-                    {
-                        ggml_backend backend_norm;
-                        ggml_backend backend_output;
-
-                        if (n_gpu_layers > int(n_layer)) {
-                            // norm is not performance relevant on its own but keeping it in VRAM reduces data copying
-                            // on Windows however this is detrimental unless everything is on the GPU
-#ifndef _WIN32
-                            backend_norm = LLAMA_BACKEND_OFFLOAD;
-#else
-                            backend_norm = n_gpu_layers <= (int) n_layer + 2 ? GGML_BACKEND_CPU : LLAMA_BACKEND_OFFLOAD;
-#endif // _WIN32
-
-                            backend_output = LLAMA_BACKEND_OFFLOAD_SPLIT;
-                        } else {
-                            backend_norm   = GGML_BACKEND_CPU;
-                            backend_output = GGML_BACKEND_CPU;
-                        }
-
-                        model.output_norm    = ml.create_tensor(ctx, tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd},          backend_norm);
-                        model.output_norm_b  = ml.create_tensor(ctx, tn(LLM_TENSOR_OUTPUT_NORM, "bias"),   {n_embd},          backend_norm);
-                        model.output         = ml.create_tensor(ctx, tn(LLM_TENSOR_OUTPUT,      "weight"), {n_embd, n_vocab}, backend_output);
-
-                        if (backend_norm == GGML_BACKEND_GPU) {
-                            vram_weights += ggml_nbytes(model.output_norm);
-                            vram_weights += ggml_nbytes(model.output_norm_b);
-                        }
-                        if (backend_output == GGML_BACKEND_GPU_SPLIT) {
-                            vram_weights += ggml_nbytes(model.output);
-                        }
-                    }
-
-                    const uint32_t n_ff = hparams.n_ff;
-                    const int i_gpu_start = n_layer - n_gpu_layers;
-                    model.layers.resize(n_layer);
-                    for (uint32_t i = 0; i < n_layer; ++i) {
-                        const ggml_backend backend = int(i) < i_gpu_start ? GGML_BACKEND_CPU : LLAMA_BACKEND_OFFLOAD;
-                        const ggml_backend backend_split = int(i) < i_gpu_start ? GGML_BACKEND_CPU : LLAMA_BACKEND_OFFLOAD_SPLIT;
-                        auto & layer = model.layers[i];
-                        layer.attn_norm   = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd}, backend);
-                        layer.attn_norm_b = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_NORM, "bias", i),   {n_embd}, backend);
-                        layer.wqkv = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_QKV, "weight", i), {n_embd, n_embd + 2*n_embd_gqa}, backend_split);
-                        layer.bqkv = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_QKV, "bias", i),   {n_embd + 2*n_embd_gqa},         backend_split);
-                        layer.wo   = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_OUT, "weight", i), {n_embd, n_embd},   backend_split);
-                        layer.bo   = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_OUT, "bias", i),   {n_embd},           backend_split);
-                        layer.w2 = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {n_ff, n_embd}, backend_split);
-                        layer.b2 = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_DOWN, "bias", i),   {n_embd},       backend_split);
-                        layer.w3 = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_UP,   "weight", i), {n_embd,   n_ff}, backend_split);
-                        layer.b3 = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_UP,   "bias", i),   {n_ff},           backend_split);
-                        layer.ffn_norm   = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, backend);
-                        layer.ffn_norm_b = ml.create_tensor(ctx, tn(LLM_TENSOR_FFN_NORM, "bias", i),   {n_embd}, backend);
-                        layer.attn_q_norm   = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_Q_NORM, "weight", i), {64}, backend);
-                        layer.attn_q_norm_b = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_Q_NORM, "bias", i),   {64}, backend);
-                        layer.attn_k_norm   = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_K_NORM, "weight", i), {64}, backend);
-                        layer.attn_k_norm_b = ml.create_tensor(ctx, tn(LLM_TENSOR_ATTN_K_NORM, "bias", i),   {64}, backend);
-                    }
-                } break;
            default:
                throw std::runtime_error("unknown architecture");
        }
@@ -2758,8 +2591,8 @@ static bool llama_model_load(
 }

 static struct ggml_cgraph * llm_build_llama(
-    llama_context & lctx,
-    const llama_batch & batch) {
+         llama_context & lctx,
+     const llama_batch & batch) {
    const auto & model   = lctx.model;
    const auto & hparams = model.hparams;
    const auto & cparams = lctx.cparams;
@@ -2797,9 +2630,11 @@ static struct ggml_cgraph * llm_build_llama(
    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_compute.size,
        /*.mem_buffer =*/ buf_compute.data,
-        /*.no_alloc   =*/ true,
+        /*.no_alloc   =*/ false,
    };

+    params.no_alloc = true;
+
    struct ggml_context * ctx0 = ggml_init(params);

    ggml_cgraph * gf = ggml_new_graph(ctx0);
@@ -3183,9 +3018,11 @@ static struct ggml_cgraph * llm_build_baichaun(
    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_compute.size,
        /*.mem_buffer =*/ buf_compute.data,
-        /*.no_alloc   =*/ true,
+        /*.no_alloc   =*/ false,
    };

+    params.no_alloc = true;
+
    struct ggml_context * ctx0 = ggml_init(params);

    ggml_cgraph * gf = ggml_new_graph(ctx0);
@@ -3582,9 +3419,11 @@ static struct ggml_cgraph * llm_build_refact(
    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_compute.size,
        /*.mem_buffer =*/ buf_compute.data,
-        /*.no_alloc   =*/ true,
+        /*.no_alloc   =*/ false,
    };

+    params.no_alloc = true;
+
    struct ggml_context * ctx0 = ggml_init(params);

    ggml_cgraph * gf = ggml_new_graph(ctx0);
@@ -3934,9 +3773,11 @@ static struct ggml_cgraph * llm_build_falcon(
    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_compute.size,
        /*.mem_buffer =*/ buf_compute.data,
-        /*.no_alloc   =*/ true,
+        /*.no_alloc   =*/ false,
    };

+    params.no_alloc = true;
+
    struct ggml_context * ctx0 = ggml_init(params);

    ggml_cgraph * gf = ggml_new_graph(ctx0);
@@ -4292,9 +4133,11 @@ static struct ggml_cgraph * llm_build_starcoder(
    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_compute.size,
        /*.mem_buffer =*/ buf_compute.data,
-        /*.no_alloc   =*/ true,
+        /*.no_alloc   =*/ false,
    };

+    params.no_alloc = true;
+
    struct ggml_context * ctx0 = ggml_init(params);

    ggml_cgraph * gf = ggml_new_graph(ctx0);
@@ -4505,404 +4348,6 @@ static struct ggml_cgraph * llm_build_starcoder(
    return gf;
 }

-
-static struct ggml_cgraph * llm_build_persimmon(
-         llama_context & lctx,
-     const llama_batch & batch) {
-    const auto & model = lctx.model;
-    const auto & hparams = model.hparams;
-
-    const auto & kv_self = lctx.kv_self;
-
-    GGML_ASSERT(!!kv_self.ctx);
-
-    const auto & cparams = lctx.cparams;
-    const int64_t n_embd      = hparams.n_embd;
-    const int64_t n_layer     = hparams.n_layer;
-    const int64_t n_ctx       = cparams.n_ctx;
-    const int64_t n_head_kv   = hparams.n_head_kv;
-    const int64_t n_head      = hparams.n_head;
-    const int64_t n_embd_head = hparams.n_embd_head();
-    const int64_t n_embd_gqa  = hparams.n_embd_gqa();
-    const size_t n_rot        = n_embd_head / 2;
-
-    const float freq_base  = cparams.rope_freq_base;
-    const float freq_scale = cparams.rope_freq_scale;
-    const float norm_eps = hparams.f_norm_eps;
-
-    const int n_gpu_layers = model.n_gpu_layers;
-
-
-    const int32_t n_tokens    = batch.n_tokens;
-    const int32_t n_kv        = ggml_allocr_is_measure(lctx.alloc) ? n_ctx            : kv_self.n;
-    const int32_t kv_head     = ggml_allocr_is_measure(lctx.alloc) ? n_ctx - n_tokens : kv_self.head;
-
-    const bool do_rope_shift  = ggml_allocr_is_measure(lctx.alloc) || kv_self.has_shift;
-
-    auto & buf_compute = lctx.buf_compute;
-    struct ggml_init_params params = {
-        /*.mem_size   =*/ buf_compute.size,
-        /*.mem_buffer =*/ buf_compute.data,
-        /*.no_alloc   =*/ true,
-    };
-
-    struct ggml_context * ctx0 = ggml_init(params);
-
-    ggml_cgraph * gf = ggml_new_graph(ctx0);
-
-    struct ggml_tensor * cur;
-    struct ggml_tensor * inpL;
-
-    if (batch.token) {
-        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
-
-        ggml_allocr_alloc(lctx.alloc, inp_tokens);
-        if (!ggml_allocr_is_measure(lctx.alloc)) {
-            memcpy(inp_tokens->data, batch.token, n_tokens*ggml_element_size(inp_tokens));
-        }
-        ggml_set_name(inp_tokens, "inp_tokens");
-        inpL = ggml_get_rows(ctx0, model.tok_embeddings, inp_tokens);
-    } else {
-        inpL = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
-        ggml_allocr_alloc(lctx.alloc, inpL);
-        if (!ggml_allocr_is_measure(lctx.alloc)) {
-            memcpy(inpL->data, batch.embd, n_tokens * n_embd * ggml_element_size(inpL));
-        }
-    }
-    const int i_gpu_start = n_layer - n_gpu_layers;
-    (void) i_gpu_start;
-    offload_func_t offload_func_nr = llama_nop; // nr = non-repeating
-    offload_func_t offload_func_kq = llama_nop;
-    offload_func_t offload_func_v  = llama_nop;
-    // KQ_scale
-    struct ggml_tensor * KQ_scale = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, 1);
-    ggml_allocr_alloc(lctx.alloc, KQ_scale);
-    if (!ggml_allocr_is_measure(lctx.alloc)) {
-        ggml_set_f32(KQ_scale, 1.0f/sqrtf(float(n_embd_head)));
-    }
-    ggml_set_name(KQ_scale, "1/sqrt(n_embd_head)");
-    struct ggml_tensor * KQ_mask = ggml_new_tensor_3d(ctx0, GGML_TYPE_F32, n_kv, n_tokens, 1);
-    offload_func_kq(KQ_mask);
-    ggml_set_name(KQ_mask, "KQ_mask");
-    ggml_allocr_alloc(lctx.alloc, KQ_mask);
-
-    if (!ggml_allocr_is_measure(lctx.alloc)) {
-        float * data = (float *) KQ_mask->data;
-        memset(data, 0, ggml_nbytes(KQ_mask));
-        for (int h = 0; h < 1; ++h) {
-            for (int j = 0; j < n_tokens; ++j) {
-                const llama_pos    pos    = batch.pos[j];
-                const llama_seq_id seq_id = batch.seq_id[j];
-                for (int i = 0; i < n_kv; ++i) {
-                    if (!kv_self.cells[i].has_seq_id(seq_id) || kv_self.cells[i].pos > pos) {
-                        data[h*(n_kv*n_tokens) + j*n_kv + i] = -INFINITY;
-                    }
-                }
-            }
-        }
-    }
-
-    struct ggml_tensor * KQ_pos = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
-    offload_func_kq(KQ_pos);
-    ggml_set_name(KQ_pos, "KQ_pos");
-    ggml_allocr_alloc(lctx.alloc, KQ_pos);
-    if (!ggml_allocr_is_measure(lctx.alloc)) {
-        int * data = (int *) KQ_pos->data;
-        for (int i = 0; i < n_tokens; ++i) {
-            data[i] = batch.pos[i];
-        }
-    }
-    if (do_rope_shift) {
-        struct ggml_tensor * K_shift = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_ctx);
-        offload_func_kq(K_shift);
-        ggml_set_name(K_shift, "K_shift");
-        ggml_allocr_alloc(lctx.alloc, K_shift);
-        if (!ggml_allocr_is_measure(lctx.alloc)) {
-            int * data = (int *) K_shift->data;
-            for (int i = 0; i < n_ctx; ++i) {
-                data[i] = kv_self.cells[i].delta;
-            }
-        }
-        for (int il = 0; il < n_layer; ++il) {
-            struct ggml_tensor * tmp =
-                    // we rotate only the first n_rot dimensions.
-                    ggml_rope_custom_inplace(ctx0,
-                        ggml_view_3d(ctx0, kv_self.k,
-                            n_rot, n_head, n_ctx,
-                            ggml_element_size(kv_self.k)*n_embd_gqa,
-                            ggml_element_size(kv_self.k)*n_embd_head,
-                            ggml_element_size(kv_self.k)*(n_embd_head*n_ctx*il)
-                        ),
-                        K_shift, n_rot, 2, 0, freq_base, freq_scale);
-            offload_func_kq(tmp);
-            ggml_build_forward_expand(gf, tmp);
-        }
-    }
-    for (int il=0; il < n_layer; ++il) {
-        struct ggml_tensor * residual = inpL;
-        offload_func_t offload_func = llama_nop;
-        {
-            cur = ggml_norm(ctx0, inpL, norm_eps);
-            offload_func(cur);
-            cur = ggml_mul(ctx0, cur, model.layers[il].attn_norm);
-            offload_func(cur);
-            cur = ggml_add(ctx0, cur, model.layers[il].attn_norm_b);
-            offload_func(cur);
-            ggml_format_name(cur, "input_layernorm_%d", il);
-        }
-        // self attention
-        {
-            cur = ggml_mul_mat(ctx0, model.layers[il].wqkv, cur);
-            offload_func_kq(cur);
-            cur = ggml_add(ctx0, cur, model.layers[il].bqkv);
-            offload_func_kq(cur);
-
-            // split qkv
-            GGML_ASSERT(n_head_kv == n_head);
-            ggml_set_name(cur, format("qkv_%d", il).c_str());
-            struct ggml_tensor * tmpqkv = ggml_reshape_4d(ctx0, cur, n_embd_head, 3, n_head, n_tokens);
-            offload_func_kq(tmpqkv);
-            struct ggml_tensor * tmpqkv_perm = ggml_cont(ctx0, ggml_permute(ctx0, tmpqkv, 0, 3, 1, 2));
-            offload_func_kq(tmpqkv_perm);
-            ggml_format_name(tmpqkv_perm, "tmpqkv_perm_%d", il);
-            struct ggml_tensor * tmpq = ggml_view_3d(
-                    ctx0, tmpqkv_perm, n_embd_head, n_head, n_tokens,
-                    ggml_element_size(tmpqkv_perm) * n_embd_head,
-                    ggml_element_size(tmpqkv_perm) * n_embd_head * n_head,
-                    0
-                );
-            offload_func_kq(tmpq);
-            struct ggml_tensor * tmpk = ggml_view_3d(
-                    ctx0, tmpqkv_perm, n_embd_head, n_head, n_tokens,
-                    ggml_element_size(tmpqkv_perm) * n_embd_head,
-                    ggml_element_size(tmpqkv_perm) * n_embd_head * n_head,
-                    ggml_element_size(tmpqkv_perm) * n_embd_head * n_head * n_tokens
-                );
-            offload_func_kq(tmpk);
-            // Q/K Layernorm
-            tmpq = ggml_norm(ctx0, tmpq, norm_eps);
-            offload_func_kq(tmpq);
-            tmpq =  ggml_mul(ctx0, tmpq, model.layers[il].attn_q_norm);
-            offload_func_kq(tmpq);
-            tmpq =  ggml_add(ctx0, tmpq, model.layers[il].attn_q_norm_b);
-            offload_func_kq(tmpq);
-
-            tmpk = ggml_norm(ctx0, tmpk, norm_eps);
-            offload_func_v(tmpk);
-            tmpk =  ggml_mul(ctx0, tmpk, model.layers[il].attn_k_norm);
-            offload_func_v(tmpk);
-            tmpk =  ggml_add(ctx0, tmpk, model.layers[il].attn_k_norm_b);
-            offload_func_v(tmpk);
-
-            // RoPE the first n_rot of q/k, pass the other half, and concat.
-            struct ggml_tensor * qrot = ggml_view_3d(
-                ctx0, tmpq, n_rot, n_head, n_tokens,
-                ggml_element_size(tmpq) * n_embd_head,
-                ggml_element_size(tmpq) * n_embd_head * n_head,
-                0
-            );
-            offload_func_kq(qrot);
-            ggml_format_name(qrot, "qrot_%d", il);
-            struct ggml_tensor * krot = ggml_view_3d(
-                ctx0, tmpk, n_rot, n_head, n_tokens,
-                ggml_element_size(tmpk) * n_embd_head,
-                ggml_element_size(tmpk) * n_embd_head * n_head,
-                0
-            );
-            offload_func_kq(krot);
-            ggml_format_name(krot, "krot_%d", il);
-
-            // get the second half of tmpq, e.g tmpq[n_rot:, :, :]
-            struct ggml_tensor * qpass = ggml_view_3d(
-                ctx0, tmpq, n_rot, n_head, n_tokens,
-                ggml_element_size(tmpq) * n_embd_head,
-                ggml_element_size(tmpq) * n_embd_head * n_head,
-                ggml_element_size(tmpq) * n_rot
-            );
-            offload_func_kq(qpass);
-            ggml_format_name(qpass, "qpass_%d", il);
-            struct ggml_tensor * kpass = ggml_view_3d(
-                ctx0, tmpk, n_rot, n_head, n_tokens,
-                ggml_element_size(tmpk) * n_embd_head,
-                ggml_element_size(tmpk) * n_embd_head * n_head,
-                ggml_element_size(tmpk) * n_rot
-            );
-            offload_func_kq(kpass);
-            ggml_format_name(kpass, "kpass_%d", il);
-
-            struct ggml_tensor * qrotated =  ggml_rope_custom(
-                    ctx0, qrot, KQ_pos, n_rot, 2, 0, freq_base, freq_scale
-            );
-            offload_func_kq(qrotated);
-            struct ggml_tensor * krotated = ggml_rope_custom(
-                    ctx0, krot, KQ_pos, n_rot, 2, 0, freq_base, freq_scale
-            );
-            offload_func_kq(krotated);
-            // ggml currently only supports concatenation on dim=2
-            // so we need to permute qrot, qpass, concat, then permute back.
-            qrotated = ggml_cont(ctx0, ggml_permute(ctx0, qrotated, 2, 1, 0, 3));
-            offload_func_kq(qrotated);
-            krotated = ggml_cont(ctx0, ggml_permute(ctx0, krotated, 2, 1, 0, 3));
-            offload_func_kq(krotated);
-
-            qpass = ggml_cont(ctx0, ggml_permute(ctx0, qpass, 2, 1, 0, 3));
-            offload_func_kq(qpass);
-            kpass = ggml_cont(ctx0, ggml_permute(ctx0, kpass, 2, 1, 0, 3));
-            offload_func_kq(kpass);
-
-            struct ggml_tensor * Qcur = ggml_concat(ctx0, qrotated, qpass);
-            offload_func_kq(Qcur);
-            struct ggml_tensor * Kcur = ggml_concat(ctx0, krotated, kpass);
-            offload_func_kq(Kcur);
-
-            struct ggml_tensor * Q = ggml_cont(ctx0, ggml_permute(ctx0, Qcur, 1, 2, 0, 3));
-            offload_func_kq(Q);
-
-            Kcur = ggml_cont(ctx0, ggml_permute(ctx0, Kcur, 2, 1, 0, 3));
-            offload_func_kq(Kcur);
-            {
-                struct ggml_tensor * tmpv = ggml_view_3d(
-                        ctx0, tmpqkv_perm, n_embd_head, n_head, n_tokens,
-                        ggml_element_size(tmpqkv_perm) * n_embd_head,
-                        ggml_element_size(tmpqkv_perm) * n_embd_head * n_head,
-                        ggml_element_size(tmpqkv_perm) * n_embd_head * n_head * n_tokens * 2
-                    );
-                offload_func_v(tmpv);
-                // store K, V in cache
-                struct ggml_tensor * Vcur = ggml_transpose(ctx0, ggml_reshape_2d(ctx0, tmpv, n_embd_gqa, n_tokens));
-                offload_func_v(Vcur);
-                ggml_set_name(Vcur, "Vcur");
-
-                struct ggml_tensor * k = ggml_view_1d(
-                    ctx0, kv_self.k, n_tokens*n_embd_gqa,
-                    (ggml_element_size(kv_self.k)*n_embd_gqa)*(il*n_ctx + kv_head)
-                );
-                offload_func_kq(k);
-                ggml_set_name(k, "k");
-
-                struct ggml_tensor * v = ggml_view_2d(ctx0, kv_self.v, n_tokens, n_embd_gqa,
-                        (   n_ctx)*ggml_element_size(kv_self.v),
-                        (il*n_ctx)*ggml_element_size(kv_self.v)*n_embd_gqa + kv_head*ggml_element_size(kv_self.v));
-                offload_func_v(v);
-                ggml_set_name(v, "v");
-
-                // important: storing RoPE-ed version of K in the KV cache!
-                ggml_build_forward_expand(gf, ggml_cpy(ctx0, Kcur, k));
-                ggml_build_forward_expand(gf, ggml_cpy(ctx0, Vcur, v));
-            }
-            struct ggml_tensor * K = ggml_view_3d(ctx0, kv_self.k,
-                    n_embd_head, n_kv, n_head_kv,
-                    ggml_element_size(kv_self.k)*n_embd_gqa,
-                    ggml_element_size(kv_self.k)*n_embd_head,
-                    ggml_element_size(kv_self.k)*n_embd_gqa*n_ctx*il);
-
-            offload_func_kq(K);
-            ggml_format_name(K, "K_%d", il);
-
-            struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);
-            offload_func_kq(KQ);
-            ggml_set_name(KQ, "KQ");
-
-            struct ggml_tensor * KQ_scaled = ggml_scale(ctx0, KQ, KQ_scale);
-            offload_func_kq(KQ_scaled);
-            ggml_set_name(KQ_scaled, "KQ_scaled");
-
-            struct ggml_tensor * KQ_masked = ggml_add(ctx0, KQ_scaled, KQ_mask);
-            offload_func_kq(KQ_masked);
-            ggml_set_name(KQ_masked, "KQ_masked");
-
-            struct ggml_tensor * KQ_soft_max = ggml_soft_max_inplace(ctx0, KQ_masked);
-            offload_func_kq(KQ_soft_max);
-            ggml_set_name(KQ_soft_max, "KQ_soft_max");
-
-            struct ggml_tensor * V =
-                ggml_view_3d(ctx0, kv_self.v,
-                        n_kv, n_embd_head, n_head_kv,
-                        ggml_element_size(kv_self.v)*n_ctx,
-                        ggml_element_size(kv_self.v)*n_ctx*n_embd_head,
-                        ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il);
-            offload_func_v(V);
-            ggml_set_name(V, "V");
-
-            struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ_soft_max);
-            offload_func_v(KQV);
-            ggml_set_name(KQV, "KQV");
-
-            struct ggml_tensor * KQV_merged = ggml_permute(ctx0, KQV, 0, 2, 1, 3);
-            offload_func_v(KQV_merged);
-            ggml_set_name(KQV_merged, "KQV_merged");
-
-            cur = ggml_cont_2d(ctx0, KQV_merged, n_embd, n_tokens);
-            offload_func_v(cur);
-            ggml_set_name(cur, "KQV_merged_contiguous");
-
-            cur = ggml_mul_mat(ctx0, model.layers[il].wo, cur);
-            offload_func(cur);
-            cur = ggml_add(ctx0, cur, model.layers[il].bo);
-            offload_func(cur);
-            ggml_set_name(cur, "result_wo");
-        }
-
-        struct ggml_tensor * inpFF = ggml_add(ctx0, residual, cur);
-        offload_func(inpFF);
-        ggml_set_name(inpFF, "inpFF");
-        {
-            // MLP
-            {
-                // Norm
-                cur = ggml_norm(ctx0, inpFF, norm_eps);
-                offload_func(cur);
-                cur = ggml_add(ctx0,
-                    ggml_mul(ctx0, cur, model.layers[il].ffn_norm),
-                    model.layers[il].ffn_norm_b
-                );
-                ggml_set_name(cur, "ffn_norm");
-                offload_func(cur);
-            }
-            cur = ggml_mul_mat(ctx0, model.layers[il].w3, cur);
-            offload_func(cur);
-
-            cur = ggml_add(ctx0, cur, model.layers[il].b3);
-            offload_func(cur);
-            ggml_set_name(cur, "result_ffn_up");
-
-            cur = ggml_sqr(ctx0, ggml_relu(ctx0, cur));
-            ggml_set_name(cur, "result_ffn_act");
-            offload_func(cur);
-            offload_func(cur->src[0]);
-
-            cur = ggml_mul_mat(ctx0, model.layers[il].w2, cur);
-            offload_func(cur);
-            cur = ggml_add(ctx0,
-                cur,
-                model.layers[il].b2);
-            offload_func(cur);
-            ggml_set_name(cur, "outFF");
-        }
-        cur = ggml_add(ctx0, cur, inpFF);
-        offload_func(cur);
-        ggml_set_name(cur, "inpFF_+_outFF");
-        inpL = cur;
-    }
-    cur = inpL;
-    {
-        cur = ggml_norm(ctx0, cur, norm_eps);
-        offload_func_nr(cur);
-        cur = ggml_mul(ctx0, cur, model.output_norm);
-        offload_func_nr(cur);
-
-        cur = ggml_add(ctx0, cur, model.output_norm_b);
-        // offload_func_nr(cur);
-
-        ggml_set_name(cur, "result_norm");
-    }
-    cur = ggml_mul_mat(ctx0, model.output, cur);
-    ggml_set_name(cur, "result_output");
-    ggml_build_forward_expand(gf, cur);
-    ggml_free(ctx0);
-    return gf;
-}
-
 static struct ggml_cgraph * llama_build_graph(
         llama_context & lctx,
     const llama_batch & batch) {
@@ -4927,10 +4372,6 @@ static struct ggml_cgraph * llama_build_graph(
            {
                result = llm_build_starcoder(lctx, batch);
            } break;
-        case LLM_ARCH_PERSIMMON:
-            {
-                result = llm_build_persimmon(lctx, batch);
-            } break;
        case LLM_ARCH_REFACT:
            {
                result = llm_build_refact(lctx, batch);
@@ -5013,6 +4454,10 @@ static int llama_decode_internal(
        batch.seq_id = seq_id.data();
    }

+    // we always start to search for a free slot from the start of the cache
+    // TODO: better strategies can be implemented
+    kv_self.head = 0;
+
    if (!llama_kv_cache_find_slot(kv_self, batch)) {
        return 1;
    }
@@ -5098,12 +4543,8 @@ static int llama_decode_internal(
 #endif

    // update the kv ring buffer
-    lctx.kv_self.has_shift  = false;
    lctx.kv_self.head      += n_tokens;
-    // Ensure kv cache head points to a valid index.
-    if (lctx.kv_self.head >= lctx.kv_self.size) {
-        lctx.kv_self.head = 0;
-    }
+    lctx.kv_self.has_shift  = false;

 #ifdef GGML_PERF
    // print timing information per ggml operation (for debugging purposes)
@@ -7198,7 +6639,6 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
    }

    std::ofstream fout(fname_out, std::ios::binary);
-    fout.exceptions(std::ofstream::failbit); // fail fast on write errors

    const size_t meta_size = gguf_get_meta_size(ctx_out);

@@ -8693,9 +8133,7 @@ int llama_token_to_piece(const struct llama_model * model, llama_token token, ch
                buf[0] = llama_token_to_byte(model->vocab, token);
                return 1;
            } else {
-                // TODO: for now we accept all unsupported token types,
-                // suppressing them like CONTROL tokens.
-                // GGML_ASSERT(false);
+                GGML_ASSERT(false);
            }
            break;
        }
@@ -8711,9 +8149,7 @@ int llama_token_to_piece(const struct llama_model * model, llama_token token, ch
            } else if (llama_is_control_token(model->vocab, token)) {
                ;
            } else {
-                // TODO: for now we accept all unsupported token types,
-                // suppressing them like CONTROL tokens.
-                // GGML_ASSERT(false);
+                GGML_ASSERT(false);
            }
            break;
        }
@@ -8745,14 +8181,14 @@ void llama_print_timings(struct llama_context * ctx) {
    const llama_timings timings = llama_get_timings(ctx);

    LLAMA_LOG_INFO("\n");
-    LLAMA_LOG_INFO("%s:        load time = %10.2f ms\n", __func__, timings.t_load_ms);
-    LLAMA_LOG_INFO("%s:      sample time = %10.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)\n",
+    LLAMA_LOG_INFO("%s:        load time = %8.2f ms\n", __func__, timings.t_load_ms);
+    LLAMA_LOG_INFO("%s:      sample time = %8.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)\n",
            __func__, timings.t_sample_ms, timings.n_sample, timings.t_sample_ms / timings.n_sample, 1e3 / timings.t_sample_ms * timings.n_sample);
-    LLAMA_LOG_INFO("%s: prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)\n",
+    LLAMA_LOG_INFO("%s: prompt eval time = %8.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)\n",
            __func__, timings.t_p_eval_ms, timings.n_p_eval, timings.t_p_eval_ms / timings.n_p_eval, 1e3 / timings.t_p_eval_ms * timings.n_p_eval);
-    LLAMA_LOG_INFO("%s:        eval time = %10.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)\n",
+    LLAMA_LOG_INFO("%s:        eval time = %8.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)\n",
            __func__, timings.t_eval_ms, timings.n_eval, timings.t_eval_ms / timings.n_eval, 1e3 / timings.t_eval_ms * timings.n_eval);
-    LLAMA_LOG_INFO("%s:       total time = %10.2f ms\n", __func__, (timings.t_end_ms - timings.t_start_ms));
+    LLAMA_LOG_INFO("%s:       total time = %8.2f ms\n", __func__, (timings.t_end_ms - timings.t_start_ms));
 }

 void llama_reset_timings(struct llama_context * ctx) {
--- a/prompts/LLM-questions.txt
+++ b/prompts/LLM-questions.txt
@@ -1,49 +0,0 @@
-In the context of LLMs, what is "Attention"?
-In the context of LLMs, what is a completion?
-In the context of LLMs, what is a prompt?
-In the context of LLMs, what is GELU?
-In the context of LLMs, what is RELU?
-In the context of LLMs, what is softmax?
-In the context of LLMs, what is decoding?
-In the context of LLMs, what is encoding?
-In the context of LLMs, what is tokenizing?
-In the context of LLMs, what is an embedding?
-In the context of LLMs, what is quantization?
-In the context of LLMs, what is a tensor?
-In the context of LLMs, what is a sparse tensor?
-In the context of LLMs, what is a vector?
-In the context of LLMs, how is attention implemented?
-In the context of LLMs, why is attention all you need?
-In the context of LLMs, what is "RoPe" and what is it used for?
-In the context of LLMs, what is "LoRA" and what is it used for?
-In the context of LLMs, what are weights?
-In the context of LLMs, what are biases?
-In the context of LLMs, what are checkpoints?
-In the context of LLMs, what is "perplexity"?
-In the context of LLMs, what are models?
-In the context of machine-learning, what is "catastrophic forgetting"?
-In the context of machine-learning, what is "elastic weight consolidation (EWC)"?
-In the context of neural nets, what is a hidden layer?
-In the context of neural nets, what is a convolution?
-In the context of neural nets, what is dropout?
-In the context of neural nets, what is cross-entropy?
-In the context of neural nets, what is over-fitting?
-In the context of neural nets, what is under-fitting?
-What is the difference between an interpreted computer language and a compiled computer language?
-In the context of software development, what is a debugger?
-When processing using a GPU, what is off-loading?
-When processing using a GPU, what is a batch?
-When processing using a GPU, what is a block?
-When processing using a GPU, what is the difference between a batch and a block?
-When processing using a GPU, what is a scratch tensor?
-When processing using a GPU, what is a layer?
-When processing using a GPU, what is a cache?
-When processing using a GPU, what is unified memory?
-When processing using a GPU, what is VRAM?
-When processing using a GPU, what is a kernel?
-When processing using a GPU, what is "metal"?
-In the context of LLMs, what are "Zero-Shot", "One-Shot" and "Few-Shot" learning models?
-In the context of LLMs, what is the "Transformer-model" architecture?
-In the context of LLMs, what is "Multi-Head Attention"?
-In the context of LLMs, what is "Self-Attention"?
-In the context of transformer-model architectures, how do attention mechanisms use masks?
--- a/prompts/parallel-questions.txt
+++ b/prompts/parallel-questions.txt
@@ -1,43 +0,0 @@
-What do you know about Hobbits?
-What is quantum field theory?
-Why did the chicken cross the road?
-Who is the president of the United States?
-How do I run CMake on MacOS?
-Do you agree that C++ is a really finicky language compared with Python3?
-Is it a good idea to invest in technology?
-Do you like Wagner's Ring?
-Do you think this file input option is really neat?
-What should we all do about climate change?
-Is time-travel possible within the laws of current physics?
-Is it like anything to be a bat?
-Once the chicken has crossed the road, does it try to go back?
-Who is the greatest of all musical composers?
-What is art?
-Is there life elsewhere in the universe?
-What is intelligence?
-What is the difference between knowledge and intelligence?
-Will religion ever die?
-Do we understand ourselves?
-What is the best way to cook eggs?
-If you cannot see things, on what basis do you evaluate them?
-Explain the role of the np junction in photovoltaic cells?
-Is professional sport a good or bad influence on human behaviour?
-Is capital punishment immoral?
-Should we care about other people?
-Who are you?
-Which sense would you surrender if you could?
-Was Henry Ford a hero or a villain?
-Do we need leaders?
-What is nucleosynthesis?
-Who is the greatest scientist of all time?
-Who first observed what came to be known as the photovoltaic effect?
-What is nuclear fusion and why does it release energy?
-Can you know that you exist?
-What is an exoplanet?
-Do you like cream?
-What is the difference?
-Can I know that I exist while I'm dreaming that I'm Descartes?
-Who said "I didn't know I thought that until I heard myself saying it"?
-Does anything really matter?
-Can you explain the unreasonable effectiveness of mathematics?
-
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,3 @@
-numpy==1.24.4
+numpy==1.24
 sentencepiece==0.1.98
 gguf>=0.1.0
Author	SHA1	Message	Date
Georgi Gerganov	5ab6c2132a	server-parallel : add "--reverse-prompt" + compiler warning fixes	2023-10-06 14:32:19 +03:00
FSSRepo	afc09db51c	fix json format README	2023-10-05 15:23:58 -04:00
FSSRepo	eb75395b5c	remove trail whitespace	2023-10-05 15:18:47 -04:00
FSSRepo	a7a6ceb7ae	server handling multiple clients with cam	2023-10-05 15:12:39 -04:00