Compare commits

...

63 Commits

Author SHA1 Message Date
Georgi Gerganov
f49c636db0 llama-eval : protect dump() with lock for thread safety
Assisted-by: llama.cpp:local pi
2026-05-10 21:52:43 +03:00
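The pattern behind this fix is small but easy to get wrong: a checkpoint dump must take the same lock as the mutators. A minimal sketch of the idea (illustrative only; the real EvalState carries many more fields):

```python
import json
import threading

class EvalState:
    """Illustrative only; the real EvalState carries many more fields."""
    def __init__(self):
        self.lock = threading.Lock()
        self.task_states = {}

    def update(self, task_id, result):
        with self.lock:
            self.task_states[task_id] = result

    def dump(self, path):
        # taking the same lock as update() means a checkpoint can never
        # serialize a half-mutated state
        with self.lock:
            with open(path, "w") as f:
                json.dump(self.task_states, f, indent=2)
```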
Georgi Gerganov
d5165e8f2e llama-eval : require --grader-model or --model when using --grader-type llm
Assisted-by: llama.cpp:local pi
2026-05-10 21:49:58 +03:00
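A sketch of the kind of argparse check this implies — only --grader-type, --grader-model, and --model come from the commit subject; the rest of the parser setup is assumed:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model")
parser.add_argument("--grader-model")
parser.add_argument("--grader-type", default="regex")  # choices assumed
args = parser.parse_args()

# an LLM grader needs a model to grade with
if args.grader_type == "llm" and not (args.grader_model or args.model):
    parser.error("--grader-type llm requires --grader-model or --model")
```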
Georgi Gerganov
85c6aa006d llama-server-simulator : fix comment - Dice coefficient, not Levenshtein
Assisted-by: llama.cpp:local pi
2026-05-10 21:49:02 +03:00
Georgi Gerganov
e5ac6d1da6 llama-eval : track model name in eval state and verify on resume
- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming

Assisted-by: llama.cpp:local pi
2026-05-10 21:43:35 +03:00
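A hedged sketch of the resume-time check — the model_name key is taken from the bullet list above, while the function shape and state file layout are assumptions:

```python
import json
import sys
from pathlib import Path

def verify_resume_model(state_path: Path, cli_model: str):
    """Refuse to resume if --model differs from the model stored in the state."""
    state = json.loads(state_path.read_text())
    stored = state.get("model_name")  # key name from the commit message above
    if cli_model and stored and cli_model != stored:
        sys.exit(f"error: --model '{cli_model}' does not match stored model '{stored}'")
```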
Georgi Gerganov
094554dbcc llama-eval : update README with PR link and quick-start examples
Assisted-by: llama.cpp:local pi
2026-05-10 21:22:48 +03:00
Georgi Gerganov
f64d56bcd8 llama-server-simulator : replace Flask with stdlib http.server
- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports

Assisted-by: llama.cpp:local pi
2026-05-10 20:47:08 +03:00
ggerganov
43f14a0a46 llama-eval : support multiple evaluation endpoints with dynamic task distribution
- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before

Assisted-by: llama.cpp:local pi
2026-05-10 20:42:23 +03:00
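The distribution scheme is the interesting part: instead of statically splitting tasks per server, every worker pulls from one shared queue, so a faster server simply drains more of it. The commit uses one ThreadPoolExecutor per server; the sketch below uses plain threads to stay short, and send_request / record are hypothetical callables:

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class ServerConfig:  # mirrors the dataclass named in the commit
    url: str
    threads: int
    name: str

def run_distributed(tasks, servers, send_request, record):
    # send_request(url, task) -> dict and record(result) are hypothetical;
    # record must be thread-safe (cf. the EvalState lock above)
    q = queue.Queue()
    for task in tasks:
        q.put(task)

    def worker(server: ServerConfig):
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # queue drained; fast servers simply got here later
            result = send_request(server.url, task)
            result["server_name"] = server.name  # track who did the work
            record(result)

    threads = [threading.Thread(target=worker, args=(s,))
               for s in servers for _ in range(s.threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```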
Georgi Gerganov
d26b1ffcc9 llama-eval : rename display, escaped, and count variables to use prefix convention
- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)

Assisted-by: llama.cpp:local pi
2026-05-10 19:24:29 +03:00
Georgi Gerganov
9f10d8d195 llama-eval : add per-task generation time from server timings
Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.

Assisted-by: llama.cpp:local pi
2026-05-10 19:15:34 +03:00
Georgi Gerganov
4d5dedc569 llama-eval : add per-task generation speed from server timings
Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.

Assisted-by: llama.cpp:local pi
2026-05-10 19:05:20 +03:00
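These two commits read adjacent fields of the same timings object in the server response. A minimal sketch of the extraction, assuming response_json is the parsed completion response and the server was configured to return timings:

```python
def extract_timings(response_json: dict):
    """Pull generation speed/time from a parsed llama-server response."""
    timings = response_json.get("timings", {})
    tps_gen = timings.get("predicted_per_second")  # tokens/s during generation
    t_gen_ms = timings.get("predicted_ms")         # total generation time, ms
    if tps_gen is not None and t_gen_ms is not None:
        # seconds with one decimal digit, as in the console progress display
        print(f"t_gen: {t_gen_ms / 1000.0:.1f}s ({tps_gen:.1f} t/s)")
    return tps_gen, t_gen_ms
```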
Georgi Gerganov
81a65cf035 eval : add Wilson score confidence interval to results
Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.
2026-05-10 18:46:36 +03:00
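For k correct out of n with p = k/n and z = 1.96 (95%), the Wilson interval is centered at (p + z²/(2n)) / (1 + z²/n) with half-width (z / (1 + z²/n)) · sqrt(p(1−p)/n + z²/(4n²)). A direct transcription, needing only the running counts:

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total))
    return (max(0.0, center - half), min(1.0, center + half))

# e.g. 80/100 correct -> roughly (0.71, 0.87)
print(wilson_ci(80, 100))
```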
Georgi Gerganov
7d433f767b eval : unify "judge" terminology to "grader"
Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).

Assisted-by: llama.cpp:local pi
2026-05-10 18:23:28 +03:00
Georgi Gerganov
633a68d6c2 remove junk 2026-05-10 18:13:50 +03:00
Georgi Gerganov
e0a2cf48ca track total time 2026-05-10 18:13:50 +03:00
Georgi Gerganov
bad9565a1e refactor 2026-05-10 18:13:50 +03:00
Georgi Gerganov
752b703a5e reasoning and error handling 2026-05-10 18:13:50 +03:00
Georgi Gerganov
fc571f3a1e add tokens 2026-05-10 18:13:50 +03:00
Georgi Gerganov
6797d80dff store full response 2026-05-10 18:13:50 +03:00
Georgi Gerganov
3649793811 add html 2026-05-10 18:13:50 +03:00
Georgi Gerganov
7e8c88c5e0 fix prompts 2026-05-10 18:13:49 +03:00
Georgi Gerganov
2e0b6766f3 simplify 2026-05-10 18:13:49 +03:00
Georgi Gerganov
f95f4dd1ca fix counts 2026-05-10 18:13:49 +03:00
Georgi Gerganov
095c8ab655 cleanup 2026-05-10 18:13:49 +03:00
Georgi Gerganov
d830acacc5 resume eval 2026-05-10 18:13:49 +03:00
Georgi Gerganov
f35b10f0a9 ignore errors 2026-05-10 18:13:49 +03:00
Georgi Gerganov
802d85e26e add AGENTS.md 2026-05-10 18:13:49 +03:00
Georgi Gerganov
91bd92c6b6 cleanup 2026-05-10 18:13:48 +03:00
Georgi Gerganov
f20b5a72cf datasets : fix aime2025 2026-05-10 18:13:48 +03:00
Georgi Gerganov
122dfe3eab grade : improve regex + logs 2026-05-10 18:13:48 +03:00
Georgi Gerganov
8b94ab4f4a grader : update prompt 2026-05-10 18:13:48 +03:00
Georgi Gerganov
f99d77f3bd datasets : add aime2025 2026-05-10 18:13:48 +03:00
Georgi Gerganov
55a7cf4a06 cont 2026-05-10 18:13:48 +03:00
Georgi Gerganov
6e7e1a5a63 grader : improve example answers 2026-05-10 18:13:48 +03:00
Georgi Gerganov
9f02fa6382 rename 2026-05-10 18:13:47 +03:00
Georgi Gerganov
e7b8646098 add gpqa + sampling + docs 2026-05-10 18:13:47 +03:00
Georgi Gerganov
55ce1b4e2f datasets : add gsm8k 2026-05-10 18:13:47 +03:00
Georgi Gerganov
abec77e068 remove old files 2026-05-10 18:13:47 +03:00
Georgi Gerganov
65e3c5a928 docs 2026-05-10 18:13:47 +03:00
Georgi Gerganov
4f176f6a4d improve grader 2026-05-10 18:13:47 +03:00
Georgi Gerganov
9578e83ac2 minor 2026-05-10 18:13:47 +03:00
Georgi Gerganov
530f38f9c3 eval : support multiple dataset runs 2026-05-10 18:13:46 +03:00
Georgi Gerganov
cda8cae01a sim : fix answer matching 2026-05-10 18:13:46 +03:00
Georgi Gerganov
64720e1e01 test : fix path 2026-05-10 18:13:46 +03:00
Georgi Gerganov
1a780f7c44 eval : add prompts 2026-05-10 18:13:46 +03:00
Georgi Gerganov
940364e4c9 eval : print progress 2026-05-10 18:13:46 +03:00
Georgi Gerganov
ee9b715eb6 examples: add task summary table to llama-eval-new.py 2026-05-10 18:13:46 +03:00
Georgi Gerganov
d639ee52ea docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
2026-05-10 18:13:46 +03:00
Georgi Gerganov
fb40d1a04a examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
2026-05-10 18:13:45 +03:00
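A minimal sketch of that shape, assuming a thread-safe process_single_case(case) is supplied by the caller:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_eval(cases, process_single_case, n_threads: int):
    # process_single_case must be thread-safe, as noted in the commit
    results = []
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(process_single_case, case) for case in cases]
        for fut in as_completed(futures):
            results.append(fut.result())
            # progress arrives in completion order, not submission order
            print(f"progress: {len(results)}/{len(cases)}")
    return results
```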
Georgi Gerganov
2fe445cc60 docs: update llama-eval-discussion.md with session work summary 2026-05-10 18:13:45 +03:00
Georgi Gerganov
3732aea2df examples: use cached dataset path in simulator to avoid HF Hub requests 2026-05-10 18:13:45 +03:00
Georgi Gerganov
edc766c919 examples: use cached dataset path to avoid HF Hub requests 2026-05-10 18:13:45 +03:00
Georgi Gerganov
d7d2c22909 examples: remove HF_HUB_OFFLINE to allow dataset download 2026-05-10 18:13:45 +03:00
Georgi Gerganov
30ea5124de examples: use HF_HUB_OFFLINE to avoid HF Hub warnings 2026-05-10 18:13:45 +03:00
Georgi Gerganov
0ca458d892 examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
2026-05-10 18:13:45 +03:00
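The two grading paths described above can be sketched as one dispatch function. The CLI invocation and 30-second timeout come from the commit; the default regex and the exit-code-0-means-correct convention are assumptions:

```python
import re
import subprocess

def grade(predicted: str, expected: str, grader_type: str,
          pattern: str = r"\\boxed\{([^}]*)\}", cli_cmd=None) -> bool:
    if grader_type == "regex":
        # default pattern pulls the contents of \boxed{...}; the real code
        # has per-dataset patterns (AIME, GSM8K, MMLU, ...)
        m = re.search(pattern, predicted)
        return m is not None and m.group(1).strip() == expected.strip()
    if grader_type == "cli":
        # interface from the commit: script.py --answer <pred> --expected <gold>;
        # treating exit code 0 as "correct" is an assumption
        proc = subprocess.run(
            cli_cmd + ["--answer", predicted, "--expected", expected],
            capture_output=True,
            timeout=30,  # the commit's 30-second CLI grader timeout
        )
        return proc.returncode == 0
    raise ValueError(f"unknown grader type: {grader_type}")
```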
Georgi Gerganov
de8eda468b docs: remove README.md from llama-eval 2026-05-10 18:13:44 +03:00
Georgi Gerganov
a2b96e0444 examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
2026-05-10 18:13:44 +03:00
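A sketch of the structured-state idea from this commit (field names are illustrative; the actual EvalState grew more fields in later commits):

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class EvalState:
    dataset: str
    cases: dict = field(default_factory=dict)  # case id -> result dict

    def save(self, path: str):
        # structured JSON keeps the eval state inspectable and resumable
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

state = EvalState(dataset="aime")
state.cases["1"] = {"expected": "116", "predicted": "116", "correct": True}
state.save("eval-state.json")
```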
Georgi Gerganov
deed078654 docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
2026-05-10 18:13:44 +03:00
Georgi Gerganov
05b8425bd6 examples: refactor test-simulator.sh for better readability
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
2026-05-10 18:13:44 +03:00
Georgi Gerganov
58bd57ba99 examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.
2026-05-10 18:13:44 +03:00
gatbontonpc
5cbe95b6e5 add checkpointing 2026-05-10 18:13:44 +03:00
gatbontonpc
c7f3ce25f5 Add readme 2026-05-10 18:13:44 +03:00
gatbontonpc
4db4497ca7 multi source llama-eval 2026-05-10 18:13:43 +03:00
gatbontonpc
db8b09d6e8 working llama-eval mc and math suite 2026-05-10 18:13:42 +03:00
4 changed files with 1857 additions and 0 deletions

View File: examples/llama-eval/README.md

@@ -0,0 +1,26 @@
# llama-eval

Simple evaluation tool for llama.cpp with support for multiple datasets.

For a full description, usage examples, and sample results, see:

- [PR 21152](https://github.com/ggml-org/llama.cpp/pull/21152)

## Quick start

```bash
# Single server
python3 llama-eval.py \
    --server http://localhost:8033 \
    --model my-model \
    --dataset gsm8k --n_cases 100 \
    --grader-type regex --threads 32

# Multiple servers (comma-separated URLs and thread counts)
python3 llama-eval.py \
    --server http://gpu1:8033,http://gpu2:8033 \
    --server-name gpu1,gpu2 \
    --threads 16,16 \
    --dataset aime2025 --n_cases 240 \
    --grader-type regex
```

1428 additions
examples/llama-eval/llama-eval.py Executable file

File diff suppressed because it is too large.

View File: examples/llama-eval/llama-server-simulator.py

@@ -0,0 +1,317 @@
#!/usr/bin/env python3
import argparse
import json
import random
import re
import time
import sys
import os
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler
from typing import Dict, List, Optional
from dataclasses import dataclass
from pathlib import Path

import datasets

# Set cache directory for HuggingFace datasets
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
cache_dir.mkdir(parents=True, exist_ok=True)
os.environ["HF_DATASETS_CACHE"] = str(cache_dir)
def dice(s1: str, s2: str) -> float:
    """Calculate Dice coefficient between two strings based on bigram overlap."""
    if not s1 and not s2:
        return 1.0

    def _bigrams(s: str):
        return [s[i : i + 2] for i in range(len(s) - 1)]

    bigrams1 = _bigrams(s1)
    bigrams2 = _bigrams(s2)
    if not bigrams1 and not bigrams2:
        return 1.0

    from collections import Counter

    freq1 = Counter(bigrams1)
    freq2 = Counter(bigrams2)
    intersection = sum(min(freq1[bg], freq2[bg]) for bg in freq1)
    dice_coeff = 2 * intersection / (len(bigrams1) + len(bigrams2))
    return dice_coeff
def debug_log(message: str):
    """Log debug messages to both stderr and a file."""
    print(message, file=sys.stderr)
    with open("/tmp/simulator-debug.log", "a") as f:
        f.write(message + "\n")
simulator: Optional["Simulator"] = None


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict]
    sampling_config: Dict


def normalize_number(s: str) -> Optional[int]:
    match = re.match(r"\d+", s)  # match digits from the start
    if not match:
        return None
    return int(match.group(0))
class AimeDataset:
    def __init__(self, split: str = "train"):
        self.split = split
        self.questions: List[Dict] = []
        self._load_dataset()

    def _load_dataset(self):
        print(f"Loading AIME dataset (split: {self.split})...")
        cache_path = Path.home() / ".cache" / "huggingface" / "datasets" / "AI-MO___aimo-validation-aime" / "default" / "0.0.0"
        if cache_path.exists():
            print(f"Using cached dataset from {cache_path}")
            ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split, cache_dir=str(cache_path))
        else:
            ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
        self.questions = list(ds)
        print(f"AIME dataset loaded: {len(self.questions)} questions")

    def find_question(self, request_text: str) -> Optional[Dict]:
        best_match = None
        best_distance = -1
        best_index = -1
        request_lower = request_text.lower()
        for i, question in enumerate(self.questions):
            question_text = question["problem"]
            question_lower = question_text.lower()
            # Exact match
            if question_lower == request_lower:
                debug_log(f"DEBUG: Found exact match at index {i}")
                return question
            # Remove LaTeX formatting for more flexible matching
            question_no_latex = re.sub(r'\$[^$]+\$', '', question_text)
            if question_no_latex.lower() == request_lower:
                debug_log(f"DEBUG: Found match (no LaTeX) at index {i}")
                return question
            # Calculate Dice coefficient for partial matches
            # Only consider if request is at least 50% of question length
            if len(request_lower) >= len(question_lower) * 0.5:
                distance = dice(question_lower, request_lower)
                if distance > best_distance:
                    best_distance = distance
                    best_match = question
                    best_index = i
        if best_match and best_distance > 0.3:  # Threshold for partial match
            debug_log(f"DEBUG: Found best partial match at index {best_index} with distance {best_distance:.3f}")
            return best_match
        debug_log(f"DEBUG: No matching question found for: {request_text[:100]}...")
        return None

    def get_answer(self, question: Dict) -> str:
        answer = question["answer"]
        if isinstance(answer, str):
            normalized = normalize_number(answer)
            return str(normalized) if normalized is not None else answer
        return str(answer)
class Simulator:
    def __init__(
        self,
        port: int = 8033,
        host: str = "localhost",
        success_rate: float = 0.8,
        dataset_split: str = "train"
    ):
        self.port = port
        self.host = host
        self.success_rate = success_rate
        self.dataset = AimeDataset(dataset_split)
        self.eval_state = EvalState(
            id="aime-2025",
            tasks=["aime"],
            task_states={},
            sampling_config={"temperature": 0, "max_tokens": 2048}
        )

    def _generate_response(
        self,
        question: Dict,
        should_be_correct: bool
    ) -> Dict:
        expected_answer = self.dataset.get_answer(question)
        if should_be_correct:
            response_text = expected_answer
        else:
            response_text = self._generate_wrong_answer(question)
        return {
            "id": f"chatcmpl-{int(time.time())}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": "llama",
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": response_text
                    },
                    "finish_reason": "stop"
                }
            ],
            "usage": {
                "prompt_tokens": 100,
                "completion_tokens": 50,
                "total_tokens": 150
            }
        }

    def _generate_wrong_answer(self, question: Dict) -> str:
        expected_answer = self.dataset.get_answer(question)
        if expected_answer.isdigit():
            wrong_answer = str(int(expected_answer) + 1)
        else:
            wrong_answer = expected_answer + " (wrong)"
        return wrong_answer

    def _process_request(self, request_data: Dict) -> Dict:
        messages = request_data.get("messages", [])
        if not messages:
            return {"error": "No messages in request"}
        request_text = messages[0].get("content", "")
        debug_log(f"DEBUG: Received request with content: {request_text[:150]}...")
        question = self.dataset.find_question(request_text)
        if not question:
            debug_log("DEBUG: find_question returned None")
            return {"error": "No matching question found"}
        should_be_correct = random.random() < self.success_rate
        response = self._generate_response(question, should_be_correct)
        task_id = "aime"
        self.eval_state.task_states[task_id] = {
            "correct": should_be_correct,
            "expected": self.dataset.get_answer(question),
            "predicted": response["choices"][0]["message"]["content"]
        }
        return response
class RequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self._send_json({"error": "Not found"}, 404)
            return
        try:
            content_length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(content_length)
            request_data = json.loads(body) if body else None
            if not request_data:
                self._send_json({"error": "Invalid JSON"}, 400)
                return
            if simulator is None:
                self._send_json({"error": "Simulator not initialized"}, 500)
                return
            response = simulator._process_request(request_data)
            self._send_json(response, 200)
        except json.JSONDecodeError:
            self._send_json({"error": "Invalid JSON"}, 400)
        except Exception as e:
            print(f"Error processing request: {e}")
            self._send_json({"error": str(e)}, 500)

    def _send_json(self, data: dict, status: int = 200):
        body = json.dumps(data).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress default request logging
        pass
def main():
    parser = argparse.ArgumentParser(
        description="llama-server simulator for testing eval scripts"
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8033,
        help="Server port (default: 8033)"
    )
    parser.add_argument(
        "--host",
        type=str,
        default="localhost",
        help="Server host (default: localhost)"
    )
    parser.add_argument(
        "--success-rate",
        type=float,
        default=0.8,
        help="Success rate 0-1 (default: 0.8)"
    )
    parser.add_argument(
        "--dataset-split",
        type=str,
        default="train",
        help="AIME dataset split to use (default: train)"
    )
    args = parser.parse_args()

    global simulator
    simulator = Simulator(
        port=args.port,
        host=args.host,
        success_rate=args.success_rate,
        dataset_split=args.dataset_split
    )

    server = HTTPServer((args.host, args.port), RequestHandler)
    server_thread = threading.Thread(target=server.serve_forever, daemon=True)
    server_thread.start()

    print("\n=== llama-server-simulator ===")
    print(f"Server running on http://{args.host}:{args.port}")
    print(f"Success rate: {args.success_rate}")
    print(f"AIME dataset loaded: {len(simulator.dataset.questions)} questions")
    print("\nPress Ctrl+C to stop\n")

    try:
        server_thread.join()
    except KeyboardInterrupt:
        print("\nShutting down...")
        server.shutdown()


if __name__ == "__main__":
    main()

View File: examples/llama-eval/test-simulator.sh

@@ -0,0 +1,86 @@
#!/bin/bash
set -e

# Get the directory where this script is located
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

echo "=== llama-server-simulator Test Script ==="
echo ""

PORT=8033
SUCCESS_RATE=0.8
TEST_PORT=8034

echo "Starting simulator on port $PORT with success rate $SUCCESS_RATE..."
source "$SCRIPT_DIR/venv/bin/activate"
python3 "$SCRIPT_DIR/llama-server-simulator.py" --port $PORT --success-rate $SUCCESS_RATE > /tmp/simulator-test.log 2>&1 &
SIMULATOR_PID=$!

echo "Waiting for simulator to start..."
sleep 5
# Helper function to make a request and extract the answer
make_request() {
    local question="$1"
    curl -s -X POST http://localhost:$PORT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d "{
            \"model\": \"llama\",
            \"messages\": [
                {\"role\": \"user\", \"content\": \"$question\"}
            ],
            \"temperature\": 0,
            \"max_tokens\": 2048
        }" | python3 -c "import sys, json; data = json.load(sys.stdin); print(data.get('choices', [{}])[0].get('message', {}).get('content', data.get('error', 'No response')))"
}
# Test question (repeated in multiple tests)
TEST_QUESTION="Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."

echo ""
echo "=== Test 1: Correct Answer ==="
echo "Sending request with known question..."
answer=$(make_request "$TEST_QUESTION")
echo "Answer: $answer"
echo "Expected: 116"
echo "Correct: $([ "$answer" == "116" ] && echo "Yes" || echo "No")"

echo ""
echo "=== Test 2: Wrong Answer ==="
echo "Sending the same question again (the simulator runs at success rate $SUCCESS_RATE, so ~20% of answers are wrong)..."
answer=$(make_request "$TEST_QUESTION")
echo "Answer: $answer"
echo "Expected: 116"
echo "Correct: $([ "$answer" == "116" ] && echo "Yes" || echo "No")"

echo ""
echo "=== Test 3: No Matching Question ==="
echo "Sending request with non-matching text..."
response=$(make_request "What is the capital of France?")
echo "Response: $response"
echo "Expected: No matching question found"
echo "Correct: $([ "$response" == "No matching question found" ] && echo "Yes" || echo "No")"
echo ""
echo "=== Test 4: Success Rate Verification ==="
echo "Sending 10 requests to test success rate..."
correct_count=0
for i in {1..10}; do
answer=$(make_request "$TEST_QUESTION")
if [ "$answer" == "116" ]; then
correct_count=$((correct_count + 1))
fi
echo " Request $i: Answer = $answer"
done
echo "Correct answers: $correct_count/10"
echo "Expected: ~8/10 (80% success rate)"
echo "Success rate: $(echo "scale=1; $correct_count * 10" | bc)%"
echo ""
echo "=== Test Complete ==="
echo "Stopping simulator..."
kill $SIMULATOR_PID 2>/dev/null
wait $SIMULATOR_PID 2>/dev/null || true
echo "Simulator stopped."