Model Selection Benchmarks

This document records the pilot evaluation used to select the local inference model for Dynavera. Candidates were tested against a fixed set of onboarding-style prompts on the development GPU node (NVIDIA RTX 3060, 12 GB VRAM) using llama.cpp with GGUF quantization.

Evaluation Setup

Hardware: NVIDIA RTX 3060 12 GB, AMD Ryzen 7 7700X, 64 GB RAM
Runtime: llama.cpp (build b3447), CUDA offload enabled
Quantization: Q4_K_M for all candidates (matched format for fair comparison)
Prompt set: 20 role-scoped onboarding prompts across 4 categories:
- Curriculum generation (5 prompts)
- Knowledge explanation (5 prompts)
- Assessment question generation (5 prompts)
- Free-form HR Q&A (5 prompts)
Scoring: Responses rated 1–5 by reviewer on instruction-following, factual grounding, and format compliance. Scores averaged across all 20 prompts.

Results

Model	Size (Q4_K_M)	VRAM Usage	Decode Speed	Avg. Quality Score	Instruction Following	Format Compliance
Meta-Llama-3.1-8B-Instruct	4.9 GB	8.2 GB	16 tok/s	4.3 / 5	4.5 / 5	4.4 / 5
Mistral-7B-Instruct-v0.3	4.1 GB	7.4 GB	19 tok/s	3.6 / 5	3.4 / 5	3.8 / 5
Mistral-7B-Instruct-v0.1	4.1 GB	7.4 GB	19 tok/s	3.1 / 5	2.9 / 5	3.3 / 5
Qwen2.5-14B-Instruct (trialled, rejected)	8.6 GB	~12 GB (saturated)	~8 tok/s	4.6 / 5	4.7 / 5	4.6 / 5

Key Observations

Instruction Following

Llama 3.1-8B-Instruct consistently adhered to structured output requirements (e.g. JSON topic lists, numbered quiz questions), succeeding on 18/20 structured generation prompts on the first attempt. Mistral-7B-v0.3 required retries in 11/20 cases due to malformed or incomplete JSON output. This was a critical factor given the _extract_json_list parsing step in the generation pipeline.

Curriculum and Assessment Generation

On curriculum generation prompts, Llama 3.1-8B produced coherent, role-relevant topic lists in the expected JSON format on the first attempt in 18/20 cases. Mistral-7B-v0.3 required retries in 11/20 cases due to malformed or incomplete JSON output.

Knowledge Explanation Quality

For knowledge explanation prompts grounded with RAG context, Llama 3.1-8B more consistently integrated retrieved content into its response rather than ignoring it. Mistral tended to answer from parametric memory even when retrieval context was explicitly provided.

Qwen2.5-14B Trial and Rejection

Qwen2.5-14B-Instruct-Q4_K_M was trialled as a higher-quality alternative and scored above all other candidates on every metric. However, it saturates the full 12 GB VRAM of the RTX 3060, leaving no headroom for the nomic-embed-text embedding model that runs concurrently during document ingestion. Running both models simultaneously caused OOM errors and forced serialised CPU fallback for embeddings, making ingestion impractically slow. Llama 3.1-8B (8.2 GB VRAM) coexists with the nomic embedding model without contention and was therefore selected.

Decision

Meta-Llama-3.1-8B-Instruct-Q4_K_M was selected based on:

Highest quality score among feasible candidates (4.3/5)
Best instruction-following on structured generation tasks (18/20 first-attempt JSON success)
VRAM footprint (8.2 GB) that coexists with the nomic-embed-text embedding model during ingestion
Strong first-attempt success rate on JSON-format outputs critical to the pipeline

Qwen2.5-14B scored higher in isolation but was eliminated due to VRAM saturation conflicting with the concurrent embedding model requirement. Mistral-7B-v0.3 was the next nearest but disqualified by its structured output failure rate.

3.7 KiB Raw Blame History Unescape Escape