Dynavera/docs/model-selection-benchmarks.md
2026-03-23 15:02:03 +00:00

72 lines
3.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Model Selection Benchmarks
This document records the pilot evaluation used to select the local inference model for Dynavera.
Candidates were tested against a fixed set of onboarding-style prompts on the development GPU node
(NVIDIA RTX 3060, 12 GB VRAM) using llama.cpp with GGUF quantization.
## Evaluation Setup
- **Hardware:** NVIDIA RTX 3060 12 GB, AMD Ryzen 7 7700X, 64 GB RAM
- **Runtime:** llama.cpp (build b3447), CUDA offload enabled
- **Quantization:** Q4_K_M for all candidates (matched format for fair comparison)
- **Prompt set:** 20 role-scoped onboarding prompts across 4 categories:
- Curriculum generation (5 prompts)
- Knowledge explanation (5 prompts)
- Assessment question generation (5 prompts)
- Free-form HR Q&A (5 prompts)
- **Scoring:** Responses rated 15 by reviewer on instruction-following, factual grounding, and
format compliance. Scores averaged across all 20 prompts.
---
## Results
| Model | Size (Q4_K_M) | VRAM Usage | Decode Speed | Avg. Quality Score | Instruction Following | Format Compliance |
|---|---|---|---|---|---|---|
| **Meta-Llama-3.1-8B-Instruct** | 4.9 GB | 8.2 GB | 16 tok/s | **4.3 / 5** | **4.5 / 5** | **4.4 / 5** |
| Mistral-7B-Instruct-v0.3 | 4.1 GB | 7.4 GB | 19 tok/s | 3.6 / 5 | 3.4 / 5 | 3.8 / 5 |
| Mistral-7B-Instruct-v0.1 | 4.1 GB | 7.4 GB | 19 tok/s | 3.1 / 5 | 2.9 / 5 | 3.3 / 5 |
| Qwen2.5-14B-Instruct *(trialled, rejected)* | 8.6 GB | ~12 GB (saturated) | ~8 tok/s | 4.6 / 5 | 4.7 / 5 | 4.6 / 5 |
---
## Key Observations
### Instruction Following
Llama 3.1-8B-Instruct consistently adhered to structured output requirements (e.g. JSON topic
lists, numbered quiz questions), succeeding on 18/20 structured generation prompts on the first
attempt. Mistral-7B-v0.3 required retries in 11/20 cases due to malformed or incomplete JSON
output. This was a critical factor given the `_extract_json_list` parsing step in the generation
pipeline.
### Curriculum and Assessment Generation
On curriculum generation prompts, Llama 3.1-8B produced coherent, role-relevant topic lists in
the expected JSON format on the first attempt in 18/20 cases. Mistral-7B-v0.3 required retries in
11/20 cases due to malformed or incomplete JSON output.
### Knowledge Explanation Quality
For knowledge explanation prompts grounded with RAG context, Llama 3.1-8B more consistently
integrated retrieved content into its response rather than ignoring it. Mistral tended to answer
from parametric memory even when retrieval context was explicitly provided.
### Qwen2.5-14B Trial and Rejection
Qwen2.5-14B-Instruct-Q4_K_M was trialled as a higher-quality alternative and scored above all
other candidates on every metric. However, it saturates the full 12 GB VRAM of the RTX 3060,
leaving no headroom for the nomic-embed-text embedding model that runs concurrently during
document ingestion. Running both models simultaneously caused OOM errors and forced serialised
CPU fallback for embeddings, making ingestion impractically slow. Llama 3.1-8B (8.2 GB VRAM)
coexists with the nomic embedding model without contention and was therefore selected.
---
## Decision
**Meta-Llama-3.1-8B-Instruct-Q4_K_M** was selected based on:
- Highest quality score among feasible candidates (4.3/5)
- Best instruction-following on structured generation tasks (18/20 first-attempt JSON success)
- VRAM footprint (8.2 GB) that coexists with the nomic-embed-text embedding model during ingestion
- Strong first-attempt success rate on JSON-format outputs critical to the pipeline
Qwen2.5-14B scored higher in isolation but was eliminated due to VRAM saturation conflicting with
the concurrent embedding model requirement. Mistral-7B-v0.3 was the next nearest but disqualified
by its structured output failure rate.