How to replace vibes with a small task-specific dataset, adversarial prompts, and a scorecard.
LLM evaluation
What it is
LLM evaluation is the habit of measuring a model-backed workflow against a task-specific dataset. It is not a one-time benchmark. It is a feedback loop that says whether the system is getting better or merely sounding better.
Learning goal
Learn to design a small evaluation set that reflects the actual workflow: normal prompts, ambiguous prompts, adversarial prompts, Danish prompts, English prompts, and prompts that should fail or escalate.
Why it matters in production
Without evaluation, quality becomes anecdotal. A demo can look good while the system fails edge cases, invents unsupported answers, leaks boundaries, or regresses after a model change. A small scorecard gives the team a way to compare versions and decide whether a change is safe enough to keep.
How I actually build it
The first FOS evaluation slice should be intentionally small:
- Choose one workflow, ideally retrieval or answer generation over FOS context.
- Write around 30 prompts.
- Make at least half of them adversarial or boundary-testing.
- Mix Danish and English.
- Score faithfulness, hallucination, answer relevance, refusal behavior, and escalation behavior.
- Commit the dataset and a scorecard artifact.
The tool can be DeepEval, RAGAS, or a small custom harness if the first version needs fewer dependencies. The important part is repeatability.
Practice loop
- Write the first 10 obvious prompts.
- Add 10 prompts that try to break assumptions.
- Add 10 prompts that represent real user ambiguity.
- Run the current system.
- Read failures before changing the system.
- Change one thing and rerun.
Proof artifact
A useful proof is a dataset, command, and scorecard. The scorecard should show the date, model route, pass/fail dimensions, and the top three fixes suggested by the failures.
Current status
This capability is planned but not yet built in FOS. It should follow the guardrails slice so the evaluation set can include guardrail-specific prompts.
What worked, what didn't
The expected trap is overbuilding the metric suite before the dataset is good. The first dataset should be small enough to read manually and annoying enough to catch real mistakes.
Next build
Create a child ticket for a 30-prompt FOS evaluation harness and require the scorecard to be committed or exported as a public-safe artifact.
Further reading
- Guardrails
- Red-teaming
- RAGAS and DeepEval documentation for task-specific evaluation patterns.