How do AI engineers actually evaluate LLM/RAG systems in practice?
I’ve built multiple LLM/AI projects so far, but I realized I never properly learned how evaluation is actually done in real AI engineering workflows.
Recently I’ve been reading AI Engineering by Chip Huyen, and one thing that stood out was the idea that you should evaluate every layer of the system, not just the final output:
- prompts
- retrieval quality in RAG
- chunking
- reranking
- hallucinations
- latency/cost
- end-to-end answer quality
- AI-as-a-judge systems, etc.
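To make the last point concrete, here's a rough sketch of what I *imagine* an AI-as-a-judge check looks like: a second model grades each answer for faithfulness to the retrieved context using a fixed rubric. The `call_llm` function is just a hypothetical stand-in for whatever client you'd use, so no idea how close this is to what real teams run:

```python
# Rough sketch of "AI-as-a-judge": ask a second model to grade each answer
# against the question and retrieved context with a fixed rubric.
# `call_llm` is a hypothetical stand-in for the actual LLM client.

JUDGE_PROMPT = """You are grading an answer to a user question.
Question: {question}
Context the answer was supposed to use: {context}
Answer: {answer}

Score the answer from 1 to 5 for faithfulness to the context (5 = fully
supported, 1 = contradicts or invents facts). Reply with only the number."""

def judge_faithfulness(call_llm, question: str, context: str, answer: str) -> int:
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    reply = call_llm(prompt)  # stand-in: returns the judge model's text reply
    try:
        return int(reply.strip()[0])  # naive parse; real pipelines are stricter
    except (ValueError, IndexError):
        return 0  # treat unparseable judge output as a failure to review by hand
```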
What I’m confused about is how engineers actually do this in practice.
For example:
- Do people usually create their own eval datasets?
- Or do you use public benchmark datasets?
- How do you evaluate retrieval quality specifically? (rough sketch of what I’m picturing after this list)
- How are prompts compared systematically?
- How much of evaluation is automated vs human review?
- What tools/platforms are commonly used in industry right now?
- Are frameworks like Ragas, DeepEval, LangSmith, TruLens, etc. actually used in production?
- How do teams prevent regressions when changing prompts/models/chunking strategies?
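For the retrieval-quality question specifically, this is roughly what I’m picturing: a small hand-labeled set of questions with known relevant chunk IDs, then recall@k and MRR over whatever the retriever returns. The `retrieve` function and the example data here are made up for illustration:

```python
# Minimal sketch of retrieval eval: hand-labeled (question -> relevant chunk ids),
# then recall@k and MRR over the retriever's ranked results.

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant chunks that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical hand-labeled eval set: question -> ids of chunks that answer it.
eval_set = [
    {"question": "What is our refund window?", "relevant": {"policy_12"}},
    {"question": "How do I rotate API keys?", "relevant": {"docs_api_3", "docs_api_4"}},
]

def evaluate_retriever(retrieve, k=5):
    """`retrieve(question, k)` is a stand-in returning a ranked list of chunk ids."""
    recalls, mrrs = [], []
    for example in eval_set:
        retrieved = retrieve(example["question"], k=k)
        recalls.append(recall_at_k(retrieved, example["relevant"], k))
        mrrs.append(mrr(retrieved, example["relevant"]))
    return sum(recalls) / len(recalls), sum(mrrs) / len(mrrs)
```

Is this kind of hand-rolled harness what people actually maintain, or does a framework usually handle it?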
I think I’m missing the “engineering mindset” around evaluation. Until now I’ve mostly been doing:
> the outputs look good enough
But I want to learn how people build reliable evaluation pipelines and iterate systematically.
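For the regression question above, for example, I’m imagining something like a fixed “golden set” that runs on every prompt/model/chunking change, with a hard failure if the score drops. This is only a guess at the shape of it; `generate_answer(question, config)` is a hypothetical wrapper around the LLM call, and the data is made up:

```python
# Sketch of a "regression test for prompts": score a fixed golden set under the
# current and candidate configs, fail if the candidate scores worse.

golden_set = [
    {"question": "What plans include SSO?", "must_contain": ["Enterprise"]},
    {"question": "What is the default rate limit?", "must_contain": ["100", "requests"]},
]

def score_answer(answer: str, must_contain: list[str]) -> float:
    """Cheap automated check: fraction of required phrases present in the answer."""
    hits = sum(1 for phrase in must_contain if phrase.lower() in answer.lower())
    return hits / len(must_contain)

def run_eval(generate_answer, config) -> float:
    scores = [
        score_answer(generate_answer(ex["question"], config), ex["must_contain"])
        for ex in golden_set
    ]
    return sum(scores) / len(scores)

def check_regression(generate_answer, baseline_config, candidate_config, tolerance=0.02):
    """Fail loudly if the candidate config scores worse than the baseline."""
    baseline = run_eval(generate_answer, baseline_config)
    candidate = run_eval(generate_answer, candidate_config)
    assert candidate >= baseline - tolerance, (
        f"Candidate config regressed: {candidate:.2f} vs baseline {baseline:.2f}"
    )
```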
Would really appreciate:
- practical workflows
- examples from real projects
- beginner-friendly resources
- advice on what I should build to learn this properly
Especially interested in RAG + agent evaluation.
Thanks!