Evaluating a Fine-tuned LLM vs. an Enterprise LLM (GPT/Claude) for Marketing RCA - Framework Suggestions?
Hi everyone,
I’m working on a use case around root cause analysis (RCA) for marketing campaign performance, and we’re evaluating two different system architectures:
System 1: Enterprise LLM + RAG
- Model: GPT / Claude (API-based)
- Uses structured campaign data via RAG (SQL + semantic layer)
- No fine-tuning
- Relies on prompt engineering + retrieval
System 2: Fine-tuned Open Model + RAG
- Base model: evaluating models from Hugging Face
- Fine-tuned using LoRA on historical RCA cases
- Same RAG pipeline for campaign data
Our Goal:
We want to compare which system performs better at:
- Generating accurate RCA explanations
- Identifying key drivers (pricing, audience, channel, etc.)
- Producing actionable insights
My question:
Does this comparison framework make sense, or are we missing a better baseline?
What would be a robust evaluation checklist for this use case?
So far, we are thinking along these lines:
- Accuracy of identified root causes
- Consistency across similar scenarios
- Business interpretability
- Latency & cost
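For the "accuracy of identified root causes" item, one way we've been thinking about making it measurable despite fuzzy ground truth is set overlap against analyst-annotated driver labels. A toy sketch (the driver taxonomy, case data, and field names are all made up for illustration):

```python
from dataclasses import dataclass

# Hypothetical driver taxonomy for campaign RCA.
DRIVERS = {"pricing", "audience", "channel", "creative", "seasonality"}

@dataclass
class RCACase:
    campaign_id: str
    annotated_drivers: set[str]  # analyst-labeled acceptable root causes
    predicted_drivers: set[str]  # drivers the LLM's explanation names

def score_case(case: RCACase) -> dict:
    """Set-overlap precision/recall of predicted vs. annotated drivers."""
    tp = case.predicted_drivers & case.annotated_drivers
    precision = len(tp) / len(case.predicted_drivers) if case.predicted_drivers else 0.0
    recall = len(tp) / len(case.annotated_drivers) if case.annotated_drivers else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

case = RCACase("cmp_001", {"pricing", "audience"}, {"pricing", "channel"})
print(score_case(case))  # precision 0.5, recall 0.5, f1 0.5
```

This obviously requires mapping free-text explanations onto the taxonomy first (keyword matching or an LLM classifier), but it turns a subjective judgment into something both systems can be scored on identically.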
Open Challenges:
- Ground truth for RCA is subjective
- Multiple valid explanations possible
- Hard to quantify “quality” of insights
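On the "multiple valid explanations" problem: one way to sidestep forcing a single gold answer is to measure consistency directly, i.e. how much the identified driver sets agree across repeated runs (or paraphrased prompts) of the same scenario. A minimal sketch using mean pairwise Jaccard similarity (the run data here is invented):

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two driver sets (1.0 if both empty)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def consistency(runs: list) -> float:
    """Mean pairwise Jaccard similarity across repeated runs of one scenario."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

# Three runs of the same scenario, each yielding a set of identified drivers.
runs = [{"pricing", "audience"}, {"pricing"}, {"pricing", "channel"}]
print(round(consistency(runs), 3))  # 0.444
```

A model that gives a different story every time scores low here even if each individual story sounds plausible, which is exactly the failure mode we worry about in decision support.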
Would love inputs from folks who have evaluated LLMs in analytical/decision-support use cases.
Thanks!