u/Disastrous_Sock_254

Evaluating Fine-tuned LLM vs Enterprise LLM (GPT/Claude) for Marketing RCA - Framework Suggestions?

Hi everyone,

I’m working on a use case around root cause analysis (RCA) for marketing campaign performance, and we’re evaluating two different system architectures:

System 1: Enterprise LLM + RAG

- Model: GPT / Claude (API-based)

- Uses structured campaign data via RAG (SQL + semantic layer)

- No fine-tuning

- Relies on prompt engineering + retrieval
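To make System 1 concrete, here is a minimal sketch of the retrieval-to-prompt step, assuming retrieval returns structured campaign rows from the SQL + semantic layer. All field names and prompt wording are hypothetical illustrations, not your actual pipeline:

```python
# Sketch: assemble an RCA prompt from retrieved campaign rows before
# sending it to an API-based LLM (GPT/Claude). Field names ("campaign",
# "channel", "spend", "conversions") are assumptions for illustration.

def build_rca_prompt(question: str, rows: list[dict]) -> str:
    """Format retrieved campaign metrics into a grounded RCA prompt."""
    lines = []
    for r in rows:
        lines.append(
            f"- campaign={r['campaign']} channel={r['channel']} "
            f"spend={r['spend']} conversions={r['conversions']}"
        )
    context = "\n".join(lines)
    return (
        "You are a marketing analyst. Using only the data below, "
        "explain the most likely root causes of the performance change.\n\n"
        f"Campaign data:\n{context}\n\nQuestion: {question}"
    )

rows = [
    {"campaign": "spring_sale", "channel": "email", "spend": 1200, "conversions": 40},
    {"campaign": "spring_sale", "channel": "paid_search", "spend": 3000, "conversions": 25},
]
prompt = build_rca_prompt("Why did CPA rise week over week?", rows)
```

The "using only the data below" constraint matters for evaluation later: it lets you check grounding (did the model cite drivers present in the retrieved rows?) separately from reasoning quality.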

System 2: Fine-tuned Open Model + RAG

- Base model: evaluating models from Hugging Face

- Fine-tuned using LoRA on historical RCA cases

- Same RAG pipeline for campaign data
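For System 2, the LoRA run typically needs the historical RCA cases flattened into instruction/response pairs first. A hedged sketch, where the field names ("symptom", "metrics", "root_cause") are assumptions about how your cases might be stored:

```python
import json

# Sketch: convert historical RCA cases into instruction pairs suitable
# for supervised LoRA fine-tuning (one JSONL line per pair). The case
# schema here is hypothetical, not from the original post.

def to_instruction_pairs(cases: list[dict]) -> list[dict]:
    pairs = []
    for c in cases:
        pairs.append({
            "instruction": f"Diagnose the root cause: {c['symptom']}",
            "input": json.dumps(c["metrics"]),
            "output": c["root_cause"],
        })
    return pairs

cases = [{
    "symptom": "CTR dropped 30% on paid search",
    "metrics": {"ctr_prev": 0.042, "ctr_now": 0.029},
    "root_cause": "Ad copy fatigue; creative unchanged for 8 weeks.",
}]
pairs = to_instruction_pairs(cases)
# Each dict can then be written as one JSONL line for the trainer.
```

Keeping the raw metrics in the "input" field (rather than baking them into the instruction) mirrors how the model will see retrieved data at inference time through the shared RAG pipeline.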

Our Goal:

We want to determine which system performs better at:

- Generating accurate RCA explanations

- Identifying key drivers (pricing, audience, channel, etc.)

- Producing actionable insights

My question:

  1. Does this comparison framework make sense, or are we missing a better baseline?

  2. What would be a robust evaluation checklist for this use case?

So far, we are thinking along these lines:

- Accuracy of identified root causes

- Consistency across similar scenarios

- Business interpretability

- Latency & cost
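One way to operationalize that checklist is a weighted rubric per system output, with per-criterion scores coming from human raters or an LLM judge. A minimal sketch; the weights and the 1-5 scale are arbitrary placeholders you would calibrate with stakeholders:

```python
# Sketch: aggregate per-criterion rubric scores (1-5) into one
# comparable number per system. Weights are arbitrary placeholders.

WEIGHTS = {
    "root_cause_accuracy": 0.4,
    "consistency": 0.2,
    "interpretability": 0.2,
    "latency_cost": 0.2,
}

def rubric_score(scores: dict[str, float]) -> float:
    """Weighted mean of 1-5 ratings; assumes every criterion is rated."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Hypothetical ratings for the two systems on one RCA scenario:
system1 = {"root_cause_accuracy": 4, "consistency": 4,
           "interpretability": 5, "latency_cost": 3}
system2 = {"root_cause_accuracy": 4, "consistency": 3,
           "interpretability": 4, "latency_cost": 4}

s1, s2 = rubric_score(system1), rubric_score(system2)
```

Scoring both systems on the same scenario set (paired comparison) rather than on separate samples makes the accuracy and consistency numbers far more comparable.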

Open Challenges:

- Ground truth for RCA is subjective

- Multiple valid explanations possible

- Hard to quantify “quality” of insights
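For the "multiple valid explanations" problem, one partial workaround is to score driver overlap instead of exact text match: extract which drivers each answer names and compare that set against a reference set, so two differently-worded but equivalent RCAs score the same. A crude sketch; the fixed driver vocabulary and keyword matching are simplifying assumptions (embeddings or an LLM extractor would be more robust):

```python
# Sketch: Jaccard overlap between drivers named in a model's RCA answer
# and a reference driver set. The driver vocabulary is a made-up example.

DRIVERS = {"pricing", "audience", "channel", "creative", "seasonality"}

def extract_drivers(text: str) -> set[str]:
    """Naive substring match against a fixed driver vocabulary."""
    lowered = text.lower()
    return {d for d in DRIVERS if d in lowered}

def driver_jaccard(answer: str, reference: set[str]) -> float:
    found = extract_drivers(answer)
    if not found and not reference:
        return 1.0
    return len(found & reference) / len(found | reference)

score = driver_jaccard(
    "The drop traces to audience saturation and a channel mix shift.",
    {"audience", "channel", "pricing"},
)
```

The same extractor also gives you a cheap consistency check: run near-duplicate scenarios through each system and measure how stable the extracted driver sets are.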

Would love inputs from folks who have evaluated LLMs in analytical/decision-support use cases.

Thanks!

reddit.com