u/GlitteringNinja9367

How do AI engineers actually evaluate LLM/RAG systems in practice?

I’ve built multiple LLM/AI projects so far, but I realized I never properly learned how evaluation is actually done in real AI engineering workflows.

Recently I’ve been reading AI Engineering by Chip Huyen, and one thing that stood out was the idea that you should evaluate every layer of the system, not just the final output:

  • prompts
  • retrieval quality in RAG
  • chunking
  • reranking
  • hallucinations
  • latency/cost
  • end-to-end answer quality
  • AI-as-a-judge systems, etc.

What I’m confused about is how this is actually done in practice by engineers.

For example:

  • Do people usually create their own eval datasets?
  • Or do you use public benchmark datasets?
  • How do you evaluate retrieval quality specifically?
  • How are prompts compared systematically?
  • How much of evaluation is automated vs human review?
  • What tools/platforms are commonly used in industry right now?
  • Are frameworks like Ragas, DeepEval, LangSmith, TruLens, etc. actually used in production?
  • How do teams prevent regressions when changing prompts/models/chunking strategies?
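To make the retrieval-quality question concrete: is a hand-rolled check like the one below roughly what people mean by a retrieval eval, or do teams lean on frameworks for all of it? (This is purely my own sketch; `retrieve()` is a stand-in for whatever retriever the RAG pipeline uses, and the golden-set format is made up.)

```python
# Purely illustrative recall@k over a tiny hand-built "golden set".
# retrieve(query, k) is assumed to return the ids of the top-k retrieved chunks.

golden_set = [
    {"query": "What is the refund policy?", "relevant_ids": {"doc_12#3", "doc_12#4"}},
    {"query": "How do I reset my password?", "relevant_ids": {"doc_07#1"}},
]

def recall_at_k(retrieve, golden_set, k=5):
    scores = []
    for example in golden_set:
        retrieved_ids = set(retrieve(example["query"], k=k))
        overlap = retrieved_ids & example["relevant_ids"]
        scores.append(len(overlap) / len(example["relevant_ids"]))
    return sum(scores) / len(scores)
```

And for the regression question: is standard practice basically to freeze a golden set like this and rerun it (in CI or on a schedule) whenever the prompt, model, or chunking changes?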

I think I’m missing the “engineering mindset” around evaluation. Until now my approach has mostly been:

> the outputs look good enough

But I want to learn how people build reliable evaluation pipelines and iterate systematically.

Would really appreciate:

  • practical workflows
  • examples from real projects
  • beginner-friendly resources
  • advice on what I should build to learn this properly

Especially interested in RAG + agent evaluation.

Thanks!

reddit.com
u/GlitteringNinja9367 — 6 days ago

r/LearnDataAnalytics

I’m currently learning EDA properly and I’ve finished basic univariate analysis on a recommendation/e-commerce style dataset.

Now I’m moving into pairwise analysis (scatter plots, grouped summaries, correlations) and I’m confused about the strategy part.

How do you decide WHICH variable pairs are worth exploring?

Do experienced analysts:

  • systematically check almost every relationship first?
  • or only explore relationships that seem meaningful based on business intuition / earlier EDA? And how do you even decide what counts as meaningful?

I can think of many combinations, but I’m not sure whether good EDA is:

  1. broad exploration of everything
  2. or following a few promising signals deeply
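For reference, my current picture of option 1 is a broad scan like this (plain pandas; `df` and the column names are placeholders, not the exact code from my repo):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("orders.csv")  # placeholder load

# Broad scan: correlation matrix over numeric columns,
# then rank the strongest absolute correlations to pick pairs worth plotting.
corr = df.corr(numeric_only=True)
upper = np.triu(np.ones(corr.shape, dtype=bool))   # mask diagonal + duplicate pairs
top_pairs = (
    corr.where(~upper)
        .stack()
        .abs()
        .sort_values(ascending=False)
        .head(15)
)
print(top_pairs)

# Numeric-vs-categorical pairs via grouped summaries, e.g.:
# df.groupby("Category")["Quantity"].agg(["mean", "median", "count"])
```

But that is exactly where I'm unsure whether this shotgun approach is what experienced analysts actually do.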

Would love to hear how more experienced people approach this in real projects.

I have attached my EDA GitHub link below in case anybody wants to check it out before answering.

Github Link: https://github.com/Atharva22052006/Amazon_recommondation_engine

u/GlitteringNinja9367 — 7 days ago

Hey everyone! I'm teaching myself data analysis and ML by working through a real dataset. I'd love some guidance from people with more experience.

The dataset:

  • ~1.85M purchase records (Amazon order history)
  • ~5K users with survey/demographic data, linked via Survey ResponseID

What I've done so far:

EDA & consistency checks:

  • Identified 4 columns with null values: Shipping Address State, Title, ASIN/ISBN, and Category
  • Confirmed ASIN is the most reliable product identifier (~95% of titles map to a single ASIN; the exceptions are gift cards, clothing lines, and bulk items with multiple variants)
  • Converted Order Date to datetime
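(For clarity, the ASIN check was roughly the following; the file path is a placeholder and the column names are as they appear in the dataset.)

```python
import pandas as pd

df = pd.read_csv("amazon-purchases.csv")  # placeholder path

# How many distinct ASINs does each title map to?
asins_per_title = df.groupby("Title")["ASIN/ISBN"].nunique()
print(f"{(asins_per_title == 1).mean():.1%} of titles map to exactly one ASIN")

# The multi-ASIN titles are mostly gift cards, clothing lines, and multi-variant bulk items
print(asins_per_title.sort_values(ascending=False).head(10))

# Order Date -> datetime
df["Order Date"] = pd.to_datetime(df["Order Date"], errors="coerce")
```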

Imputation I've already done:

  • For Shipping Address State: used forward/backward fill within each user's orders. Went from 87K nulls → 24K remaining (those 24K belong to 62 users who never provided an address at all)
  • For Title ↔ ASIN: cross-filled using mode mapping in both directions
  • For Category: filled via ASIN → Category and Title → Category mappings
  • For Q-life-changes in the survey data: confirmed nulls mean "No" based on value distribution, filled accordingly
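In code, the fills above look roughly like this (simplified from my notebook, so treat it as a sketch rather than the exact implementation):

```python
import pandas as pd

# df is the purchases DataFrame from the EDA step above

def most_common(s: pd.Series):
    """Most frequent non-null value in a group, or None if the group is all null."""
    m = s.mode()
    return m.iloc[0] if not m.empty else None

# Shipping Address State: forward/backward fill within each user's order history
df = df.sort_values(["Survey ResponseID", "Order Date"])
df["Shipping Address State"] = df.groupby("Survey ResponseID")["Shipping Address State"].transform(
    lambda s: s.ffill().bfill()
)

# Title <-> ASIN: cross-fill using the most common mapping in each direction
df["ASIN/ISBN"] = df["ASIN/ISBN"].fillna(df["Title"].map(df.groupby("Title")["ASIN/ISBN"].agg(most_common)))
df["Title"] = df["Title"].fillna(df["ASIN/ISBN"].map(df.groupby("ASIN/ISBN")["Title"].agg(most_common)))

# Category: fill via ASIN -> Category, then Title -> Category
df["Category"] = df["Category"].fillna(df["ASIN/ISBN"].map(df.groupby("ASIN/ISBN")["Category"].agg(most_common)))
df["Category"] = df["Category"].fillna(df["Title"].map(df.groupby("Title")["Category"].agg(most_common)))
```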

Where I'm stuck: handling the remaining nulls across all 4 columns.

I know the standard advice is mean/median imputation, but all 4 of these columns are categorical/text so that doesn't apply. Here's where each one stands and what I'm considering:

  • ASIN/ISBN — After cross-filling with Title, whatever nulls remain have no recoverable identity. For a recommender, you can't really use a row if you don't know what was purchased. Leaning toward keeping these for EDA but dropping before modeling.
  • Title — Same situation as ASIN since I was cross-filling between the two. Same plan.
  • Category — Filled via ASIN and Title mappings already. Remaining nulls are products with genuinely no known category. Considering either dropping or using an "Unknown" placeholder, not sure which is better practice.
  • Shipping Address State — 24K rows from 62 users who never provided location data anywhere. These users still have valid purchase histories though. Since location probably isn't a core signal for a recommender anyway, I'm thinking of just leaving the address null and not using it as a feature, rather than dropping 24K rows.
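What I'm picturing for the "keep for EDA, drop before modeling" option is basically two views of the same frame, something like this (just illustrating the decision, not settled code):

```python
# EDA view: keep everything, nulls included
eda_df = df.copy()

# Modeling view: rows need a known product; Category gets an explicit placeholder;
# Shipping Address State stays nullable since it won't be used as a feature
model_df = df.dropna(subset=["ASIN/ISBN", "Title"]).copy()
model_df["Category"] = model_df["Category"].fillna("Unknown")
```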

General question on timing: Is it better to drop/handle nulls now, before doing more EDA, or keep everything and only clean up right before modeling? My instinct says to keep them during EDA because the other columns in those rows might still be useful, but I'm not sure if that's the right reasoning.

Dataset Link: https://www.kaggle.com/datasets/dharshinisraghunath/harvard-ecommerce-dataset-for-big-data-analysis

Github repo for what I have done till now: https://github.com/Atharva22052006/Amazon_recommondation_engine

I'm not looking for someone to solve it for me, just trying to understand the right thinking process. Appreciate any direction

reddit.com
u/GlitteringNinja9367 — 10 days ago
