u/Careless_Handle8112

Hi all,

I'd love to hear your take on how to evaluate a LangGraph graph, both offline during development and online in production.

A. Background

I recently built a POC with LangGraph that runs a complex workflow on long-form company documents. It takes quite a number of nodes to produce reasonably acceptable final outputs: content detection, reasoning, applying business knowledge, classification, structured output...
The final output has to be a nested JSON that combines the structured outputs of several worker nodes.

B. Challenges

As this is a new use case, there's no prior ground truth dataset. I need to bootstrap some high-level evaluation sets just for sampling and vibe-checking the final outputs.
Evaluating only the final outputs proves insufficient: an error in one intermediate node propagates downstream even when the other nodes are fine, so the final output alone doesn't show where things went wrong.
Designing test cases for the final outputs is challenging because of the highly nested structure, which is itself subject to change.
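
To make the nested-structure problem concrete, here is a minimal sketch of one way to compare a nested output against a reference field by field, so that a schema change only affects the paths that actually moved. It is plain Python with no LangGraph or MLflow calls; `flatten`, `compare`, and the sample documents are hypothetical names and data made up for illustration.

```python
from typing import Any, Dict, List


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into {"a.b[0].c": leaf_value} pairs.

    Note: empty dicts and empty lists produce no entries.
    """
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(value, path))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = obj
    return flat


def compare(expected: Any, actual: Any) -> Dict[str, List[str]]:
    """Report which leaf paths are missing, unexpected, or have different values."""
    exp, act = flatten(expected), flatten(actual)
    return {
        "missing": sorted(exp.keys() - act.keys()),
        "unexpected": sorted(act.keys() - exp.keys()),
        "mismatched": sorted(k for k in exp.keys() & act.keys() if exp[k] != act[k]),
    }


# Made-up example documents, not real outputs from the workflow.
expected = {"doc": {"type": "contract", "parties": ["A", "B"]}}
actual = {"doc": {"type": "contract", "parties": ["A"], "risk": "low"}}

report = compare(expected, actual)
print(report)
```

With a per-path report like this, a test case can assert on only the paths it cares about instead of on the whole JSON, which should make it less brittle when the structure changes.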

C. What I'm trying now:

Building custom wrappers to evaluate each node. The scorers can be LLM judges or code-based.
The evaluation process is similar to evaluating an MLflow model, where I can log the prompts, the evaluation metrics, datasets...
I can review the scorer results to gradually build a golden dataset for reference-based evaluation. This will unavoidably take effort from the business side. If I have 10 LLM nodes, I'd need 10 evaluation datasets. Only the first few nodes, at best, can take advantage of the business input; the rest may need custom inputs for their test cases.
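
A minimal sketch of what such a per-node wrapper could look like, assuming each node is a plain callable that takes a state dict and returns a dict of updates. Everything here (`evaluate_node`, `classify_node`, the scorers, the test cases) is hypothetical and framework-free; an LLM judge would simply be another scorer function, and the resulting rows could be logged to whatever tracking tool is in use.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]
# A scorer receives the test case and the node's output and returns a score.
Scorer = Callable[[State, State], float]


def evaluate_node(node: Node, cases: List[State], scorers: Dict[str, Scorer]) -> List[Dict[str, Any]]:
    """Run one node in isolation on each case and apply every scorer to its output."""
    rows = []
    for case in cases:
        output = node(case)
        row: Dict[str, Any] = {"input": case, "output": output}
        for name, scorer in scorers.items():
            row[name] = scorer(case, output)
        rows.append(row)
    return rows


def classify_node(state: State) -> State:
    """Hypothetical stand-in for an LLM classification node."""
    label = "contract" if "agreement" in state["text"].lower() else "other"
    return {"label": label}


def valid_label(case: State, output: State) -> float:
    """Code-based scorer: the label must be one of the allowed values."""
    return 1.0 if output.get("label") in {"contract", "other"} else 0.0


def matches_expected(case: State, output: State) -> float:
    """Reference-based scorer: compares against the golden label stored in the case."""
    return 1.0 if output.get("label") == case.get("expected") else 0.0


# Made-up test cases; "expected" plays the role of the golden answer.
cases = [
    {"text": "This Agreement is made between two parties.", "expected": "contract"},
    {"text": "Quarterly newsletter for staff.", "expected": "other"},
    {"text": "Signed lease for the office.", "expected": "contract"},
]

results = evaluate_node(
    classify_node,
    cases,
    {"valid_label": valid_label, "matches_expected": matches_expected},
)
accuracy = sum(row["matches_expected"] for row in results) / len(results)
print(f"accuracy: {accuracy:.2f}")
```

Rows that a reviewer confirms as correct can then be promoted into the golden set for that node, which is the gradual bootstrapping described above.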

D. My questions:

I can see some merit in node-based evaluation, but I also foresee a lot of effort in repeating it for every node. A node's logic or output structure may change, so its evaluation logic and golden set are subject to change too, which adds more effort. Do you think it's a worthwhile idea?
Is there a more efficient approach to graph evaluation?
Am I overlooking or missing anything?