I'm building a stress test workflow to benchmark document extraction – here's what I'm testing
👋 Hey VibeCoders,
Over the past few weeks I've been sharing workflows that use document extraction for things like currency conversion, invoice classification, duplicate detection, and Slack-based approvals. One question that keeps coming up – from me and from people trying these workflows – is: how far can you push the extraction before it breaks?
Clean PDFs are easy. Every solution handles those. But what about a scanned invoice with coffee stains? A photo taken at an angle? A completely different layout than what the pipeline was trained on? A document that looks like someone used it as a coaster, scribbled notes all over it, and then left it in the rain?
I wanted to answer that properly, so I'm building a stress test workflow.
The idea:
Upload a document through a web form, extract the data, compare every single field against the known correct values, and get a results page with a per-field pass/fail breakdown and an overall accuracy percentage. Since the test always uses the same invoice data, the ground truth is fixed – you're purely measuring how well the extraction handles degraded quality and layout changes.
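The comparison step can be sketched in a few lines. This is a minimal illustration of the idea, not the actual workflow code – the field names and values are made up, and the normalisation is deliberately simple:

```python
# Hypothetical sketch: per-field comparison against a fixed ground truth.
# Field names and values below are invented for illustration.

GROUND_TRUTH = {
    "invoice_number": "INV-2024-001",
    "vendor": "Acme GmbH",
    "total": "1499.00",
    "due_date": "2024-08-31",
}

def normalize(value):
    """Light normalisation so trivial formatting differences don't count as failures."""
    return str(value).strip().lower()

def score_extraction(extracted):
    """Compare every field; return per-field pass/fail plus overall accuracy (%)."""
    results = {}
    for field, expected in GROUND_TRUTH.items():
        got = extracted.get(field)
        results[field] = got is not None and normalize(got) == normalize(expected)
    accuracy = 100 * sum(results.values()) / len(results)
    return results, accuracy

# Example: one misread field out of four -> 75% accuracy
extracted = {
    "invoice_number": "INV-2024-001",
    "vendor": "ACME GmbH ",     # casing/whitespace differences still pass
    "total": "1499,00",         # OCR read a comma instead of a dot -> fail
    "due_date": "2024-08-31",
}
fields, acc = score_extraction(extracted)
print(fields, acc)
```

Because the ground truth is hard-coded, a perfect extraction always scores 100%, and every point lost is attributable to a specific field.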
The test documents I'm preparing:
I'm going to run four versions of the same invoice through the workflow:
- Original – clean PDF, the baseline. Should be 100%.
- Layout Variant A – same data, completely different visual layout.
- Layout Variant B – another layout, different structure again.
- Version 7 ("The Survivor") – this one has coffee stains, pen annotations ("WRONG ADDRESS? check billing!"), scribbled-out sections, burn marks, and a circled-over amount due field. If anything can extract data from this, I'll be impressed.
I spent some time thinking about what makes a good stress test. Different layouts test whether the extraction actually reads the document or just memorises positions. The destroyed version tests OCR resilience when half the text is obstructed. Together they should give a pretty honest picture of where a solution actually stands.
What's coming next week:
I'm going to build out the full workflow, run all four documents through it, and share the results here – accuracy percentages across every version, including the destroyed one. I'll also share the workflow JSON, so anyone can import it and run their own benchmarks.
The workflow will be solution-agnostic too – you'll be able to swap out the extraction node for an HTTP Request node pointing at any other API, and the entire validation chain works identically. Good way to benchmark different tools side by side.
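To show why the swap is cheap, here's a rough sketch of the shape I mean: the validation only ever sees a dict of fields, so any extractor that returns one can be plugged in. The endpoint URL and response format below are placeholders, not a real API:

```python
# Sketch of the solution-agnostic idea: validation is decoupled from extraction.
# The HTTP endpoint and its JSON response shape are assumptions for illustration.
import json
import urllib.request

def http_extractor(pdf_bytes, endpoint):
    """Call an arbitrary extraction API; assumes it returns JSON mapping field -> value."""
    req = urllib.request.Request(
        endpoint, data=pdf_bytes, headers={"Content-Type": "application/pdf"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def run_benchmark(extractor, documents, ground_truth):
    """Run every test document through one extractor and collect accuracy per document."""
    scores = {}
    for name, doc in documents.items():
        extracted = extractor(doc)
        matched = sum(
            str(extracted.get(f, "")).strip().lower() == str(v).strip().lower()
            for f, v in ground_truth.items()
        )
        scores[name] = 100 * matched / len(ground_truth)
    return scores

# Swapping tools means swapping one callable, e.g.:
# scores = run_benchmark(
#     lambda d: http_extractor(d, "https://example.test/extract"),  # placeholder URL
#     documents, ground_truth,
# )
```

In the n8n version, that callable is just whichever extraction or HTTP Request node you wire in; everything downstream stays identical.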
Curious to see where it breaks. Would love to hear if anyone else has been stress testing their extraction setups, or if you have ideas for even nastier test documents.
Best,
Felix