u/jasperc_6 — reddlx

Deepseek v4 is better for rag pipeline debugging than claude opus

I have been optimizing a RAG system with 12 different embedding models and retrieval strategies. Initially I used claude opus 4.7 thru the anthropic api for the analysis but hit walls when diagnosing performance bottlenecks across the full pipeline. The task was tracing how retrieval failures in one component cascade thru the system: embedding mismatches affecting chunk relevance, which degrades reranking, which throws off context assembly.

I needed to see the entire pipeline as interconnected failure modes. Opus analyzed each component well individually but treated them as isolated issues instead of modeling cascade effects. I then switched to deepseek via the deepinfra api with the same logs and metrics, and this time deepseek mapped the full system and showed how embedding model A's poor performance on technical jargon triggered downstream reranker failures, causing context window pollution and creating feedback loops that opus had missed. The multi component analysis captured interdependencies that opus didn't quite hold simultaneously.

Opus still wins on code, no doubt about that, but for tracing failure propagation across complex multi stage pipelines deepseek's analytical depth on interconnected system behaviour is much stronger. When debugging cross component issues where one failure triggers three others, deepseek identified the root cause faster, usually pointing to the upstream component.
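The "point to the upstream component" idea can be sketched without any model at all. This is a minimal illustration, assuming a hypothetical stage graph with made-up stage names (nothing here comes from the actual pipeline): given the stages that are failing, keep only the ones that are not downstream of another failing stage.

```python
from collections import deque

# Hypothetical stage graph: each stage maps to the stages that consume its output.
PIPELINE = {
    "embedding": ["retrieval"],
    "retrieval": ["reranking"],
    "reranking": ["context_assembly"],
    "context_assembly": [],
}


def downstream_of(graph, stage):
    """Return every stage reachable from `stage` (its blast radius)."""
    seen = set()
    queue = deque(graph[stage])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph[node])
    return seen


def root_causes(graph, failing):
    """Failing stages that are not downstream of any other failing stage."""
    failing = set(failing)
    explained = set()
    for stage in failing:
        explained |= downstream_of(graph, stage) & failing
    return sorted(failing - explained)
```

For example, `root_causes(PIPELINE, ["reranking", "context_assembly", "embedding"])` keeps only `"embedding"`. The sketch assumes the graph has no cycles, so it does not model the feedback loops mentioned above.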

Ran both models on the same 2 week diagnostic log spanning 8 million requests. Opus produced 14 isolated, per-component recommendations while deepseek produced 6 system level changes that addressed interaction failures. Implemented deepseek's suggestions first, and that fixed 11 of the 14 issues opus had flagged.

Anyone else using multiple models for their RAG debugging? Interested in hearing which model combinations you've found work best for multi-component failure analysis.

reddit.com

u/jasperc_6 — 3 days ago

▲ 1 r/n8n

Vendor invoice reconciliation for our team

Our AP process was a total mess. Small ops team, 40-60 vendor invoices coming in across mixed formats (some as PDF, some as scanned images, some as docx), vendors using different abbreviations for the same company name, and line item descriptions that never quite matched what was on the original PO. Finance was spending 3-4 hrs/week just on manual reconciliation before flagging the exceptions.

The pain point was the three way matching process: every invoice needs to reconcile against the original purchase order and the goods receipt before payment gets approved. When you're doing that by hand across inconsistent document formats, the error rate rises. Industry data puts manual matching at missing 10-15% of discrepancies, and we were actually hitting that. So we built an n8n pipeline to automate it. The problem before any matching logic could run was that the agent doing the cross validation had no reliable data to reason over. Vendor invoices arrive in various formats and vendors use different field names for the same values, so the llm would likely have hallucinated if we passed this data in raw. We added a parsing layer before the claude/deepseek reasoning model so that the data is parsed and cleaned before the model sees it.

An incoming invoice lands in a designated google drive folder, n8n triggers and fetches it, and llamaparse extracts structured fields against a pydantic schema: vendor name, invoice number, line items, unit prices, quantities, total amount, tax etc. The same extraction runs against the corresponding PO, and then the cross validation agent compares the three documents and flags the exceptions. The agent reasons over clean typed fields, not raw document content, which is what makes the matching more reliable across different vendors and their inputs.

The extraction schema for each invoice:

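from pydantic import BaseModel

# LineItem is not shown in the post; these fields are an assumption.
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float

class VendorInvoice(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str  # kept as a string, as extracted
    po_reference: str
    line_items: list[LineItem]
    subtotal: float
    tax_amount: float
    total_amount: float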

The parser returns confidence scores per field. Low confidence on a vendor name or PO number flags the document for human review before the matching agent runs. That layer alone catches extraction uncertainty before it becomes a wrong validation decision later on.
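A minimal sketch of that review gate, assuming the per-field confidence scores have already been pulled into a plain dict. The threshold value, the list of critical fields, and the dict shape are all assumptions for illustration, not the parser's actual output format.

```python
# Hypothetical cutoff and field list; the post does not state the real values.
REVIEW_THRESHOLD = 0.75
CRITICAL_FIELDS = ("vendor_name", "po_reference")


def fields_needing_review(confidences, threshold=REVIEW_THRESHOLD,
                          critical=CRITICAL_FIELDS):
    """Return critical fields whose confidence is missing or below threshold."""
    # A missing score is treated as 0.0 so it always triggers review.
    return [name for name in critical if confidences.get(name, 0.0) < threshold]
```

If the returned list is non-empty, the document would go to a human before the matching agent sees it. Treating a missing score as zero is a deliberately cautious choice.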

Real exceptions from the first week in use:

MISMATCH [vendor_name]: invoice = "TechSupplies Inc" | PO = "Tech Supplies Incorporated" | confidence: 0.71
MISMATCH [unit_price]: invoice line 4 = $84.50 | PO line 4 = $79.00 | variance: +6.9% above threshold
MISMATCH [quantity]: invoice qty = 150 | goods receipt qty = 142 | delta: 8 units unaccounted
FLAG [duplicate]: INV-2024-0891 matches prior submission INV-2024-0812 same vendor same amount $4,200 eleven days apart
CONFIDENCE LOW [po_number]: 0.62 extracted value "PO-2024-1847" uncertain manual review required

Here, the vendor name abbreviation mismatch was the most common problem: same supplier, different format across invoice and PO. Previously these got waved thru manually because a human would recognize the match.
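One way to take the obvious cases off the reviewer's plate is to compare normalized keys instead of raw names. This is a rough sketch only; the suffix map is a hypothetical starting point and the function name is made up for illustration.

```python
import re

# Hypothetical suffix map; extend it with the legal forms your vendors use.
SUFFIXES = {
    "incorporated": "inc",
    "corporation": "corp",
    "limited": "ltd",
    "company": "co",
}


def normalize_vendor(name):
    """Reduce a vendor name to a comparison key."""
    # Lowercase and turn punctuation into spaces.
    text = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    # Collapse long legal suffixes to their short form.
    tokens = [SUFFIXES.get(token, token) for token in text.split()]
    # Join without spaces so "TechSupplies" and "Tech Supplies" agree.
    return "".join(tokens)
```

With this, "TechSupplies Inc" and "Tech Supplies Incorporated" produce the same key. Dropping spaces can make two genuinely different vendors collide, so it is probably safer to log a normalized match than to pass it silently.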

Batch size is 1 per loop iteration in n8n. Processing speed matters less than consistent matching and we can't take any risks here, so we kept it at 1 at a time. Flagged exceptions route to slack with the specific fields attached so the reviewer sees exactly what needs to be checked.

Still working on tolerance thresholds, currently flagging above 5% price variance. If anyone else has dealt with such inconsistent inputs from vendors or other sources, please let me know what acceptable variance you use before raising an exception.
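For reference, the price and quantity checks can be sketched as below. The function names are hypothetical and the 5% tolerance is the one mentioned above; `Decimal` is used so currency values don't pick up float rounding noise. The sketch assumes the PO price is non-zero.

```python
from decimal import Decimal

PRICE_TOLERANCE = Decimal("0.05")  # the 5% threshold mentioned above


def price_variance(invoice_price, po_price):
    """Signed relative difference of the invoice price against the PO price."""
    invoice_price, po_price = Decimal(invoice_price), Decimal(po_price)
    return (invoice_price - po_price) / po_price


def exceeds_tolerance(invoice_price, po_price, tolerance=PRICE_TOLERANCE):
    """True when the price differs from the PO by more than the tolerance."""
    return abs(price_variance(invoice_price, po_price)) > tolerance


def quantity_delta(invoice_qty, received_qty):
    """Units invoiced but not covered by the goods receipt."""
    return invoice_qty - received_qty
```

Prices are passed as strings, e.g. `exceeds_tolerance("84.50", "79.00")`, which comes out just under 7% and is flagged. `quantity_delta(150, 142)` gives the 8 unaccounted units from the quantity mismatch example.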


u/jasperc_6 — 8 days ago

▲ 10 r/Rag

most rag issues blamed on embeddings or the llm trace to a chunking strategy locked in during setup and never revisited

small chunks lose context, large chunks bury the answer. fixed size chunking respects neither because document structure never aligns with token boundaries.

what actually works here:

semantic chunking that follows document structure, using headings, sections and paragraphs as natural boundaries instead of arbitrary token counts
hierarchical indexing for long docs: summary chunks for broad questions, detail chunks for specific ones
chunk overlap helps at the margins but doesn't fix a bad strategy
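a minimal sketch of the structure-following idea, assuming the docs use markdown-style `#` headings (that input format and the function name are assumptions for illustration). each chunk keeps its heading path, which can also serve as a key for a hierarchical index.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown_text):
    """Split text into one chunk per section, keeping the heading path."""
    chunks, path, body = [], [], []

    def flush():
        # Emit the text collected under the current heading, if any.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then add this one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

long sections would still need a second split on paragraphs, which this sketch leaves out.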

the practical audit before locking in any config: print the retrieved chunks for 20 real queries and read them. if the answer is consistently split across two chunks, the chunk size is too small. if the answer is buried in unrelated content, the chunk size is too large.
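a crude helper for tallying that audit, not a replacement for reading the chunks. it assumes you know the exact answer text for each query and that `retrieved` is a list of chunk strings in document order. the `buried_ratio` cutoff and the function name are made up for illustration.

```python
def audit_chunks(answer, retrieved, buried_ratio=0.1):
    """Classify one query's retrieval as 'ok', 'buried', 'split', or 'missing'."""
    # Whole answer inside a single chunk: check how much of the chunk it fills.
    for chunk in retrieved:
        if answer in chunk:
            share = len(answer) / len(chunk)
            return "buried" if share < buried_ratio else "ok"
    # Answer only appears when two neighbouring chunks are joined.
    for left, right in zip(retrieved, retrieved[1:]):
        if answer in left + " " + right:
            return "split"
    return "missing"
```

run it over the 20 queries and count the labels: mostly "split" points at chunks that are too small, mostly "buried" at chunks that are too large. exact substring matching will miss paraphrased answers and overlapping chunks, so treat the labels as a hint only.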

most teams set this once and spend months tuning everything downstream instead of going back to fix the root problem.


u/jasperc_6 — 14 days ago