u/Fabulous-Pea-5366

I built a customer support AI for a German compliance company that auto-serves invoices and deflects 39.5% of queries. Here's the architecture

Just got the first week of production data back from a customer support AI system I built for a German compliance company. 39.5% deflection rate across 43 conversations. Want to break down the architecture because support chatbots get a bad reputation and most of it is deserved.

The system handles email and chat conversations. When a customer reaches out, the system does three things:

1. Intent classification. Determines what the customer actually wants. The current intent categories are: termination, onboarding, invoice requests, legal advice, general questions, technical issues, integration questions, GDPR questions, and account management. This classification drives what happens next.

2. Outcome routing. Based on the intent and the system's confidence in handling it, the conversation gets routed to one of four outcomes:

  • Deflected (39.5%): AI resolves the query completely
  • Invoice served (19%): system automatically pulls and delivers the requested invoice
  • Ticket created (19%): complex query gets escalated to a human agent
  • Collecting info (16%): system is still gathering details before routing

3. Response generation. For deflectable queries the system generates a response grounded in the company's actual documentation and policies. Not generic FAQ answers. Actual answers sourced from their knowledge base.
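The three steps above could be wired together roughly like this. This is a sketch, not the production code: the outcome names come from the post, but the intent names, confidence threshold, and function signature are my assumptions.

```python
from enum import Enum

class Outcome(Enum):
    DEFLECTED = "deflected"
    INVOICE_SERVED = "invoice_served"
    TICKET_CREATED = "ticket_created"
    COLLECTING_INFO = "collecting_info"

# Hypothetical confidence threshold; the post doesn't state the actual value.
CONFIDENCE_THRESHOLD = 0.8

# Intents the system may resolve autonomously (illustrative subset).
AUTOMATED_INTENTS = {"invoice_request", "general_question", "onboarding"}
# Intents that always go to a human.
ESCALATE_INTENTS = {"legal_advice", "termination"}

def route(intent: str, confidence: float, has_required_info: bool) -> Outcome:
    """Route a classified conversation to one of the four outcomes."""
    if not has_required_info:
        return Outcome.COLLECTING_INFO
    if intent in ESCALATE_INTENTS:
        return Outcome.TICKET_CREATED
    if intent == "invoice_request" and confidence >= CONFIDENCE_THRESHOLD:
        return Outcome.INVOICE_SERVED
    if intent in AUTOMATED_INTENTS and confidence >= CONFIDENCE_THRESHOLD:
        return Outcome.DEFLECTED
    # Err on the side of escalation when confidence is low.
    return Outcome.TICKET_CREATED
```

The last line is the important design choice: anything that doesn't clear the bar becomes a ticket, which is what keeps the deflection number honest.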

What makes this work better than most support chatbots:

The intent classification isn't just keyword matching. The system understands that "I want to stop my subscription" and "how do I cancel" and "we're discontinuing the service" are all termination intents even though they share almost no words.

The escalation logic errs on the side of creating tickets. If the system isn't confident it can fully resolve the query it escalates rather than giving a bad answer. This is why the deflection rate is 39.5% and not some inflated 80% number. Every deflected conversation is a genuinely resolved query.

The invoice serving is fully automated. Customer asks for an invoice, system identifies the intent, pulls the relevant invoice, and delivers it. This single feature handles 19% of all conversations without any human involvement.

Average response time is 28 seconds. For comparison, the same query handled by a human agent involves reading the email, looking up the customer, finding the relevant information, and composing a response. Even a fast agent takes 5-10 minutes.

The interesting part is that this runs alongside an internal RAG system I built for the same client. Their team has AI handling customer-facing support AND AI handling internal legal research. The humans focus on the work that actually requires human judgment: complex legal analysis, sensitive customer conversations, strategic decisions.

Week one data is a small sample (43 conversations) but the deflection rate and intent distribution give a good baseline for tuning. The main optimization targets are improving deflection on onboarding questions and general queries where the system is currently creating tickets it could probably handle.

reddit.com
u/Fabulous-Pea-5366 — 22 hours ago

I built two AI systems for the same client: one for their team, one for their customers. Here's the combined ROI

I've been posting about building AI systems for a German compliance company. Most of the discussion was about the internal research tool I built for their legal team. But I also built them a customer-facing support system and I just got the first week of real data back.

Quick context on both systems:

System 1: Internal legal research assistant. Their compliance team searches through 60+ legal documents (court decisions, GDPR guidelines, authority opinions). Before the system they'd spend 30-45 minutes per research question manually searching PDFs. Now they type a question in plain language and get cited answers in under a minute.

System 2: Customer support chatbot. Their clients (businesses that use their compliance service) send questions via email and chat. Things like invoice requests, onboarding questions, GDPR questions, termination requests, technical issues.

Here's the first week of real data from the support system:

  • 43 total conversations handled
  • 39.5% deflection rate, meaning 17 of 43 customer queries were resolved fully by the AI without any human touching them
  • 19% of conversations resulted in invoices being served automatically
  • Only 19% needed a human agent (ticket created)
  • Average response time: 28 seconds

The deflection rate is the number that matters most. Every deflected conversation is time a support agent didn't have to spend reading an email, looking up information, and typing a response. Even if each interaction only takes 10-15 minutes manually, at 17 deflected conversations in one week that's roughly 3-4 hours of support labor saved. Per week. From day one.
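The arithmetic behind that estimate:

```python
deflected = 17                 # conversations resolved by the AI in week one
minutes_per_query = (10, 15)   # rough manual handling time range per query

low, high = (deflected * m / 60 for m in minutes_per_query)
print(f"{low:.1f}-{high:.1f} hours saved")  # → 2.8-4.2 hours saved
```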

And this is week one. The system is still learning which queries it handles well and which need human escalation. Deflection rates typically improve as you tune the intent detection and add more knowledge to the system.

What I find interesting is the intent distribution. The top category is termination requests (customers wanting to cancel). Those are sensitive conversations that probably need a human touch in most cases. But simpler stuff like invoice requests and general questions get handled automatically without any quality drop from the customer's perspective.

The combination of both systems is where the real value is for the client. Their internal team spends less time on research. Their support team spends less time on repetitive customer queries. The AI handles the routine stuff on both sides so the humans can focus on work that actually requires their expertise.

If you're building AI systems for clients, think about this dual approach. Most companies have both an internal knowledge problem AND a customer-facing support problem. Solving both makes you way harder to replace than solving just one.

reddit.com
u/Fabulous-Pea-5366 — 22 hours ago


I built a system where senior lawyers can correct the AI's knowledge by leaving comments on documents. here's why it matters more than better embeddings

When I built an AI research assistant for a law firm, the feature I thought would be a nice-to-have turned out to be the one they use most.

The system has an annotation feature. Any user can select text in a document and leave a comment. Something like "this interpretation was overruled by ruling X in 2024" or "this applies only to NRW, not nationally" or "our firm's position differs, see internal memo Y."

Technically here's what happens. Comments are stored in PostgreSQL linked to the document ID, page number, and selected text. When a query comes in, the system does two things. First it fetches comments attached to the specific documents that were retrieved by vector search. Second it fetches ALL comments across ALL documents regardless of what was retrieved. Both get injected into the LLM's context.

The second part is important. If a senior lawyer annotated document A saying "this is outdated" but the query only retrieved documents B and C, the system still sees that annotation through the global comments injection. The cache refreshes every 60 seconds so new comments are picked up almost immediately.
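A rough sketch of that flow, assuming a generic `db.query` accessor returning dict-shaped rows (the names are mine, not the production schema):

```python
import time

CACHE_TTL = 60  # seconds; the post says new comments appear within ~60s
_cache = {"comments": None, "fetched_at": 0.0}

def get_all_comments(db):
    """Every annotation across all documents, cached for CACHE_TTL seconds."""
    now = time.monotonic()
    if _cache["comments"] is None or now - _cache["fetched_at"] > CACHE_TTL:
        # `db.query` is a stand-in for the PostgreSQL access layer.
        _cache["comments"] = db.query(
            "SELECT document_id, page, selected_text, body FROM comments"
        )
        _cache["fetched_at"] = now
    return _cache["comments"]

def build_context(db, retrieved_chunks):
    """Inject document-scoped AND global annotations into the LLM context."""
    retrieved_ids = {c["document_id"] for c in retrieved_chunks}
    comments = get_all_comments(db)
    # Split only for ordering: comments on non-retrieved documents are still
    # included, so "doc A is outdated" surfaces even when A wasn't retrieved.
    local = [c for c in comments if c["document_id"] in retrieved_ids]
    other = [c for c in comments if c["document_id"] not in retrieved_ids]
    lines = ["EXPERT ANNOTATIONS (authoritative; prioritize over document text):"]
    for c in local + other:
        lines.append(f"- [doc {c['document_id']}, p.{c['page']}] {c['body']}")
    lines.append("RETRIEVED CONTEXT:")
    lines.extend(chunk["text"] for chunk in retrieved_chunks)
    return "\n".join(lines)
```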

The prompt tells the model to treat these annotations as authoritative expert notes and to prioritize them when they contradict the document text.

Why this matters more than I initially thought:

Legal knowledge goes stale. A court ruling from 2022 might be superseded by a 2024 decision. Without the annotation system you'd need to re-ingest documents, update metadata, maybe re-chunk everything. With annotations a senior lawyer just writes "superseded by X" and the system incorporates that knowledge on the next query. No engineering work needed.

It also captures institutional knowledge that doesn't exist in any document. Things like "our firm interprets this more conservatively than the standard reading" or "client X has specific requirements around this clause." That kind of knowledge lives in senior lawyers' heads and normally gets lost when they retire or leave.

The legal team started using it within the first week without any training. They were already used to annotating PDFs with comments. This just made those comments searchable and part of the AI's knowledge base.

If you're building RAG for any domain where expert interpretation matters (legal, medical, financial, academic), consider building an annotation layer. Better embeddings and fancier retrieval will improve your baseline. But letting domain experts directly correct and enrich the AI's knowledge is a multiplier that no amount of model improvement can replicate.

reddit.com
u/Fabulous-Pea-5366 — 2 days ago

My AI system kept randomly switching to French mid-answer and it took me way too long to figure out why

I built a RAG system that needs to answer in German or English depending on the query language. Sounds simple. It was not.

The source documents are mostly in German but some contain French legal terminology, Latin phrases, and occasional English citations. What kept happening was that the LLM would start answering in German, hit a French passage in the context, and just... switch to French mid-paragraph. Sometimes it would blend German and French in the same sentence. Once it answered entirely in Italian and I still have no idea why.

I tried letting the LLM detect the query language itself. Unreliable. It would sometimes decide the query was in French because the user mentioned a French court case by name.

What actually worked was a dumb regex detector. I check the query for common German words (der, die, das, und, ist, nicht, mit, für, Datenschutz, Verletzung, etc.). If enough German markers are present, the response language is forced to German. Otherwise English. No fancy language detection library. Just pattern matching.
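A minimal version of that detector. The word list is seeded from the post; the exact list and the hit threshold here are illustrative, not the production values:

```python
import re

# Common German function words plus a couple of domain terms.
GERMAN_MARKERS = {
    "der", "die", "das", "und", "ist", "nicht", "mit", "für",
    "eine", "ein", "wie", "datenschutz", "verletzung",
}
MIN_HITS = 2  # hypothetical threshold for declaring the query German

def detect_language(query: str) -> str:
    """Force the response language based on marker-word counts."""
    words = re.findall(r"\w+", query.lower(), flags=re.UNICODE)
    hits = sum(1 for w in words if w in GERMAN_MARKERS)
    return "German" if hits >= MIN_HITS else "English"
```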

Then in the prompt I added a hard constraint: "Write your entire answer ONLY in {language}. Output must be German or English only. Never French, Spanish, Italian, or any other language. If the retrieved context is partly in another language, translate your answer into {language} only."

The "never French" part is doing heavy lifting. Without that explicit prohibition the model would drift back into French within a few days of testing. It's like the model sees French legal text in context and thinks "oh we're doing French now."

Anyone else building multilingual RAG systems running into this? The language contamination from source documents was the most annoying bug I dealt with and I've seen almost nobody write about it.

reddit.com
u/Fabulous-Pea-5366 — 2 days ago


Every law firm I talk to has the same problem and none of them have solved it

Since posting about the AI research system I built for a German law firm I've been having conversations with lawyers in different countries. The pattern is identical everywhere.

The problem: firms accumulate years of internal knowledge in documents. Court decisions, case files, internal memos, regulatory guidance, client correspondence. This knowledge is incredibly valuable. But nobody can efficiently search it.

When a new question comes in, junior associates dig through folder structures trying to find relevant precedents. They search by filename. They ask senior colleagues "didn't we handle something like this before?" They spend 30-60 minutes finding what they need when the answer exists somewhere in documents the firm already has.

The irony is these firms sell their expertise by the hour but waste enormous amounts of billable time on internal knowledge retrieval.

What's interesting is why nobody has solved this for most firms:

  • Big legal tech companies (Westlaw, LexisNexis) focus on external legal databases, not internal firm knowledge.
  • Generic AI tools don't understand the legal authority hierarchy. A ChatGPT wrapper treats a blog post and a Supreme Court ruling with equal weight. Lawyers can't trust that.
  • Most firms don't have internal tech teams. They rely on IT support for email and printers. Nobody is building custom AI tools.
  • The firms that do have tech teams are building for client-facing products, not internal knowledge management.

This creates a massive gap. The firms need custom AI systems built for their specific documents with their specific domain requirements. But there's almost nobody offering that service because developers don't think to target law firms and law firms don't know what to ask for.

The thing I didn't expect is how much of the architecture carries over. The authority hierarchy logic, the citation enforcement, the jurisdictional tagging, the annotation layer. Most of it isn't specific to one firm. A different compliance team or law firm would have the same structural needs just with their own documents and maybe slightly different authority tiers.

I spent most of the project solving problems I thought were unique to this client but turned out to be universal to how legal professionals work with documents. That realization changed how I think about what I actually built.

reddit.com
u/Fabulous-Pea-5366 — 3 days ago

5 things I'd do differently if I rebuilt my legal AI system from scratch

I shipped a RAG system for a German law firm a few months ago. It's in production and working well. But looking back there are things I'd approach differently on the next build.

1. I'd instrument token usage from day one.

Someone asked me in the comments how many tokens I'm burning per search. I didn't have a precise answer because I didn't build token logging into the system from the start. The response metadata tracks chunk-level token counts but I'm not logging full prompt token counts per query. This matters for cost projections when pitching to new clients. "Roughly reasonable" is not a good answer when a managing partner asks about ongoing costs.

2. I'd build retrieval quality monitoring before going to production.

Right now quality monitoring is essentially client feedback. The legal team tells me when something is wrong. That works at small scale but it's fragile. I should have built automated evaluation: logging query-response pairs, tracking citation accuracy rates, flagging when the system says "no relevant information found" (which might indicate a retrieval gap). Anyone have good approaches for automated RAG quality eval in production? Genuinely asking.

3. I'd charge at least 3x more.

I charged €2,700. Multiple people in different communities told me this should have been €8,000-15,000. They're right. The system saves the firm far more than that per month in recovered associate time. Undercharging didn't just cost me money, it also positioned me as cheaper and less specialized than I am. The €1,300/month maintenance partially makes up for it but the upfront should have been higher.

4. I'd set up a proper demo environment earlier.

For my next prospects I want to be able to say "here, try it with sample documents." Having a live demo with anonymized legal documents would dramatically shorten the sales cycle. Right now I describe what the system does. Showing is always better than telling.

5. I'd standardize the authority tier configuration.

The 8-tier authority hierarchy (internal opinion > high court > low court > authority opinion > guidelines > literature > real world context > general content) was somewhat custom to this client. But talking to other lawyers, most legal systems follow a very similar hierarchy. I should have built it as a configurable module from the start rather than encoding it into the prompts. That way each new client can adjust the tiers through an admin panel without touching the prompt templates.
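As a sketch, the configurable version might look like this. The tier names come from the post; the function names and the admin-panel wiring are assumptions:

```python
# Per-client tier configuration, loaded from a DB or admin panel
# instead of being hard-coded into the prompt templates.
DEFAULT_TIERS = [
    "internal opinion",
    "high court",
    "low court",
    "authority opinion",
    "guidelines",
    "literature",
    "real world context",
    "general content",
]

def build_hierarchy_instruction(tiers=DEFAULT_TIERS):
    """Render the client's tier order into a prompt fragment."""
    ranked = "\n".join(f"{i}. {t}" for i, t in enumerate(tiers, start=1))
    return (
        "Synthesize top-down by source authority. When sources conflict, "
        "the higher tier wins:\n" + ranked
    )

def tier_rank(category: str, tiers=DEFAULT_TIERS) -> int:
    """Lower rank = higher authority; unknown categories sort last."""
    try:
        return tiers.index(category)
    except ValueError:
        return len(tiers)
```

With this shape, a new client edits a list instead of a prompt template, and the prompt fragment is regenerated from their list.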

None of these are system-breaking issues. The product works and the client is happy. But if I'd done these five things from the start the system would be more scalable, more monitorable, and I'd have more money in my pocket.

reddit.com
u/Fabulous-Pea-5366 — 3 days ago

I almost lost a client because my AI system cited a lower court ruling as if it came from the Supreme Court

I build AI systems for professional services firms. During testing of a legal research assistant I built for a German law firm, one of the senior lawyers flagged something that could have been a serious problem.

The system was asked about a specific GDPR interpretation. It returned a correct answer but attributed a lower court's more expansive interpretation to the higher court. Essentially it said "the EuGH (European Court of Justice) ruled that X" when actually X was the position of a regional labor court. The EuGH's actual position was more conservative.

In a normal chatbot this is a minor accuracy issue. In legal work this is potentially dangerous. A lawyer reading that output might advise a client based on what they think is a Supreme Court ruling when it's actually just one regional court's interpretation. The legal weight of those two sources is completely different.

What went wrong technically: the LLM had context from multiple authority levels and when synthesizing the answer it grabbed the clearest phrasing rather than the highest authority position. The lower court happened to explain the concept in more accessible language. The higher court's ruling used denser legal terminology. The LLM essentially optimized for clarity over accuracy of attribution.

How I fixed it:

  • Added explicit prompt instructions requiring the LLM to check which category section a document belongs to before attributing it. "A finding from [Category: High court decision] must be attributed to the high court, not to a lower court."
  • Added a requirement that when courts at different levels disagree, both positions must be presented separately with correct attribution. No flattening into consensus.
  • Added specific examples in the prompt showing correct vs incorrect attribution so the LLM has a reference pattern to follow.
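Paraphrased as a prompt fragment, those three fixes might look something like this (my wording, not the production prompt):

```python
ATTRIBUTION_RULES = """\
Before attributing any finding, check the [Category: ...] tag of the chunk
it came from.

- A finding from [Category: High court decision] must be attributed to the
  high court, never to a lower court (and vice versa).
- When courts at different levels disagree, present both positions
  separately with correct attribution. Do not merge them into a consensus.

Correct:   "The EuGH established X. The ArbG Oldenburg took a broader view."
Incorrect: "The EuGH ruled that X alone is sufficient."
"""
```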

After these changes the system correctly presents something like: "The EuGH established that X requires conditions A, B, and C. However, the ArbG Oldenburg (regional labor court) has taken a broader position, holding that condition A alone may be sufficient. This represents a divergence from the higher court's framework."

The senior lawyer who caught this was actually impressed that we fixed it within a day. He said most legal tech tools he's evaluated don't handle authority attribution at all, they just return text without any awareness of which court said what.

This experience taught me that in high-stakes domains, the subtle errors are more dangerous than the obvious ones. A hallucinated answer is easy to spot. A correctly sourced answer with wrong attribution looks credible and that's exactly what makes it dangerous.

reddit.com
u/Fabulous-Pea-5366 — 3 days ago

Why I stopped using pure vector search for legal documents and switched to authority-weighted retrieval

I've been building RAG systems for about a year and recently shipped one for a German law firm that taught me something I wish I'd known earlier. Standard vector similarity ranking is actively dangerous for legal use cases.

Here's what I mean. In a basic RAG setup you embed the query, find the most semantically similar chunks, stuff them into context, and ask the LLM to synthesize an answer. This works great for general knowledge bases where all sources are roughly equal in reliability.

In legal work, sources are absolutely not equal. A Supreme Court ruling carries more weight than a regional court opinion. A regulatory authority's official guideline is more authoritative than a law review article. An internal expert annotation from a senior partner should override all of these for the firm's purposes.

The problem is that cosine similarity doesn't know any of this. A well-written blog post about GDPR might score higher similarity to the query than the actual court ruling on the same topic simply because the blog uses more natural language while the ruling uses dense legal terminology.

I watched this happen in testing. Asked the system about data breach notification requirements. The top retrieved chunks were from a professional literature source that used very clear, query-friendly language. The actual binding court decision that established the definitive interpretation was ranked 4th because legal German is dense and formal.

If the system builds its answer primarily from the professional literature and only briefly mentions the court decision, a lawyer reading that answer gets a subtly wrong picture of the legal landscape.

So I built three retrieval strategies:

Flat is the baseline. Standard RAG. All sources equal. Used this as a comparison baseline and it's still useful for simple factual lookups where authority doesn't matter.

Category Priority groups the retrieved chunks by their document category (high court, low court, authority opinion, guideline, literature, etc) and the prompt template explicitly tells the LLM to synthesize top-down starting from the highest authority. When sources conflict, higher authority wins. When lower courts take a more expansive position than higher courts, both positions must be presented separately. This was the single biggest quality improvement.

Layered Category runs a separate vector search per category. This guarantees that every authority level gets representation in the final context even if one category dominates similarity scores. Without this, a corpus heavy in professional literature (which tends to be well-written and semantically rich) can crowd out the sparser but more authoritative court decisions.

The category metadata comes from the documents themselves. When documents are uploaded the client tags them with category, jurisdiction, date, and framework. This metadata gets enriched during retrieval so the LLM sees something like "[Chunk from: EuGH C-300/21 | category: High court decision | region: EU | date: 2023-12-14]" before the actual content.
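A minimal sketch of the Layered Category strategy plus the metadata enrichment, assuming a generic `search(query, metadata_filter, k)` vector-store call (the function name and chunk fields are placeholders for whatever your store exposes):

```python
CATEGORIES = [
    "high court", "low court", "authority opinion", "guideline", "literature",
]

def layered_retrieve(search, query: str, per_category_k: int = 3):
    """One vector search per category, so sparse-but-authoritative sources
    always get representation in the final context."""
    chunks = []
    for cat in CATEGORIES:
        chunks.extend(search(query, {"category": cat}, per_category_k))
    return chunks

def enrich(chunk: dict) -> str:
    """Prefix each chunk with its metadata so the LLM sees the authority level."""
    header = (
        f"[Chunk from: {chunk['source']} | category: {chunk['category']} "
        f"| region: {chunk['region']} | date: {chunk['date']}]"
    )
    return header + "\n" + chunk["text"]
```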

The prompt engineering was the other half of the battle. I have explicit negative instructions preventing the LLM from doing things like:

  • Citing "according to professional literature" without naming the specific document
  • Writing "(Kategorie: High court decision)" as an inline citation instead of the actual court name
  • Attributing a finding to the wrong authority level (e.g. claiming a lower court said something that was actually from a higher court)
  • Flattening divergent positions into false consensus

Each of these negative instructions was added because I caught the LLM doing exactly that thing during testing.

The takeaway for anyone building domain-specific RAG: think carefully about whether your sources have an inherent reliability hierarchy. If they do, standard vector similarity ranking will mislead your users in ways that are hard to detect without domain expertise.

reddit.com
u/Fabulous-Pea-5366 — 6 days ago

Why I stopped using pure vector search for legal documents and switched to authority-weighted retrieval

I've been building RAG systems for about a year and recently shipped one for a German law firm that taught me something I wish I'd known earlier. Standard vector similarity ranking is actively dangerous for legal use cases.

Here's what I mean. In a basic RAG setup you embed the query, find the most semantically similar chunks, stuff them into context, and ask the LLM to synthesize an answer. This works great for general knowledge bases where all sources are roughly equal in reliability.

In legal work, sources are absolutely not equal. A Supreme Court ruling carries more weight than a regional court opinion. A regulatory authority's official guideline is more authoritative than a law review article. An internal expert annotation from a senior partner should override all of these for the firm's purposes.

The problem is that cosine similarity doesn't know any of this. A well-written blog post about GDPR might score higher similarity to the query than the actual court ruling on the same topic simply because the blog uses more natural language while the ruling uses dense legal terminology.

I watched this happen in testing. Asked the system about data breach notification requirements. The top retrieved chunks were from a professional literature source that used very clear, query-friendly language. The actual binding court decision that established the definitive interpretation was ranked 4th because legal German is dense and formal.

If the system builds its answer primarily from the professional literature and only briefly mentions the court decision, a lawyer reading that answer gets a subtly wrong picture of the legal landscape.

So I built three retrieval strategies:

Flat is the baseline. Standard RAG. All sources equal. Used this as a comparison baseline and it's still useful for simple factual lookups where authority doesn't matter.

Why I stopped using pure vector search for legal documents and switched to authority-weighted retrieval

I've been building RAG systems for about a year and recently shipped one for a German law firm that taught me something I wish I'd known earlier. Standard vector similarity ranking is actively dangerous for legal use cases.

Here's what I mean. In a basic RAG setup you embed the query, find the most semantically similar chunks, stuff them into context, and ask the LLM to synthesize an answer. This works great for general knowledge bases where all sources are roughly equal in reliability.

In legal work, sources are absolutely not equal. A Supreme Court ruling carries more weight than a regional court opinion. A regulatory authority's official guideline is more authoritative than a law review article. An internal expert annotation from a senior partner should override all of these for the firm's purposes.

The problem is that cosine similarity doesn't know any of this. A well-written blog post about GDPR might score higher similarity to the query than the actual court ruling on the same topic simply because the blog uses more natural language while the ruling uses dense legal terminology.

I watched this happen in testing. Asked the system about data breach notification requirements. The top retrieved chunks were from a professional literature source that used very clear, query-friendly language. The actual binding court decision that established the definitive interpretation was ranked 4th because legal German is dense and formal.

If the system builds its answer primarily from the professional literature and only briefly mentions the court decision, a lawyer reading that answer gets a subtly wrong picture of the legal landscape.

So I built three retrieval strategies:

Flat is the baseline: standard RAG with all sources treated equally. I kept it as a comparison point, and it's still useful for simple factual lookups where authority doesn't matter.

Category Priority groups the retrieved chunks by their document category (high court, low court, authority opinion, guideline, literature, etc) and the prompt template explicitly tells the LLM to synthesize top-down starting from the highest authority. When sources conflict, higher authority wins. When lower courts take a more expansive position than higher courts, both positions must be presented separately. This was the single biggest quality improvement.
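The grouping step itself is simple; a minimal sketch in Python, where the category names and rank values are illustrative assumptions rather than the production code:

```python
# Hypothetical authority ranking; lower number = higher authority.
AUTHORITY_RANK = {
    "high_court": 0,
    "low_court": 1,
    "authority_opinion": 2,
    "guideline": 3,
    "literature": 4,
}

def order_by_authority(chunks):
    """Sort retrieved chunks top-down by document category, so the
    prompt presents the highest authority level first."""
    return sorted(chunks, key=lambda c: AUTHORITY_RANK.get(c["category"], 99))

chunks = [
    {"category": "literature", "text": "A commentary passage"},
    {"category": "high_court", "text": "A binding ruling"},
    {"category": "guideline", "text": "A regulator's guidance"},
]
ordered = order_by_authority(chunks)
# The high court chunk now leads the context window.
```

The prompt template then walks this ordered list, which is what lets the synthesis instructions say "start from the highest authority" and mean it literally.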

Layered Category runs a separate vector search per category. This guarantees that every authority level gets representation in the final context even if one category dominates similarity scores. Without this, a corpus heavy in professional literature (which tends to be well-written and semantically rich) can crowd out the sparser but more authoritative court decisions.
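Conceptually, layered retrieval is just one search per category with the results merged afterwards. A minimal sketch, with a stub `search` standing in for the real vector store (in production this would be a vector query with a metadata filter on category):

```python
def search(query, category, k):
    """Stub for a per-category vector search over a toy corpus."""
    corpus = {
        "high_court": ["ruling A", "ruling B"],
        "literature": ["article A", "article B", "article C"],
    }
    return corpus.get(category, [])[:k]

def layered_retrieve(query, categories, k_per_category=2):
    """Run a separate search per category so every authority level
    is represented, even if one category dominates similarity scores."""
    results = []
    for cat in categories:
        for text in search(query, cat, k_per_category):
            results.append({"category": cat, "text": text})
    return results

hits = layered_retrieve("notification duty", ["high_court", "literature"])
# Both categories appear in the merged context, two chunks each.
```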

The category metadata comes from the documents themselves. When documents are uploaded the client tags them with category, jurisdiction, date, and framework. This metadata gets enriched during retrieval so the LLM sees something like "[Chunk from: EuGH C-300/21 | category: High court decision | region: EU | date: 2023-12-14]" before the actual content.
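Rendering that provenance header is trivial but worth standardizing so the LLM always sees the same shape. A sketch, with field names assumed from the example above:

```python
def chunk_header(meta):
    """Render the provenance header the LLM sees before each chunk."""
    return (f"[Chunk from: {meta['source']} | category: {meta['category']}"
            f" | region: {meta['region']} | date: {meta['date']}]")

header = chunk_header({
    "source": "EuGH C-300/21",
    "category": "High court decision",
    "region": "EU",
    "date": "2023-12-14",
})
# → "[Chunk from: EuGH C-300/21 | category: High court decision | region: EU | date: 2023-12-14]"
```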

The prompt engineering was the other half of the battle. I have explicit negative instructions preventing the LLM from doing things like:

  • Citing "according to professional literature" without naming the specific document
  • Writing "(Kategorie: High court decision)" as an inline citation instead of the actual court name
  • Attributing a finding to the wrong authority level (e.g. claiming a lower court said something that was actually from a higher court)
  • Flattening divergent positions into false consensus
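In practice these live in the system prompt as explicit prohibitions. A sketch of what that section might look like — the wording here is illustrative, not the production prompt:

```python
# Hypothetical negative-instruction block appended to the system prompt.
NEGATIVE_INSTRUCTIONS = """\
Rules you must never violate:
- Never cite "professional literature" generically; always name the specific document.
- Never use a category label like "(Kategorie: High court decision)" as a citation; cite the actual court and case number.
- Never attribute a finding to a different authority level than its source header states.
- When sources diverge, present each position separately; never merge them into a false consensus.
"""

def build_system_prompt(base_instructions):
    """Combine the task instructions with the hard prohibitions."""
    return base_instructions + "\n" + NEGATIVE_INSTRUCTIONS

prompt = build_system_prompt("You are a legal research assistant.")
```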

Each of these negative instructions was added because I caught the LLM doing exactly that thing during testing.

The takeaway for anyone building domain-specific RAG: think carefully about whether your sources have an inherent reliability hierarchy. If they do, standard vector similarity ranking will mislead your users in ways that are hard to detect without domain expertise.

reddit.com
u/Fabulous-Pea-5366 — 6 days ago

A lawyer asked me how to build an AI research assistant for their own practice. here's the honest starting point

After my post about building a RAG system for a German law firm I got DMs from two lawyers and a compliance officer asking how they could build something similar for their own practice.

The honest answer is it depends on what you mean by "build."

If you want a basic version that works for personal research, you can get something running in a weekend. If you want a production system that a whole firm trusts for client work, that's a different conversation.

Here's how I'd think about it at each level:

Level 1: Personal research tool (1-2 days)

Take your documents, upload them to a vector database, wire up an LLM to answer questions with retrieval. You can do this with LangChain and FAISS in maybe 200 lines of Python. It will work okay for simple lookups. It will not handle conflicting sources well. It will not cite properly. It will hallucinate sometimes. But for quick personal research where you double-check everything anyway, it's useful.
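The core retrieve-then-answer loop is small enough to sketch without any framework. This toy version uses bag-of-words overlap in place of a real embedding model (which is where LangChain/FAISS or a sentence-embedding model would actually slot in), purely to show the shape of level 1:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' so the sketch runs standalone;
    a real build would use a sentence-embedding model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    scored = sorted(docs, key=lambda d: cosine(embed(query), embed(d)), reverse=True)
    return scored[:k]

docs = [
    "Termination notice must be given in writing.",
    "Invoices are issued monthly.",
    "Data breaches must be reported within 72 hours.",
]
context = retrieve("how fast must a data breach be reported", docs)
# The breach document ranks first; the retrieved context would then
# be stuffed into the LLM prompt along with the question.
```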

Level 2: Team tool with decent quality (2-4 weeks)

This is where you need to care about chunking strategy. Legal documents can't be chunked naively; you need structure-aware parsing that respects sections and subsections. You need metadata on every document (jurisdiction, date, source type, authority level). You need citation enforcement in the prompts. You need bilingual handling if you work across languages. This is roughly what I built.
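To make "structure-aware" concrete, here's a minimal sketch that splits German-style statute text at § section headings and keeps the section label as chunk metadata. The regex and text format are assumptions for illustration; real filings need a proper parser:

```python
import re

def chunk_by_section(text):
    """Split statute-like text at '§ N' headings so each chunk stays
    within one section and carries its section label as metadata."""
    parts = re.split(r"(?m)^(?=§\s*\d+)", text)
    chunks = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        m = re.match(r"§\s*\d+[a-z]?", part)
        chunks.append({"section": m.group(0) if m else None, "text": part})
    return chunks

sample = "§ 1 Scope\nThis law applies to...\n§ 2 Definitions\nFor the purposes of..."
chunks = chunk_by_section(sample)
# Two chunks, labeled "§ 1" and "§ 2", each containing a whole section.
```

The point is that every chunk boundary coincides with a structural boundary, so no clause is ever split across chunks mid-thought.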

Level 3: Production system a firm would bet on (2-3 months)

Everything in level 2 plus access controls, audit logging, retrieval quality monitoring, automated testing, proper error handling, data backup, compliance documentation, and ongoing maintenance. This is where most solo builders underestimate the scope.

Most people asking me this question are at level 1 thinking it gets them to level 3. It doesn't. The gap between "I asked ChatGPT a question with some context" and "the entire firm trusts this for client-facing work" is enormous.

The biggest differences between a demo and production:

  • Citation accuracy. A demo can say "according to legal guidelines." Production must cite the exact document name and article number or it's worthless.
  • Source hierarchy. A demo treats all documents equally. Production needs to know that a high court ruling outweighs a law review article.
  • Failure handling. A demo hallucinates and nobody notices. Production hallucinates and someone sends wrong legal advice to a client.

If you're a lawyer wanting to build level 1 for yourself, go for it. It's a great learning project and useful for daily research.

If you want level 2 or 3 for your firm, you either need to invest serious development time or hire someone who's done it before. That's not gatekeeping, it's just the reality of what production quality requires in a high-stakes domain.

Happy to answer specific technical questions if you're getting started.

reddit.com
u/Fabulous-Pea-5366 — 6 days ago

The gap between legal AI marketing and what actually works in production is wild

I'm a developer who recently built an AI research system for a compliance firm in Europe. Not a SaaS product, just a custom internal tool for one firm. Wanted to share some observations from the experience because the disconnect between how legal AI gets marketed and what actually matters in practice was eye-opening.

The biggest thing I underestimated was citation accuracy. Every legal AI demo I've seen shows a chatbot returning nice-looking answers. Nobody talks about the fact that the AI will confidently attribute a regional court's position to the Supreme Court if you don't specifically engineer against it. I caught this during testing and it took weeks of prompt engineering to get source attribution reliable. Stuff like the model writing "according to professional literature" instead of citing the specific document, or flattening two conflicting court positions into one answer as if there's consensus when there isn't.

The authority hierarchy problem is something I've never seen addressed in any legal AI product marketing. In practice, a high court ruling carries fundamentally different weight than a lower court opinion or a guideline or a law review article. Standard AI retrieval treats them all equally because it just ranks by text similarity. A well-written blog post can outrank an actual binding court decision because the blog uses more natural language. That's dangerous in a way that's hard to detect without domain expertise.

The other thing that surprised me was how much the lawyers cared about regional jurisdiction handling versus how little most AI tools account for it. In Germany you have 16 federal states with variations in how regulations get applied. Documents need to be tagged by jurisdiction and the system needs to flag when something is state-specific vs nationally applicable. None of the generic tools I evaluated before building custom handled this at all.
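The flagging logic itself can be as simple as comparing a document's region tag against the user's state. A sketch, where the `region` field and region codes are hypothetical:

```python
# Regions treated as applying everywhere (hypothetical tag values).
NATIONAL = {"DE", "EU"}

def jurisdiction_note(doc, query_state):
    """Flag whether a retrieved document is nationally applicable or
    specific to a federal state, relative to the user's state."""
    region = doc["region"]
    if region in NATIONAL:
        return "nationally applicable"
    if region == query_state:
        return f"specific to {region} (matches your state)"
    return f"specific to {region} (may not apply in {query_state})"

note = jurisdiction_note({"region": "Bayern"}, "Hamburg")
# → "specific to Bayern (may not apply in Hamburg)"
```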

On the positive side, once the system actually worked properly with accurate citations and authority awareness, adoption within the firm was faster than I expected. The associates who were skeptical became the heaviest users because it genuinely cut their research time from 30-45 minutes per question down to a few minutes.

Curious if others here have had similar experiences with legal AI tools, either good or bad. The space seems to be moving fast but the quality gap between what's marketed and what's actually production-ready feels massive.

reddit.com
u/Fabulous-Pea-5366 — 6 days ago

How are firms handling RAG accuracy for internal document search? Running into some interesting challenges

I've been working on a retrieval augmented generation setup for internal legal document search (case files, memos, regulatory docs) and wanted to get some perspective from others who have dealt with this.

A few things I've been running into that I'm curious if others have solved differently:

  • Chunking legal documents is a nightmare compared to normal text. Nested clauses, footnotes, cross-references between sections... naive chunking just shreds the context. Has anyone found a reliable approach for structured legal docs specifically?
  • Pure vector search has been disappointing for legal text. The language is so precise and keyword dependent that semantic search alone misses a lot. Hybrid approach with BM25 on top has been working better but curious what others are seeing
  • Citation/sourcing seems like the make or break feature. The lawyers I've talked to won't even look at a system that doesn't show exactly which document section an answer came from. Anyone built a good UX pattern for inline citations that actually works in practice?
  • Data privacy concerns are dominating every conversation before anyone even evaluates the tech. Especially with EU based firms where GDPR adds another layer. What infrastructure setups are people using to get past this objection?
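On the hybrid point above: the blending can be sketched in a few lines. This toy version substitutes crude term overlap for BM25 (a real build would use a proper BM25 implementation such as rank_bm25) and assumes semantic scores are precomputed and normalized:

```python
from collections import Counter

def keyword_score(query, doc):
    """Crude term-overlap score standing in for BM25."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def hybrid_rank(query, docs, semantic_scores, alpha=0.5):
    """Blend a precomputed semantic score (assumed in [0, 1]) with a
    keyword score; returns document indices, best first."""
    def score(i):
        kw = keyword_score(query, docs[i])
        kw_norm = kw / (kw + 1)  # squash raw overlap into [0, 1)
        return alpha * semantic_scores[i] + (1 - alpha) * kw_norm
    return sorted(range(len(docs)), key=score, reverse=True)

docs = ["Art. 33 DSGVO breach notification", "general privacy overview"]
order = hybrid_rank("Art. 33 notification", docs, semantic_scores=[0.4, 0.6])
# The exact-keyword match outranks the semantically closer document.
```

This is exactly the failure mode keyword signals rescue: precise legal terms like article numbers carry the meaning, and pure semantic search washes them out.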

Would love to hear from anyone who has deployed something like this in a real firm environment. What worked, what didn't, what would you do differently. Most of the RAG content out there is generic tutorial stuff and not really specific to legal workflows.

reddit.com
u/Fabulous-Pea-5366 — 9 days ago