r/LangChain

How are you guys safely giving agents API access without giving them "God Mode"? (The OAuth 'All-or-Nothing' trap)

We’ve been building multi-agent orchestration systems with LangGraph, and binding tools to agents is incredibly easy. But the moment we try to connect those tools to a user's sensitive data in production, the standard OAuth model completely breaks down.

Take a Gmail integration: If I want a LangChain agent to simply draft an email reply, Google’s standard OAuth forces me to request scopes that also grant the permission to Send and Delete emails. It’s an all-or-nothing trap.

System prompts are not a real security boundary, and Human-in-the-loop defeats the purpose of autonomous background tasks.

After 13 years of building enterprise SaaS, I got so frustrated by this that our team stopped building the agentic app itself and started building the infrastructure to fix it. We are engineering an Agent Access Security Broker (AASB)—a B2B proxy layer that sits between the agent's tool calls and the user's data so developers can enforce strict boundaries (like a hard "Draft-Only" lock).

Before we go deeper into this architecture, I want to know how the LangChain community is currently hacking around this.

  • Are you rolling your own custom middleware to intercept tool calls?
  • Restricting scopes at the API gateway level?
  • Or just relying on HITL?
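To make the middleware option concrete, here is a minimal sketch of a tool-call interceptor enforcing a hard "draft-only" lock. All names here (ALLOWED_ACTIONS, intercept, draft_email) are hypothetical illustrations, not LangChain APIs:

```python
# Hypothetical policy table: tool name -> set of permitted actions.
ALLOWED_ACTIONS = {"gmail": {"draft"}}  # hard "draft-only" lock

def intercept(tool_name: str, action: str, call, *args, **kwargs):
    """Refuse any tool call whose action falls outside the allow-list."""
    if action not in ALLOWED_ACTIONS.get(tool_name, set()):
        raise PermissionError(f"{tool_name}.{action} blocked by policy")
    return call(*args, **kwargs)

def draft_email(to: str, body: str) -> str:
    # Stand-in for a real Gmail tool; it only ever creates a draft.
    return f"DRAFT to {to}: {body}"
```

The same check can live in a proxy in front of the API instead of in-process, which is essentially the broker idea above.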

Would love to hear your approaches.

reddit.com
u/Apart_Mix990 — 22 minutes ago
We're running a 4-week hackathon series with $4,000 in prizes, open to all skill levels!

Most hackathons reward presentations. Polished slides, rehearsed demos, buzzword-heavy pitches. You can win without shipping anything real.

We're not doing that.

The Locus Paygentic Hackathon Series is 4 weeks, 4 tracks, and $4,000 in total prizes. Each week starts fresh on Friday and closes the following Thursday, then the next track kicks off the day after. One week to build something that actually works.

Week 1 sign-ups are live on Devfolio.

The track: build something using PayWithLocus. If you haven't used it, PayWithLocus is our payments and commerce suite. It lets AI agents handle real transactions, not just simulate them. Your project should use it in a meaningful way.

Here's everything you need to know:

  • Team sizes of 1 to 4 people
  • Free to enter
  • Every team gets $15 in build credits and $15 in Locus credits to work with
  • Hosted in our Discord server

We built this series around the different verticals of Locus because we want to see what the community builds across the stack, not just one use case, but four, over four consecutive weeks.

If you've been looking for an excuse to build something with AI payments or agent-native commerce, this is it. Low barrier to entry, real credits to work with, and a community of builders in the server throughout the week.

Drop your team in the Discord and let's see what you build.

discord.gg/locus | paygentic-week1.devfolio.co

u/IAmDreTheKid — 23 hours ago

Scanned 577 open-source AI agent repos. 86% have serious bugs. The main issue isn't prompt injection...

Spent a few weeks scanning every AI agent repo on GitHub with 20+ stars, LangChain, CrewAI, AutoGen, pydantic-ai, MCP servers, n8n, all of it. 

Expected prompt injection everywhere. Found something more boring: infinite loops. 5,397 of them. Plan → act → observe cycles with no max iterations, no timeout, no kill switch. 

Other top findings: missing audit logging (1,371), missing rate limits (938), unsafe exec/eval (573).

Full data + methodology: inkog.io/report

Honest question to the people actually shipping agents here: do you set a max_iterations when you ship? A timeout? Or is "just let it run and hope" the default?  
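For what it's worth, LangGraph lets you cap cycles by passing a recursion_limit in the invoke config, and raises GraphRecursionError when it trips. Framework aside, the guard itself is a few lines; the names below are illustrative:

```python
import time

MAX_ITERATIONS = 25      # hard cap on plan -> act -> observe cycles
TIMEOUT_SECONDS = 60.0   # wall-clock kill switch

def run_agent_loop(step, is_done, state=None):
    """Run step() until is_done(state), a max-iteration cap, or a timeout."""
    start = time.monotonic()
    for _ in range(MAX_ITERATIONS):
        if time.monotonic() - start > TIMEOUT_SECONDS:
            raise TimeoutError(f"agent exceeded {TIMEOUT_SECONDS}s")
        state = step(state)
        if is_done(state):
            return state
    raise RuntimeError(f"agent hit {MAX_ITERATIONS} iterations without finishing")
```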

u/Revolutionary-Bet-58 — 3 hours ago

We're considering moving our production agent to LangChain from Google ADK. Thoughts?

Our general concern is that agents built with LangChain/LangGraph, or even the OpenAI Agents SDK, seem to have better latency than our Google ADK agents.

I understand that a lot of latency comes down to infra. We are on a fairly standard stack (Railway + Python, Supabase + SQLAlchemy, Vercel + Next.js), so unless I'm missing something huge about our infra, we're thinking about migrating.

I would love the thoughts of people who have built with both, though.

u/mee-gee — 2 hours ago

Looking for people to build AI agents.

Hello guys. I am a software developer with 1 YOE, working on a side project: an AI agent. I have only built a proof of concept so far. I am looking for someone truly passionate and at least a little skilled.

The plan is an agent that takes user input like "plan a trip to Goa under 20k", extracts the details from the query, and keeps asking for missing details until it has everything. It then fills in all the details and calls the appropriate tools, like fetch_flights and fetch_weather for those dates.

The agent keeps a human in the loop throughout: it asks for confirmations, and the human can interject at any point, for example raising the budget from 20k to 30k, and the agent adjusts the rest of the plan accordingly.

I have already built mock tools, which will help us finish quickly; we can integrate real tools later.
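A rough sketch of the slot-filling loop described above, using mock tools; the field names and helpers are invented for illustration:

```python
REQUIRED_SLOTS = ["destination", "budget", "dates"]  # illustrative fields

def missing_slots(filled: dict) -> list:
    return [s for s in REQUIRED_SLOTS if s not in filled]

def plan_trip(filled: dict, ask) -> dict:
    """Keep asking the human (HITL) until every detail is supplied,
    then call the mock tools with the completed slots."""
    while missing_slots(filled):
        slot = missing_slots(filled)[0]
        filled[slot] = ask(slot)  # the human stays in the loop here
    return {
        "flights": fetch_flights(filled["destination"], filled["dates"]),
        "weather": fetch_weather(filled["destination"], filled["dates"]),
    }

# Mock tools, to be swapped for real integrations later.
def fetch_flights(dest, dates): return f"flights to {dest} on {dates}"
def fetch_weather(dest, dates): return f"weather in {dest} on {dates}"
```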

This is one project idea I have; I am open to better ideas if anyone has one.
Let's discuss in the comments and build something that will shine on our resumes and maybe become a SaaS later.

Skills preferred:

  • FastAPI (or any backend framework)
  • LangChain, LangGraph, LangSmith
  • System design skills (most important)

u/loop_seeker — 7 hours ago

Karis CLI vs LangChain for production automation: a practical comparison

I've built production agents with LangChain and I've been testing Karis CLI. Here's my honest comparison for "boring but real" automation tasks.

LangChain is flexible, lots of integrations, but the abstraction layers can make debugging painful. When something goes wrong in a chain, it's hard to know which layer failed.

Karis is more opinionated (a 3-layer architecture), but the layers are explicit: runtime tools are just code, orchestration is planning, task management is state. Failures are easier to diagnose.

For exploration and prototyping, LangChain's flexibility is nice. For production automation that needs to be reliable and auditable, Karis CLI's structure is more comfortable.

I'm not saying one is better—they're different tools for different stages. But if you're tired of debugging LangChain chains, Karis CLI's explicit layers might be a relief.

u/Larry_Potter_ — 3 hours ago
A lightweight hallucination detector for RAG (catches contradictions without an LLM-as-a-judge)


Hey everyone,

If you’re building RAG apps, you’ve probably hit this wall: your retrieval is perfect, you feed the right context to the LLM, but the LLM still subtly misrepresents the facts in its final answer.

Evaluating this usually sucks. You either have to rely on expensive LLM-as-a-judge APIs (like sending it back to GPT-4 to check itself) or deal with bulky evaluation frameworks that are hard to run locally.

To solve this, we just open-sourced LongTracer. It's a lightweight Python package that checks the LLM's response against your retrieved documents and flags any hallucinated claims—all locally, without API keys.

How simple it is to use:

You just pass in the LLM's answer and your source documents:

Python

from longtracer import check

result = check(
    "The Eiffel Tower is 330m tall and located in Berlin.",
    ["The Eiffel Tower is in Paris, France. It is 330 metres tall."]
)

print(result.verdict)             # FAIL
print(result.hallucination_count) # 1

If you use LangChain, you can instrument your whole pipeline in one line:

Python

from longtracer import LongTracer, instrument_langchain

LongTracer.init(verbose=True)
instrument_langchain(your_chain) 

Why we built it this way:

  • No API Costs: It runs small, local NLP models to verify facts, so you don't have to pay just to check if your bot is lying.
  • Zero Infrastructure: It takes plain text strings. No need to hook it up to your vector database.
  • Automatic Logging: It automatically logs all traces and hallucination metrics to SQLite (default), Mongo, or Postgres.

It also comes with a CLI to generate HTML reports of your pipeline runs.

It’s MIT licensed and available via pip install longtracer.

The code and architecture details are on GitHub if you want to test it on your pipelines: https://github.com/ENDEVSOLS/LongTracer

We are actively looking for feedback on how to make this more useful for production workflows, so let me know what you think!

u/UnluckyOpposition — 19 hours ago

Long-running agents keep forgetting the boring rules

Most of my pain is not getting an agent workflow to work once. It is getting the same workflow to behave on day two.

The failure mode I keep seeing is guardrail decay. Early runs respect the boring stuff: file boundaries, tool order, retry limits, no-write zones. Then the chain accumulates summaries, patches, and little bits of self-generated context. It still completes tasks. It just starts making slightly bolder choices each cycle. Nothing dramatic. A skipped check here. An unnecessary tool call there. Then a cron wakes up to a workflow that technically ran but drifted far enough to be unsafe.

Longer prompts did not fix it. More memory made it worse. The best results so far came from pinning non-negotiable rules outside the live context, hashing config between runs, and forcing each step to re-read the narrow state it actually needs instead of the whole story.
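The config-hashing part is cheap to implement; a minimal sketch (names are illustrative):

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Stable hash of the guardrail config; compare across runs to detect drift."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_guardrails(config: dict, expected_hash: str) -> None:
    """Halt the run before any tool call if the pinned rules have drifted."""
    if config_fingerprint(config) != expected_hash:
        raise RuntimeError("guardrail config drifted between runs; halting")
```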

I still have not found a clean way to stop compressed history from laundering bad assumptions into the next cycle.

How are you all catching guardrail decay before it turns into a quiet failure?

u/Acrobatic_Task_6573 — 7 hours ago

What's the best framework for building agents in JavaScript?

I am a JavaScript developer trying to build a simple AI agent for customer support. LangChain feels like way too much, and the Python bias is real lol. I want to build agents in JavaScript.

u/jengle1970 — 10 hours ago

Langgraph: Node vs Graph Evaluation

Hi all,

I'd love to hear your take on the approach to evaluate a langgraph graph, both offline during development and online during production.

A. Background

  1. I recently built a POC with LangGraph to perform a complex workflow on long-form company documents. It takes quite a number of nodes to produce reasonably acceptable final outputs: content detection, reasoning, applying business knowledge, classification, structured output...

  2. The final outputs need to contain a nested JSON, which combines different structured outputs from different worker nodes.

B. Challenges

  1. As this is a new use case, there's no prior ground truth dataset. I need to bootstrap some high-level evaluation sets for just sampling and vibe checking the final outputs.

  2. Evaluating only the final outputs proves insufficient, because an error can propagate from an intermediate node even when the other nodes are fine.

  3. Designing test cases to evaluate the final outputs is challenging because of the highly nested structure, which can be subject to change.

C. What I'm trying now:

  1. Building custom wrappers to evaluate each node. The scorers can be LLM judges or code-based.

  2. The evaluation process is similar to evaluating a MLflow model, where I can log the prompts, the evaluation metrics, datasets...

  3. I can examine the scorer evaluations to gradually create a golden dataset for reference-based evaluation. This will unavoidably take effort from the business side. If I have 10 LLM nodes, I'd need 10 evaluation datasets, and only the first few nodes, at best, will benefit from business input; the rest may need custom inputs for test cases.
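A bare-bones shape for such a per-node wrapper, assuming nothing about LangGraph internals (node_fn, cases, and scorer are stand-ins):

```python
def evaluate_node(node_fn, cases, scorer):
    """Run one node against its own small dataset and score each output.
    scorer can be code-based (as here) or an LLM judge returning 0..1."""
    results = []
    for case in cases:
        output = node_fn(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "score": scorer(output, case.get("expected")),
        })
    passed = sum(r["score"] >= 0.5 for r in results)
    return {"results": results, "pass_rate": passed / len(results)}
```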

D. My questions:

  1. I can see some merit in node-based evaluation, but I also foresee the big effort in repeatedly doing it for all nodes. A node's logic or output structure may change, so its evaluation logic and golden set are subject to change too, adding more effort. Do you think it's a worthwhile idea?

  2. Is there a more efficient approach to graph evaluation?

  3. Am I overlooking anything?

u/Careless_Handle8112 — 19 hours ago

How I solved "Conflict of Laws" in a financial RAG — ITA 1961 vs ITA 2025 parallel retrieval with graceful degradation [with screenshots]

Previous posts covered the 8-node LangGraph architecture and table extraction. This one is about a different problem I hadn't seen discussed here:

What happens when two valid versions of the same law exist simultaneously?

India currently has:

  • Income Tax Act 1961 (still operative)
  • Income Tax Act 2025 (new regime, FY 2026-27)

Both are valid. Both answer "tax slab" queries differently. A naive RAG picks one. Mine picks both and reconciles.

Parallel-Firing Intent Classifier: Node 1 (Classifier) doesn't just route; it fires multiple retrieval intents simultaneously:

  • ITA 1961 namespace
  • ITA 2025 namespace

Chunk-level metadata tags resolve which regime applies to the specific query, so the version conflict is resolved before the LLM generates.
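A sketch of the parallel firing, with stub retrievers standing in for the real namespace queries (all names here are illustrative, not the actual implementation):

```python
import asyncio

async def retrieve(namespace: str, query: str) -> list:
    # Stand-in for a real vector-store query against one Act's namespace.
    return [{"namespace": namespace, "text": f"{namespace} result for {query}"}]

async def parallel_retrieve(query: str) -> list:
    """Fire both Act namespaces at once; reconcile before generation."""
    r1961, r2025 = await asyncio.gather(
        retrieve("ITA-1961", query),
        retrieve("ITA-2025", query),
    )
    # Reconciliation sketch: keep both result sets, tagged by regime,
    # so the generator can say which Act each figure comes from.
    return r1961 + r2025

chunks = asyncio.run(parallel_retrieve("tax slab"))
```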

The generator receives pre-reconciled context.

Two honest behaviors, both intentional:

Behavior 1: Document indexed (screenshot):

  • Section 392 TDS on Salary
  • 8 sources cited, page-level attribution
  • ITA 1961 + ITA 2025 cross-referenced
  • 61% confidence score
  • Response grounded 100% in retrieved chunks

Behavior 2: Document NOT indexed (screenshot):

  • 0 chunks fetched
  • No hallucination, no fake slabs
  • Graceful degradation: general knowledge used transparently, "official context unavailable" flagged explicitly
  • User not left empty-handed, not given dangerous data

This is an intentional two-tier architecture:

  • Render free tier: light index, production stable
  • Local 16GB: full Acts indexed, heavy retrieval

>Note: That italic text in the "Agentic Logic" box — that's not UI decoration. That's the Classifier node's real-time Chain-of-Thought firing before any retrieval happens.
Most RAG systems are black boxes — query goes in, answer comes out, you have no idea why. This exposes the reasoning layer:
- What the query intent is
- Which Act to target
- What retrieval scope to apply
This is Agentic Reasoning, not just routing.

AMA on the conflict resolution logic or the graceful degradation implementation.

u/Lazy-Kangaroo-573 — 14 hours ago

Built an OpenAI-compatible API reverse proxy — opening for community stress testing for ~12hrs (GPT-4.1, o4-mini, TTS)

Hey Devs,

I've been building a personal, non-commercial OpenAI-compatible reverse proxy gateway that handles request routing, retry logic, token counting, and latency tracking across multiple upstream endpoints.

Before I finalize the architecture, I want to stress test it under real-world concurrent load — synthetic benchmarks don't catch the edge cases that real developer usage does.

Available models:

  • gpt-4.1 — Latest flagship, 1M context
  • gpt-4.1-mini — Fast, great for agents
  • gpt-4.1-nano — Ultra-low latency
  • gpt-4o — Multimodal capable
  • gpt-4o-mini — High throughput
  • gpt-5.2-chat — Azure-preview, limited availability
  • o4-mini — Reasoning model
  • gpt-4o-mini-tts — TTS endpoint

Works with any OpenAI-compatible client — LiteLLM, OpenWebUI, Cursor, Continue dev, or raw curl.

To get access:

Drop a comment with your use case in 1 line — for example: "running LangChain agents", "testing streaming latency", "multi-agent with LangGraph"

I'll reply with creds. Keeping it comment-gated to avoid bot flooding during the stress test window.

What I'm measuring: p95 latency, error rates under concurrency, retry behavior, streaming reliability.

If something breaks or feels slow — drop it in the comments. That's exactly the data I need.

Will post a follow-up with full load stats once the test window closes.

(Personal project — no paid tier, no product, no affiliate links.)

u/NefariousnessSharp61 — 14 hours ago

LangChain performance bottlenecks and scaling tips?

Been wrestling with this myself. Found vector DB queries getting slow at scale – switched to a FAISS index with GPU acceleration which helped a lot. For larger jobs, distributing the processing across multiple GPUs using OpenClaw significantly cut down completion time (think hours down to minutes for finetuning a large dataset).

u/lewd_peaches — 22 hours ago

Having some problem in langchain4j

I'm trying to split data in a Java class with langchain4j: I first convert it to a String, wrap it in a Document, then use DocumentSplitter(500, 50, tokenizer). The problem is this line:

Tokenizer tokenizer = new GoogleAiGeminiTokenizer(apiKey);

Both Tokenizer and GoogleAiGeminiTokenizer are red-lined, and Ctrl+Space doesn't offer any class to import. I'm on langchain4j 1.12.2 (the older 0.35.0 had a bug), but it still doesn't recognise Tokenizer at all.

what to do

u/RelationshipFar2187 — 15 hours ago

Why we stopped using vector-only retrieval for agent memory (and what we use instead)

When we first built persistent memory into our agent pipeline, we went with vector search: pgvector, cosine similarity, retrieve top-k on each turn. Standard setup, works well, easy to reason about.

It held up fine during development. Started failing in predictable ways in production.

The failure modes we hit:

Exact keyword recall. User asks "what API key prefix did I set for staging?" The stored memory has sk-stg-0041 in it. Vector search on "API key prefix staging" will sometimes surface this — but as the memory store grows and you have dozens of API-related entries, the similarity scores cluster too tightly for reliable ranking. The specific identifier isn't semantically encoded in the embedding. BM25 finds it trivially.

Rare proper nouns. Any specific framework name, company name, or custom identifier that the embedding model hasn't seen enough of doesn't cluster cleanly. Vector search on "Graphiti" doesn't reliably retrieve memories containing the word "Graphiti" unless it happens to sit near semantically similar tokens. BM25 handles this trivially: it's an exact term match.

Density at scale. Vector search degrades as the store grows. More memories = more neighbors = noisier retrieval. You can add metadata filtering (by user, recency, topic) but it's a mitigation, not a fix. The precision tail keeps getting worse.

The fix: hybrid retrieval with RRF

We now run vector search and BM25 (via PostgreSQL tsvector) in parallel and merge using Reciprocal Rank Fusion.

typescript

const [vectorResults, bm25Results] = await Promise.all([
  vectorSearch(query, userId),
  keywordSearch(query, userId)
]);
return reciprocalRankFusion(vectorResults, bm25Results);

RRF formula: score = Σ 1 / (k + rank_i) where k=60. Results appearing in both lists get boosted. Results ranking high in one but absent from the other still surface.
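For anyone wanting to try this, RRF itself is only a few lines; this sketch takes ranked lists of document ids:

```python
def reciprocal_rank_fusion(*ranked_lists, k: int = 60):
    """Merge ranked lists: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; docs appearing in both lists get boosted.
    return sorted(scores, key=scores.get, reverse=True)
```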

The tsvector column is kept updated via a PostgreSQL trigger so there's no separate indexing pipeline.

Running both queries concurrently means the latency hit is ~max(vector_latency, bm25_latency), not the sum. In practice, both run fast enough that the retrieval step stays well under 100ms at p95.

For higher-stakes retrieval (e.g. customer support where a wrong recall causes a real problem), we add a cross-encoder reranker over the top 20 candidates. Adds 30–80ms but meaningfully improves precision on single-hop factual queries.

Anyone else gone down this path? Curious what retrieval setups people are running at scale.

u/alameenswe — 12 hours ago