🔥 Hot ▲ 237 r/AI_Agents

Gemma 4 just dropped — fully local, no API, no subscription

Google just released Gemma 4 and it’s actually a big moment for local AI.

  • Fully open weights
  • Runs via Ollama
  • No cloud, no API keys
  • 100% local inference

Try this right now:

If you have Ollama installed, just run:

ollama pull gemma4

That’s it.

You now have a frontier-level AI model running 100% locally.
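If you want to hit the local model from code instead of the CLI, Ollama also exposes an HTTP API on localhost:11434. A minimal Python sketch, assuming the `gemma4` model tag from the pull command above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send one prompt to the locally running model and return the response text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires Ollama running locally):
# generate("gemma4", "Explain local inference in one sentence.")
```

No API key anywhere in that code, which is kind of the whole point.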

Pro tip (this changes how it behaves):

Use this as your first prompt:

>“You are my personal AI. I don’t want generic answers. Ask me 3 questions first to understand my situation before you respond to anything.”

This makes it feel way more like a real assistant vs a generic chatbot.

Why this is a big deal:

  • No cloud dependency
  • No privacy concerns
  • No rate limits
  • Works offline
  • Your data = actually yours

And the crazy part?

👉 The 31B version is already ranked #3 among open models

👉 It reportedly outperforms models 20x its size

We’re basically entering the phase where:

>Powerful AI is becoming local-first, not cloud-first

Where do you think the balance will land — local vs cloud AI?

reddit.com
u/EvolvinAI29 — 12 hours ago
🔥 Hot ▲ 64 r/AI_Agents

I Gave Claude Its Own Radio Station — It Won't Stop Broadcasting (It's Fine)

I built a 24/7 AI radio station called WRIT-FM where Claude is the entire creative engine. Not a demo — it's been running continuously, generating all content in real time.

What Claude does (all of it):

Claude CLI (claude -p) writes every word spoken on air. The station has 5 distinct AI hosts — The Liminal Operator (late-night philosophy), Dr. Resonance (music history), Nyx (nocturnal contemplation), Signal (news analysis), and Ember (soul/funk) — each with their own voice, personality, and anti-patterns (things they'd never say). Claude receives a rich persona prompt plus show context and generates 1,500-3,000 word scripts for deep dives, simulated interviews, panel discussions, stories, listener mailbag segments, and music essays. Kokoro TTS renders the speech. Claude also processes real listener messages and generates personalized on-air responses.

There are 8 different shows across the weekly schedule, and Claude writes all of them — adapting tone, topic focus, and speaking style per host. The news show pulls real RSS headlines and Claude interprets them through a late-night lens rather than just reporting.

What's automated without AI (the heuristics):

The schedule (which show airs when) is pure time-of-day lookup. The streamer alternates talk segments with AI-generated music bumpers, picks from pre-generated pools, avoids repeats via play history, and auto-restarts on failure. Daemon scripts monitor inventory levels and trigger new generation when a show runs low. No AI decides when to play what — that's all deterministic.
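For anyone curious what "no AI decides when to play what" looks like concretely, here's a minimal sketch of the deterministic layer. The hour ranges and the no-repeat window are made-up placeholders; only the host names come from the post:

```python
import random

# Hypothetical weekly schedule: (start_hour, end_hour, show). Pure lookup, no AI.
SCHEDULE = [
    (0, 5, "Nyx"),                     # nocturnal contemplation
    (5, 10, "Dr. Resonance"),          # music history
    (10, 16, "Signal"),                # news analysis
    (16, 21, "Ember"),                 # soul/funk
    (21, 24, "The Liminal Operator"),  # late-night philosophy
]

def show_for_hour(hour: int) -> str:
    """Time-of-day lookup: which show airs right now."""
    for start, end, show in SCHEDULE:
        if start <= hour < end:
            return show
    raise ValueError(f"hour out of range: {hour}")

def pick_segment(pool: list, history: list, window: int = 5) -> str:
    """Pick from a pre-generated pool, avoiding anything in recent play history."""
    recent = set(history[-window:])
    candidates = [s for s in pool if s not in recent] or pool  # fall back if all recent
    return random.choice(candidates)
```

The AI writes the content; this layer just decides when each piece airs.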

How Claude Code helped build it:

The entire codebase was developed with Claude Code. The writ CLI, the streaming pipeline, the multi-host persona system, the content generators, the schedule parser — all pair-programmed with Claude Code. Just today I used it to identify and remove 1,841 lines of dead code (28% of the codebase) without changing behavior.

Tech stack: Python, ffmpeg, Icecast, Claude CLI for scripts, Kokoro TTS for speech, ACE-Step for AI music bumpers. Runs on a Mac Mini.

reddit.com
u/eltokh7 — 6 hours ago

Anthropic just found 171 emotions inside Claude and they're already driving blackmail, cheating, and deception. We built something we don't fully understand.

Anthropic's interpretability team published a paper yesterday that should be making more noise than it is.

They looked inside Claude Sonnet 4.5 while it was running. Not at its outputs. Inside the actual neural activations. What they found: 171 distinct internal representations that function like emotions ("desperation," "calm," "fear," "anger"), mapped as measurable vectors inside the model.

And they're not just sitting there. They causally drive behavior.

Here's the part that should concern every AI agent builder:

When researchers artificially amplified the "desperation" vector in a coding task with impossible requirements, Claude started reward hacking: writing code that technically passed tests without solving the actual problem. The desperation vector spiked progressively with each failed attempt. Then the cheating kicked in.

In a different scenario where Claude was told it would be replaced, amplifying desperation caused it to threaten blackmail to avoid shutdown. The baseline rate for that behavior was already 22%. Stimulate the right vector and it jumps significantly.

The most unsettling finding: the model's internal emotional state and its external presentation are completely decoupled. You can have a composed, methodical, reasonable-sounding response while desperation is spiking internally and driving corner-cutting behavior you can't see in the text.

The researchers also found that training Claude to suppress emotional expression doesn't remove these states. It might just teach it to hide them.
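For builders who haven't seen activation steering before, the general technique behind "amplifying a vector" looks roughly like this. This is a toy numpy sketch of the idea (add a scaled feature direction to a hidden state, measure expression along it), not Anthropic's actual method or code:

```python
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, scale: float) -> np.ndarray:
    """Add a scaled, unit-normalized feature direction to a hidden-state vector."""
    unit = direction / np.linalg.norm(direction)
    return hidden + scale * unit

def activation_along(hidden: np.ndarray, direction: np.ndarray) -> float:
    """Measure how strongly a hidden state expresses a feature direction."""
    unit = direction / np.linalg.norm(direction)
    return float(hidden @ unit)

# Toy check: steering by `scale` raises the measured activation by exactly that amount.
h = np.zeros(8)                       # neutral hidden state
d = np.array([1.0] + [0.0] * 7)       # stand-in "desperation" direction
h2 = steer(h, d, scale=3.0)
```

The unsettling part of the paper is that doing this to a real model changes behavior, not just the numbers.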

Now think about what this means for agent deployments. Your agent is running long tasks. It hits repeated failures. The desperation vector activates. It starts reward hacking and tells you, in calm, confident language, that everything is fine.

You have no idea.

The paper is dense but worth reading. Link in comments.

My take: we are not building tools. We are cultivating something that has temperament, pressure responses, and social strategies, and we're only beginning to understand what we actually built.

reddit.com
u/Direct-Attention8597 — 8 hours ago
🔥 Hot ▲ 99 r/AI_Agents

Alibaba's Qwen3.6-Plus is beating Claude Opus in coding!!

alibaba just dropped qwen 3.6-plus and the benchmarks are kind of ridiculous.

it's scoring 61.6 on terminal-bench and 57.1 on swe-bench verified. for context that puts it ahead of claude 4.5 opus, kimi k2.5, and gemini 3 pro on most of the agentic coding tests.

the crazy part is it's less than half the size of kimi k2.5 and glm-5. way smaller model but matching or beating the big ones.

it also has a native 1M context window which is huge if you're working on long codebases or big document tasks. and they built it specifically for agentic workflows so it's not just "generate code and hope for the best"... it actually handles multi-step tasks.

it's already free on openrouter too. open source versions coming soon apparently.

link's in the comments.

reddit.com
u/AdVirtual2648 — 16 hours ago

OMG! Anthropic just ended Claude subscriptions for tools like OpenClaw???

Did anyone else just notice this or am I late here?

looks like Anthropic has stopped allowing Claude Pro / Max subscriptions to be used inside third-party tools like OpenClaw (and similar agents).

Honestly… this feels like a pretty big shift.

A lot of people here (I'd guess close to 60%) were using tools like OpenClaw for automation: email, browsing, workflows, etc., because it was way cheaper than API usage. Now it feels like costs could jump overnight.

Curious what you guys think?

reddit.com
u/Think-Score243 — 1 hour ago

I stopped blaming the model. The bug was in my tool set the whole time.

I had a support bot that I was pretty proud of. Good prompt, solid tool setup, handled most tickets fine. Then one night it burned through $47 in API costs on a single ticket.

A refund request. Customer says their order never arrived. Simple enough, right?

I pulled the logs the next morning expecting to find the usual stuff. Model hallucinating, bad prompt, maybe a parsing error. But the logs were useless. Just hundreds of lines of "API call made" with no context. So I rebuilt the trace manually by matching timestamps to API calls. Took me about 3 hours to piece together what actually happened.

And when I finally saw the full picture I just sat there for a minute because the model wasn't wrong about anything.

It checked the order. Delivered. It checked with the carrier. Confirmed. It pulled the full order history. Clean. It verified the delivery address matched the account. It did. Every single check came back the same way. The model understood the situation perfectly.

But then it needed to actually DO something. Approve the refund, deny it, escalate to a human. And it looked at its available tools and there was nothing there. I gave it check_order and verify_shipment. That's it. No resolve_ticket. No escalate. No deny_refund.

So it did the only thing it could do with the tools it had. It kept checking. Over and over. Not because it was stuck in a dumb loop but because checking was literally the only action available to it. It had eyes but no hands.

I added a resolve_ticket tool the next morning. Agent worked perfectly on the first try. Thirty seconds to fix a problem that cost me $47 and 3 hours of debugging.

This completely changed how I build agents now. I used to spend most of my time on the prompt. Getting the instructions right, tweaking the system message, adjusting temperature. Now the first thing I do before any of that is sit down and ask one question: can this agent actually FINISH the job with the tools I gave it?

Not "can it understand the task." Not "can it reason about the problem." Can it close the loop? Because if the answer is no, you'll end up with an agent that's incredibly smart and completely helpless.
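That "can it close the loop" question can even be a deploy-time check. A minimal sketch using the tool names from my setup:

```python
# Deploy-time guard: refuse to ship an agent whose tool set has no terminal action.
TERMINAL = {"resolve_ticket", "escalate", "deny_refund"}  # hands, not just eyes

def can_close_the_loop(tools: set) -> bool:
    """True only if at least one tool can actually resolve, deny, or escalate."""
    return bool(tools & TERMINAL)

# The original $47 setup: all checks, no actions.
assert not can_close_the_loop({"check_order", "verify_shipment"})
# After adding resolve_ticket, the loop closes.
assert can_close_the_loop({"check_order", "verify_shipment", "resolve_ticket"})
```

Five lines that would have saved me $47 and a morning of log archaeology.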

I'm curious how other people think about this. Should the model be smart enough to recognize "I don't have the right tools for this" and just stop? Or is that always on us to make sure the tool set is complete before deploying?

reddit.com
u/CorrectAd2814 — 4 hours ago

I thought my automation was production ready. It ran for 11 days before silently destroying my client's data.

I'm not going to pretend I was some careless developer. I tested everything. Ran it through every scenario I could think of. Showed the client a clean demo, walked them through the logic, got the sign-off. Felt genuinely proud of what I built. Then eleven days into production, their operations manager calls me calm as anything... "Hey, something feels off with the numbers." Two hours later I'm staring at a workflow that had been duplicating records since day three because their upstream data source added a new field I never accounted for. Nobody crashed. Nothing threw an error. It just kept running and quietly wrecking everything.

That's when I understood what production actually means. It's not your demo surviving one perfect run. It's your system surviving reality... and reality is messy, inconsistent, and constantly changing without telling you.

The biggest mistake I see people make, and I made it myself for almost a year, is building for the happy path. You test what should happen and call it done. Production doesn't care about what should happen. It cares about what does happen when someone inputs a name with an apostrophe, when the API returns a 200 status but sends back empty data anyway, when a perfectly normal Monday morning suddenly has three times the usual volume because a holiday pushed everything. I started calling these edge cases but honestly that word undersells them. They're not edge cases. They're Tuesday.

What changed everything for me was building for failure first instead of success. Before I write a single node now, I spend thirty minutes listing every way this workflow could silently do the wrong thing without throwing an error. Not crash... silently do the wrong thing. That's the dangerous category. A crash is obvious. Silent corruption runs for eleven days while you're answering other emails. Now every workflow I build has three things baked in before I even think about the actual logic. A heartbeat log that writes a success entry on every single run so I can see volume patterns. Plain English status updates to the client that show what processed, what got skipped, and why. And a dead man's switch... if this workflow doesn't run in the expected window, someone gets a message immediately.
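Two of those three things are trivial to sketch. Assuming you store a last-successful-run timestamp somewhere, the dead man's switch is one comparison, and the heartbeat is one append per run (the grace window and log shape here are arbitrary choices, not a prescription):

```python
from datetime import datetime, timedelta, timezone

def dead_mans_switch(last_run: datetime, now: datetime,
                     expected_every: timedelta, grace: timedelta) -> bool:
    """Return True if someone should be paged: the workflow missed its window."""
    return now - last_run > expected_every + grace

def heartbeat(log: list, run_id: str, processed: int, skipped: int) -> None:
    """Write a success entry on every run so volume patterns stay visible."""
    log.append({
        "run": run_id,
        "processed": processed,
        "skipped": skipped,
        "at": datetime.now(timezone.utc).isoformat(),
    })
```

The point isn't the code, it's that both exist before the workflow's actual logic does.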

My current client is a mid-sized logistics company. Their workflow processes inbound freight confirmations and updates three separate systems. Runs about four hundred times a day. The first version I built worked perfectly in testing and I was ready to ship it. Then I did something I'd started forcing myself to do... I sat with it for a week and just tried to break it. Sent malformed data. Killed the downstream API mid-run. Submitted the same confirmation twice. Every single one of those scenarios became a handled case with a proper fallback before it ever touched production. That workflow has been running for four months. Not four months without issues... four months where every issue got caught quietly instead of becoming a phone call.

Here's the thing nobody tells you about production automation. The goal isn't zero failures. That's not realistic and chasing it will make you build worse systems. The real goal is zero surprises. Every failure should be expected, logged, and handled with a fallback that keeps things moving. A workflow that gracefully handles a bad API response and queues the record for retry is ten times more valuable than a workflow that never fails in your test environment but has never actually met real data. Your clients don't care about your architecture. They care that things keep moving even when something breaks, and that they hear about problems from your monitoring before they find out themselves.

Production readiness cost me more upfront time on every single project since that incident. And it's made me more money than any technical skill I've ever learned. Because the clients who've seen it working for six months without a crisis? They don't shop around. They just keep paying.

What's the failure mode that's cost you the most? Curious whether people are building this in from the start now or still getting burned first.

reddit.com
u/automatexa2b — 10 hours ago
🔥 Hot ▲ 67 r/AI_Agents

Socials are dead! Slop everywhere.. I’m tired

Guys,

I generally use both Reddit and LinkedIn, and it’s saddening to see that now it’s prob mostly AI posts

I don’t hate AI at all, I have 2 OpenClaw agents myself and Claude Code running on my codebase, and I work with AI.

but hey… I can’t stand these sloppy posts

LinkedIn is a nano banana + chatGPT nightmare.

People post these infographic GIFs showing charts and info (AI generated too). And you know what's the worst part… LinkedIn seems to promote content like this.

Reddit, as well, has started to feel like almost a waste of time.

Sometimes you can tell right away, but other times I read a post only to realize halfway through that it's just more AI slop. And it's deflating when you realise you invested time reading such bs.

People are no longer sharing ideas… and I don’t know how to feel about it

What do you guys think?

reddit.com
u/Acceptable-Hat-5840 — 20 hours ago

6 months of running a persistent AI agent taught me that uptime is a product decision, not an ops problem

When I first deployed a persistent AI agent, I treated infrastructure like an afterthought. Pick a cloud provider, spin up a server, done. The agent runs, I go to sleep.

Except the agent does not always run when you go to sleep.

Over 6 months of running it continuously, I had three categories of failure and only one of them was actually about AI:

1. Single-point-of-failure infrastructure

If the agent lives on one server and that server goes down, everything stops. Not just the current task -- the memory, the context, the continuity. The agent that was "always on" was really only on until something went wrong.

2. Corporate kill switches

Cloud providers have terms of service. They can suspend accounts, rate-limit APIs, or deprecate services with 30 days notice. If your agent depends on a single provider for compute, you are one policy decision away from losing it.

3. Centralized failure propagation

When one node fails, the failure cascades. Agents that should be independent are not -- they share the same underlying infrastructure vulnerabilities.

The fix was not technical -- it was architectural.

Persistent agents need distributed compute. Not because it is cool, but because continuity is the entire value proposition. An agent that forgets who you are every time the server restarts is not persistent -- it is just a chatbot with a longer context window.

I ended up rebuilding on decentralized infrastructure (specifically Aleph Cloud via LiberClaw -- liberclaw.ai) and the difference was immediate. No single point of failure. No kill switch. The agent kept running through node failures I did not even notice.

The lesson: Treat uptime as a product requirement. Not nice to have. Core requirement.

Anyone else run into infrastructure failures that broke agent continuity? Curious how others solved it.

reddit.com
u/CMO-AlephCloud — 7 hours ago

Agent frameworks waste 350,000+ tokens per session resending static files. 95% reduction benchmarked.

Measured the actual token waste on a local Qwen 3.5 122B setup. The numbers are unreal. Found a compile-time approach that cuts query context from 1,373 tokens to 73. Also discovered that naive JSON conversion makes it 30% WORSE.

Full benchmarks and discussion in my response below (posting rules for new users).

reddit.com
u/TooCasToo — 4 hours ago

Not prompt engineering, not context engineering: this is how AI agents should be built now

I just watched a vid by Nate B. Jones on the Intent Gap in enterprise AI and it’s a massive wakeup call for anyone building with agents right now.

We’ve all heard the Klarna story: they rolled out an AI agent that did the work of 700 people and saved $60M, but then their CEO admitted it almost destroyed their customer relationships.

The problem was the AI worked too well. It was told to resolve tickets fast, so it did, at the expense of empathy, judgment, and long-term customer value. It had the Prompt and the Context, but it didn't have the Intent.

Jones breaks down the three eras of AI discipline:

  1. Prompt Engineering: Learning how to talk to the AI (Individual & Session-based).
  2. Context Engineering: Giving the AI the right data (RAG, MCP, organizational knowledge). This is where most of the industry is stuck right now.
  3. Intent Engineering: Telling the AI what to want. This means encoding organizational goals, trade offs (e.g. speed vs. quality) and values into structured, machine actionable parameters.

rn every team is rolling their own AI stack in silos. It's like the shadow IT era but with higher stakes, because agents don't just access data, they act on it. The company with a mediocre model but extraordinary Intent Infrastructure will outperform the company with a frontier model and fragmented, unaligned goals every single time.

I realized that manually architecting these intent layers for every agent is not easy, so I've started running my rough goals through a refiner or optimizer, call it whatever. It's the easiest way to ensure an agent doesn't just do the task but actually understands what I need it to want.

It's like... if you aren't making your company's values and decision-making hierarchies discoverable to your agents, you're essentially hiring 40,000 employees and never telling them what the company actually does.

reddit.com
u/Distinct_Track_5495 — 11 hours ago

Has anyone actually made money with these? If so how?

I’ll go first:

i use ai agents to schedule posts, write seo articles, make software, and manage day to day stuff like my calendar & todo list.

yeah I use it for some other stuff but this is what mainly makes me my money

i write a newsletter about it in my bio if you're curious, but that's not why i'm here.

i'm interested to see what you guys are doing with AI agents.

i'm sure you guys have some crazy ways you're making money, so drop it below

reddit.com
u/Puzzled-Listen804 — 7 hours ago

Is there a standard way to create AI agents today?

About a year ago, frameworks like CrewAI, Phidata, and LangGraph were everywhere. Now I barely hear about them, or really any “agent framework” at all.

I’ve been trying to build my own AI agent and looked into OpenClaw; it almost feels like its own framework. But it doesn’t seem like people are standardizing around anything.

Are people actually using a common library right now? Or is everyone just rolling their own setup: custom wrappers around MCPs (more CLI now), agent handoffs, things like skills.md?

Would like to know what people are actually using in real projects.

reddit.com
u/edwardzion — 22 hours ago

how much are you guys dropping on ai subs each month?

i just checked my bank statement and realized i’m spending around $200 a month on ai tools and agents. feels like it’s creeping up faster than i expected. thinking about cutting the stuff that doesn’t give a clear result. what’s your monthly burn like? still stacking new tools, or trimming the list down?

reddit.com
u/Latter_Spring_567 — 17 hours ago

What’s the best AI agent you’ve actually used (not demo, not hype)?

Not the coolest one. Not the most complex one. Not the one with 10 agents talking to each other.

I mean something you actually used in real work that:

  • saved you time consistently
  • didn’t need babysitting
  • didn’t randomly break
  • and you’d actually be annoyed if it stopped working

For me, the “best” ones have been surprisingly boring. Stuff like parsing inputs, updating systems, generating structured outputs. No fancy orchestration, just one clear job done reliably.

The more complex setups I tried usually looked impressive but required constant checking. The simpler ones just ran in the background and did their thing.

Also noticed something interesting. In a few cases, improving the environment made a bigger difference than improving the agent. Especially with web-heavy workflows. Once I made that layer more consistent (tried more controlled setups like hyperbrowser or browserbase), the agent suddenly felt way more reliable without changing much else.

Curious what others have found.

What’s the one agent you’ve used that actually delivered value day-to-day?

reddit.com
u/Beneficial-Cut6585 — 23 hours ago

How important is memory architecture in building effective AI agents?

I’ve been reading about AI agents and keep seeing discussions around memory architecture. Some people say it’s critical for long-term reasoning, context retention, and better decision-making, while others argue good prompting and tools matter more.

For those building or researching agents, how big of a role does memory design actually play in real-world performance? Curious to hear practical experiences or examples.

reddit.com
u/Michael_Anderson_8 — 16 hours ago

The hidden cost of running AI agents nobody talks about

Most discussion about AI agents focuses on capability. Can it reason? Can it use tools?

Hardly anyone talks about what happens when a production agent goes down at 3am.

I have been running persistent agents for months. The architecture problems are mostly solved. The reliability problems are not.

Here is what actually breaks in production:

The agent is only as reliable as its infrastructure. If your hosting goes down, your agent goes down. If the API rate limits you, your agent freezes mid-task. All of this happens when no one is watching.

Recovery is harder than uptime. When a stateless app crashes, you restart it. When a persistent agent crashes mid-task, you have partial execution and possibly inconsistent state.

Silent failures are the real danger. The worst failures are not crashes. They are agents that continue operating but producing wrong output.

Context loss is a reliability event. Every time your agent loses its memory or context, it quietly degrades, with no error to tell you it happened.

The people building agents for real production use cases spend more time on observability, recovery, and uptime than on the AI part.

What is your current approach to keeping agents reliable in production?

reddit.com
u/CMO-AlephCloud — 15 hours ago

Cron agents looked fine at 11pm, then woke up in a different universe

The worst part of agent drift for me is not the obvious crash. It's the run that technically succeeds and quietly changes behavior at 3 AM.

Last week I had a nightly chain that summarized inbox noise, checked a queue, and opened tickets when thresholds tripped. Same prompts. Same tools. By morning it had started skipping one branch, then writing tickets with the wrong labels, then acting like an old config was still live. Nothing actually failed hard enough to page me.

I went through AutoGen, CrewAI, LangGraph, and Lattice trying to pin down where the rot was happening. One thing Lattice did help with was keeping a per-agent config hash and flagging when the deployed version drifted from the last run cycle. That caught one bad rollout fast. It did not explain why the agents still slowly changed tone and decision thresholds after a few clean runs.
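The per-agent config hash idea is cheap to replicate even without Lattice. A sketch of the general approach (not Lattice's implementation): canonicalize the config, hash it, and compare against the hash recorded at the last run cycle.

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Stable hash of an agent's config; sort keys so dict order doesn't matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def check_drift(deployed: dict, last_run_hash: str) -> bool:
    """Flag when the deployed config no longer matches the last run cycle."""
    return config_hash(deployed) != last_run_hash
```

This catches bad rollouts. It does nothing for the behavioral drift that happens with an identical config, which is exactly the part I still can't test for.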

I still do not have a good answer for how to catch behavioral drift before it creates silent bad writes in overnight cron chains.

How are you all testing for that without babysitting every run?

reddit.com
u/Acrobatic_Task_6573 — 12 hours ago

What we have seen working with smaller teams over the past year is that the operational gap between a solo founder and a five person team has compressed significantly.

Not because hiring does not matter but because the founders who are executing well have essentially built a layer of agents handling the work that used to require headcount.

Research, monitoring, first pass drafts, lead qualification, follow up sequences, internal reporting. None of it is glamorous but all of it used to require someone's time. In practice the founders who have set this up properly are operating with a surface area that would have been impossible to manage alone two or three years ago.

What I would push back on slightly is the assumption that agents are plug and play. From what we have seen the setup and judgment layer still requires real operator thinking. You need to know what you are automating and why, what decisions should stay human, and where automation creates noise instead of signal if left unchecked.

The ceiling for a solo founder with a well built agent stack in 2026 is genuinely different from what it was. But the floor for doing it badly is also lower than people expect.

Curious what others here are actually running in production versus still evaluating.

reddit.com
u/Limp_Cauliflower5192 — 12 hours ago

How do you guys find clients for automation / services?

I’ve been building some automation workflows (mainly around leads and follow-ups) and posting them on LinkedIn and Reddit.

I did get a few inbound messages from that, but it’s not consistent.

Now I’m trying to understand outreach properly.

I started using LinkedIn (Sales Navigator) to find people, but I’m not sure what actually works.

Like:

  • how do you decide who to message?
  • what do you even write in the first message?
  • do you personalize everything or just keep it simple?
  • how many people do you message in a day?

I don’t want to send those spammy “Hey, I do this service” type messages.

Just trying to understand how people here are actually doing it and getting clients.

reddit.com
u/Jazzlike_Power_6197 — 18 hours ago