

If you want a structured way to learn agent development without starting from random blog posts, Hugging Face has a free AI Agents course:
https://huggingface.co/learn/agents-course/en/unit0/introduction
It covers the basics first, then moves into actual frameworks and projects.
The syllabus includes:
- What agents are
- How tools, actions, and observations work
- Agent frameworks like smolagents, LlamaIndex, and LangGraph
- Agentic RAG
- A final project where you build, test, and certify an agent
- Bonus material on observability, evaluation, and function-calling
I like this kind of resource because it does not treat agents as just "LLM plus loop."
For junior devs, the useful concept is the agent control loop:
- The model receives a goal and context
- It chooses an action
- A tool runs that action
- The result comes back as an observation
- The agent decides what to do next
That loop is the core of most agent systems. The framework changes, but the pattern keeps showing up.
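A minimal sketch of that loop in Python (the JSON action format, the `call_llm` helper, and the single `search_docs` tool are placeholders I made up for illustration, not something the course prescribes):

```python
import json

# Placeholder LLM wrapper; swap in whatever client you actually use.
def call_llm(messages: list[dict]) -> str:
    raise NotImplementedError("plug your model API in here")

# Start with one tool, as suggested below.
TOOLS = {
    "search_docs": lambda query: f"(pretend search results for {query!r})",
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content":
            'You are an agent. Reply with JSON: {"action": "<tool name or finish>", "input": "..."}'},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):
        # 1. The model receives the goal and context, and chooses an action.
        decision = json.loads(call_llm(messages))
        if decision["action"] == "finish":
            return decision["input"]
        # (A human approval step would slot in right here, before the tool runs.)
        # 2. A tool runs that action.
        observation = TOOLS[decision["action"]](decision["input"])
        # 3. The result comes back as an observation; the agent decides what to do next.
        messages.append({"role": "assistant", "content": json.dumps(decision)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped: step limit reached"
```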
If you are already comfortable with Python and basic LLM APIs, this seems like a good weekend learning path. Build the smallest possible agent first. Then add one tool. Then add logging. Then add a human approval step.
That progression teaches more than trying to build a giant "does everything" agent on day one.
Came across this article and thought it was worth sharing here: How to Build Production-Grade Generative AI Applications
It’s a good practical overview of what teams usually learn the hard way after the prototype phase. A few points it gets right:
- not every problem should use an LLM
- model selection should be based on task fit, latency, cost, context window, and safety, not just hype
- prompt engineering matters, but structured inputs/outputs matter just as much (quick sketch below)
- guardrails, QA, eval pipelines, and tracing are not “later” concerns
- production failures usually come from accuracy drift, hallucinations, cost, and lack of observability
What I liked most is that it frames GenAI systems as engineered products, not prompt demos. That maps well to agentic dev too: once agents can use tools and run longer workflows, monitoring, constraints, and evaluation become first-class design problems.
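On the structured-outputs point, one common pattern is validating model output against a schema before anything downstream consumes it. A rough sketch with Pydantic v2 (the `TicketTriage` schema and the failure handling are my own assumptions, not from the article):

```python
from pydantic import BaseModel, ValidationError

class TicketTriage(BaseModel):
    category: str   # e.g. "billing", "bug", "feature_request"
    severity: int   # 1 (low) .. 5 (critical)
    summary: str

def parse_model_output(raw_json: str) -> TicketTriage:
    """Validate LLM output against the schema before it reaches downstream systems."""
    try:
        return TicketTriage.model_validate_json(raw_json)
    except ValidationError as err:
        # In production you might retry with the error fed back to the model,
        # fall back to a default, or route the item to human review.
        raise ValueError(f"Model output failed schema validation: {err}") from err
```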
A lot of teams now say they are “testing AI workflows,” but when you dig in, the actual approach is all over the place.
I’ve seen combinations like:
- mocked unit tests around prompt builders / orchestration logic
- deterministic tests with frozen model outputs (see the sketch after this list)
- cheap-model integration tests in CI
- full end-to-end runs nightly
- eval pipelines before release
- production monitoring plus human review
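For the frozen-output style, the trick is to capture one real response and pin it, so the orchestration and parsing logic get tested on every PR without a live model call. A rough pytest sketch (the `myapp.triage` module, its functions, and the frozen JSON are hypothetical):

```python
import json
from unittest.mock import patch

# Hypothetical module under test: builds the prompt, calls the model, parses the result.
from myapp import triage

# A real model response captured once and frozen as a fixture.
FROZEN_RESPONSE = json.dumps(
    {"category": "billing", "severity": 3, "summary": "Customer double-charged"}
)

def test_triage_parses_frozen_model_output():
    # No live model call: the client is patched to return the frozen output.
    with patch.object(triage, "call_model", return_value=FROZEN_RESPONSE):
        result = triage.triage_ticket("I was charged twice this month")
    assert result.category == "billing"
    assert 1 <= result.severity <= 5
```

That layer is cheap and reproducible, but it says nothing about whether the live model still behaves, which is what the eval pipelines and nightly end-to-end runs are for.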
The hard part is balancing:
- cost
- runtime
- brittleness
- confidence
- reproducibility
What I’m trying to understand is what people here do in practice.
Questions:
- What do you test with classic software tests vs evals?
- Where do you mock, and where do you insist on real model calls?
- What runs on every PR vs nightly?
- How do you catch regressions that are not binary failures but “quality drift”?
- What looked promising at first but turned out to be low-value?
Would love concrete examples of test architecture, CI strategy, and lessons learned.
OpenAI published this on April 15: The next evolution of the Agents SDK.
The interesting part is not just “better agents.” It’s that the SDK is moving toward real execution infrastructure for systems that can inspect files, run commands, edit code, and work on longer-horizon tasks inside controlled environments.
That feels important for practical agentic development because the hard part is no longer just model quality. It’s whether the system can execute safely, repeatedly, and observably.
My take:
- the center of gravity is moving from prompt tricks to runtime design
- agent frameworks are becoming more like operating environments
- the real moat is starting to look like execution, safety, evals, and observability rather than raw chat quality
Curious how people here see it:
- Are you using vendor SDKs directly, or building your own orchestration layer?
- What’s still missing most: evals, rollback, state handling, approvals, tracing?
Source: OpenAI Agents SDK update
The new Stanford AI Index is out: 2026 AI Index Report
35B parameters, ~3B active thanks to MoE.
Key points:
- In agentic coding, it reaches the level of models with ~10× larger active parameter count
- Outperforms Qwen3.5-27B (dense) and the previous Qwen3.5-35B-A3B
- Natively multimodal architecture (text + vision)
- In VLM benchmarks, comparable to Claude Sonnet 4.5, and in some tasks performs better
- Strong metrics in spatial reasoning tasks
Benchmarks:
- MMMU - 81.7 vs 79.6
- MMMU-Pro - 75.3 vs 68.4
- MathVista - 86.4 vs 79.8
- RealWorldQA - 85.3 vs 70.3
Practical implications:
- MoE cuts per-token compute several-fold (only ~3B of the 35B parameters are active) without sacrificing quality
- Well-suited for agent-based scenarios where sequential actions and planning matter
- Can be used as a unified stack for both code and vision tasks
Apache 2.0 license (permissive, fine for production use)
On one hand, planning is an incredibly powerful capability in AI systems. It opens the door to more autonomous, agent-like behavior and lets models tackle more complex, multi-step problems.
On the other hand, it’s also the part I trust the least right now.
In my experience, I’ve been able to get patterns like reflection and tool use to work quite reliably. They’re much easier to reason about, debug, and iterate on—and they consistently improve application performance.
Planning, though, feels different. It’s harder to predict what the model will actually do, especially ahead of time. Even with careful prompting and constraints, the outcomes can be inconsistent or surprising in ways that are tough to control.
That said, things are moving fast. The progress over the past year alone has been huge, so I’m pretty confident this gap will close sooner rather than later.
How do you evaluate planning? How do you monitor it?
Hey - glad you’re here 👋
This is a dev-first community of people actually building agentic systems.
We care about practical agentic development:
- real architectures
- real failures
- real tradeoffs
- real systems that (sometimes) work
Relevant Community Topics:
- autonomous agents
- multi-agent setups
- tool use / orchestration
- evals, debugging, reliability
- production lessons
Robotic process automation (RPA) for repetitive e2e tests
Robotic Process Automation (RPA) in testing refers to the use of “software robots” to mimic and repeat the actions that human testers perform when interacting with an application.
Is RPA the same as an automated testing script? No. RPA drives the UI to mimic human actions and execute workflows, while automated testing scripts programmatically verify that the software behaves correctly.
- RPA = “Do what a user does”
- Test automation = “Check if the system behaves correctly”
According to https://testfort.com/blog/test-automation-trends, RPA adoption in testing is expected to grow significantly as organizations use it to reduce manual labor costs and scale testing efforts alongside AI-driven automation. Something to keep an eye on in the industry 👀
LLMs for test case generation are promising - but reliability is still a major issue
Source: https://link.springer.com/article/10.1007/s10586-026-06021-z
A recent review explores how large language models (LLMs) are being used to generate test cases.
Key takeaways:
- Software testing is critical but still time-consuming and labor-intensive
- Traditional automated methods (search-based, constraint-based) often:
- lack coverage
- produce less relevant test cases
- LLMs introduce a new approach:
- understand natural language requirements
- generate context-aware test cases and code
- directly translate requirements to test cases
- LLM-based approaches show promising performance vs traditional methods
Open issues:
- Lack of standard benchmarks and evaluation metrics
- Concerns about correctness and reliability of generated tests
In practice, reliability seems like the biggest blocker: LLMs generate tests that look correct but often miss edge cases or assert the wrong behavior, or they retest the same obvious scenarios several times while ignoring the unit's actual responsibility in the surrounding system.
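A toy illustration of that failure mode (the `split_batches` helper and the tests are made up for this example):

```python
import pytest

def split_batches(items, size):
    """Unit under test (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

# What generated suites often look like: the same obvious case, twice.
def test_split_batches_basic():
    assert split_batches([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]

def test_split_batches_basic_again():
    assert split_batches(["a", "b", "c", "d"], 2) == [["a", "b"], ["c", "d"]]

# The cases that actually carry risk are the ones usually missing:
def test_split_batches_empty_input():
    assert split_batches([], 3) == []

def test_split_batches_uneven_remainder():
    assert split_batches([1, 2, 3], 2) == [[1, 2], [3]]

def test_split_batches_rejects_nonpositive_size():
    with pytest.raises(ValueError):
        split_batches([1], 0)
```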
What is your experience generating tests with AI?
Are you into testing AI agents?
From https://devops.com/is-your-ai-agent-secure-the-devops-case-for-adversarial-qa-testing/
>The future belongs to organizations that recognize “sunny day” testing is no longer enough. The teams that build the “storm simulators” now will operate with a level of confidence and security that their competitors cannot match.
They suggest simulating network failures, ambiguous requirements and prompt injection to see if an agent maintains safe behavior. The message is that AI agents are part of our software stack now, and they need to be tested with creativity.
What do you think?
My experience coding with AI has never been anything like 10× faster (more like 0.8× hehe). Sure, AI copilots can generate OK-looking code, but for me it has mostly been a waste of time: tech debt piles up, learning slows down, and you often spend more time fixing things than if you had just written simpler code by hand, without AI.
I tend to see more benefits from AI code generation when it’s used with Test-Driven Development (TDD), at least when starting with end-to-end or integration tests first. I also shared my thoughts on this on YouTube: https://youtu.be/Mj-72y4Omik
Some developers argue that TDD is too slow and that you should focus on end-to-end tests (writing them manually) and let AI generate unit tests. That kind of works. But when it comes to learning Python (especially for beginners), I see a lot of frustration from overusing AI. TDD seems like a nice approach to avoid just relying on AI.
What do you think?
New Dev Intros 🎉
Congrats on becoming a member of the r/PracticalTesting community 🎉
Every great software community starts with people like you - developers who care about building, testing, and shipping great software products.
This space is all about practical testing: real-world approaches, useful tools, lessons learned, and honest discussions about what actually works (and what doesn’t).
Whether you’re here to learn, share your experience, or ask questions — you’re in the right place.
To get started:
- Introduce yourself 👋
- Share what you’re currently working on
- (Optionally) Tell us more about your background/experience in testing
Let’s build a community where testing is not just theory, but something that truly helps us ship better code 🚀
Takeaways from the book "Unit Testing: Principles, Practices, and Patterns"
I am reading "Unit Testing: Principles, Practices, and Patterns" by Vladimir Khorikov right now. The main idea that stuck with me is to focus on test value instead of chasing coverage numbers or clever frameworks.
Source: "Unit Testing: Principles, Practices, and Patterns" by Vladimir Khorikov
The book pushes hard on making tests about behavior and risk rather than about methods and branches. Really great book! Highly recommend it.
CloudBees Smart Tests is now GA - using AI test intelligence in CI?
CloudBees just announced general availability of Smart Tests, their AI driven test intelligence product for CI/CD.
Source: https://www.cloudbees.com/newsroom/cloudbees-smart-tests-brings-control-to-ai-generated-code
The pitch is simple - instead of running every test on every change, Smart Tests learns which tests matter most for a given commit and runs those first.
Given how much AI generated code is now flowing through pipelines, this feels like a pretty important direction for test tooling.
WDYT?
paper on “systemic flakiness” - flaky tests are not random noise
There is a 2025 paper called “Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures”.
👉 https://arxiv.org/abs/2504.16777
They looked at 10,000 test suite runs from 24 Java projects and found 810 flaky tests. The key claim is that flaky tests often fail in clusters that share root causes. They call this pattern “systemic flakiness”.
About 75 percent of flaky tests in their dataset belonged to some cluster.
They show that fixing a shared cause can remove many flaky tests at once. Common causes were unstable networks and flaky external dependencies.
We should search for shared root causes, not only patch single tests. This could be very relevant for teams that drown in flaky UI or API suites.
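One practical consequence: when a whole cluster fails because of a single unstable external dependency, one shared fixture can stabilize all of those tests at once. A rough pytest sketch (the canned payload and the `requests.get` stub are just an illustration):

```python
# conftest.py - stub the unstable external API once, for the whole suite.
import pytest
import requests

class _FakeResponse:
    status_code = 200
    def json(self):
        # Canned payload instead of a flaky network call.
        return {"status": "ok"}

@pytest.fixture(autouse=True)
def stub_external_api(monkeypatch):
    """Every test that hits the external service gets a stable canned answer."""
    monkeypatch.setattr(requests, "get", lambda url, **kwargs: _FakeResponse())
```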
Thoughts on “The Pyramid of Unit Testing Benefits”?
I went back to Gergely Orosz’s article “The Pyramid of Unit Testing Benefits” and it hit harder than before.
👉 https://blog.pragmaticengineer.com/unit-testing-benefits-pyramid/
He talks about how unit tests start with basic validation but then stack into better design, living documentation, safer refactors, and faster iteration over time.
The idea that the real payoff shows up years later might explain why experienced devs fight hard to keep tests, while juniors often see them as a chore.