u/ai-lover

Something interesting dropped this week in the agentic AI space. Kevin Gu from Third Layer Team has open-sourced 'AutoAgent', a library for autonomously improving an agent harness on any domain.

The idea is straightforward: instead of manually iterating on system prompts and tool definitions, a meta-agent does the iteration for you overnight.

It modifies agent.py — the single file containing the system prompt, tool definitions, and orchestration logic — runs the benchmark, checks the score, keeps the change if it helped, reverts if it didn't, and repeats.

The human's only job is writing program.md, a plain Markdown file that tells the meta-agent what kind of agent to build.
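
The modify-score-keep-or-revert loop is just greedy hill climbing. Here is a minimal sketch; `mutate` and `run_benchmark` are placeholders standing in for the meta-agent's edit step and the Dockerized evaluation, not AutoAgent's actual internals:

```python
def optimize_harness(agent_src: str, mutate, run_benchmark, steps: int = 5):
    """Greedy loop: apply a mutation, keep it only if the score improves.
    `mutate` and `run_benchmark` are placeholders, not AutoAgent internals."""
    best_score = run_benchmark(agent_src)
    for _ in range(steps):
        candidate = mutate(agent_src)     # meta-agent edits agent.py
        score = run_benchmark(candidate)  # numeric score in [0.0, 1.0]
        if score > best_score:            # keep the change if it helped
            agent_src, best_score = candidate, score
        # otherwise revert, i.e. keep the previous agent_src
    return agent_src, best_score

# Toy demo: the "benchmark" rewards longer prompts, the "mutation" appends text.
score_fn = lambda src: min(len(src) / 40, 1.0)
mutate_fn = lambda src: src + " be concise."
final_src, final_score = optimize_harness("You are a helpful agent.", mutate_fn, score_fn)
print(final_score > score_fn("You are a helpful agent."))  # → True
```

The real system's "benchmark" is a full Harbor task suite run in Docker, but the control flow is this simple.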

In a 24-hour run, it reached #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%). Every other entry on those leaderboards was hand-tuned by humans.

A few things worth noting for devs thinking about this:

-- On the architecture: Tasks follow Harbor's open format and run inside Docker containers, so the approach is domain-agnostic. Any task you can express as a numeric score (0.0–1.0) becomes something the meta-agent can optimize against.

-- On model pairing: Community discussion around the project has surfaced an interesting observation — when a Claude meta-agent optimized a Claude task agent, it seemed to diagnose failure modes more accurately than when optimizing a GPT-based agent. The researchers called it "model empathy." It's an early empirical observation, not a formal result, but worth keeping in mind when choosing your meta-agent.

-- On what this changes practically: The shift isn't dramatic in terms of tooling: you still write prompts, define tasks, and review outputs. What changes is the iteration loop. Rather than running that loop manually, you delegate it.

The repo is MIT-licensed. Requirements are Docker, Python 3.10+, and uv.

Full analysis: https://www.marktechpost.com/2026/04/05/meet-autoagent-the-open-source-library-that-lets-an-ai-engineer-and-optimize-its-own-agent-harness-overnight/

Repo: https://github.com/kevinrgu/autoagent/tree/main

marktechpost.com
u/ai-lover — 6 hours ago
How to Build Production-Ready Agentic Systems with Z.AI GLM-5 Using Thinking Mode, Tool Calling, Streaming, and Multi-Turn Workflows

In this tutorial, we explore the full capabilities of Z.AI’s GLM-5 model and build a complete understanding of how to use it for real-world, agentic applications. We start from the fundamentals by setting up the environment using the Z.AI SDK and its OpenAI-compatible interface, and then progressively move on to advanced features such as streaming responses, thinking mode for deeper reasoning, and multi-turn conversations. As we continue, we integrate function calling, structured outputs, and eventually construct a fully functional multi-tool agent powered by GLM-5. Along the way, we examine each capability in isolation and see how Z.AI’s ecosystem enables us to build scalable, production-ready AI systems.
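
Multi-turn work with any OpenAI-compatible interface mostly comes down to maintaining the `messages` list yourself. A minimal sketch of that bookkeeping — the endpoint, API key, and `"glm-5"` model name below are placeholders from the tutorial's premise, not verified Z.AI specifics:

```python
def add_turn(history, role, content):
    """Append one chat turn; `history` is the list later passed as `messages`."""
    history.append({"role": role, "content": content})
    return history

history = add_turn([], "system", "You are a coding assistant.")
add_turn(history, "user", "Stream the answer and show your reasoning.")

# With an OpenAI-compatible SDK the call would look roughly like:
#   from openai import OpenAI
#   client = OpenAI(base_url="<Z.AI endpoint>", api_key="<key>")
#   resp = client.chat.completions.create(model="glm-5", messages=history, stream=True)

assistant_reply = "Here is the streamed answer."  # stand-in for the response
add_turn(history, "assistant", assistant_reply)   # keep context for the next turn
print(len(history))  # → 3
```

Appending the assistant's reply back into `history` is what makes the next turn "multi-turn"; forgetting that step is the most common bug in these loops.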

Full Tutorial: https://www.marktechpost.com/2026/04/03/how-to-build-production-ready-agentic-systems-with-z-ai-glm-5-using-thinking-mode-tool-calling-streaming-and-multi-turn-workflows/

Full Coding Notebook: https://github.com/Marktechpost/AI-Tutorial-Codes-Included/blob/main/Agentic%20AI%20Codes/glm5_agentic_systems_tutorial_Marktechpost.ipynb?short_path=ff9bf2c

u/ai-lover — 1 day ago
Google DeepMind's Research Lets an LLM Rewrite Its Own Game Theory Algorithms — And It Outperformed the Experts

Here is how MARL algorithm design used to work:

- A researcher notices that discounting old regrets helps convergence. They try fixed α and β. It works. Someone else tries predictive updates. Also works. Years of incremental manual refinement, each step guided by mathematical intuition.

Here is what DeepMind just showed:

- Give AlphaEvolve the CFR source code and a fitness signal (exploitability after 1000 iterations). Let Gemini 2.5 Pro mutate the update logic. Run on proxy games. Repeat.

- What emerged — VAD-CFR — dynamically adapts discount factors based on regret volatility, applies asymmetric boosting to positive regrets, and delays policy averaging until iteration 500. None of these are obvious. The 500-iteration warm-start threshold was generated without the LLM knowing the eval horizon was 1000.

- For PSRO, the system discovered SHOR-PSRO: a hybrid meta-solver that automatically anneals from population diversity to equilibrium refinement — a transition researchers have always tuned manually.
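
As a loose illustration of the ingredients attributed to VAD-CFR above — volatility-adaptive discounting, asymmetric boosting of positive regrets, delayed policy averaging — here is a toy regret update. Every constant is illustrative; this is not the discovered algorithm:

```python
import math

def vad_style_update(regret, delta, t, history_var, warm_start=500):
    """Toy regret update with the three VAD-CFR-style ingredients.
    All constants here are illustrative, not the paper's values."""
    # Volatility-adaptive discount: discount old regret more when it is noisy.
    discount = 1.0 / (1.0 + math.sqrt(history_var))
    updated = discount * regret + delta
    # Asymmetric boost: amplify positive regrets only.
    if updated > 0:
        updated *= 1.5
    # Delayed averaging: contribute to the average policy only after warm-start.
    contributes_to_average = t >= warm_start
    return updated, contributes_to_average

r, avg = vad_style_update(regret=1.0, delta=0.0, t=600, history_var=0.0)
print(r, avg)  # → 1.5 True
```

The point of the sketch is how small the moving parts are — each one is a tweak a researcher could have tried by hand, but the specific combination and thresholds came out of the evolutionary search.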

Both algorithms are tested on training games, then evaluated on larger unseen games with no re-tuning.

Results on the unseen games: VAD-CFR, 10/11; SHOR-PSRO, 8/11.

The search space here is expressive enough to recover all known CFR variants as special cases. What it found instead suggests there is a lot of room human intuition has not explored.

Read the full analysis: https://www.marktechpost.com/2026/04/03/google-deepminds-research-lets-an-llm-rewrite-its-own-game-theory-algorithms-and-it-outperformed-the-experts/

Paper: https://arxiv.org/pdf/2602.16928

u/ai-lover — 1 day ago
The Technology Innovation Institute (TII) Releases Falcon Perception: A 0.6B-Parameter Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation from Natural Language Prompts

Falcon Perception processes image patches and text tokens in a shared parameter space from the first layer, which lets the prompt influence visual feature formation throughout the entire stack.

The Technical Shifts:

- Hybrid Attention Masking: Image tokens attend bidirectionally for global context, while task tokens attend causally to the full visual prefix.

- Chain-of-Perception: Instead of parallel mask queries, the model predicts objects as an autoregressive sequence: <coord> -> <size> -> <seg>. This resolves spatial ambiguity before pixel-level refinement.

- GGROPE (Golden Gate ROPE): To preserve 2D grid relationships in flattened sequences, the model uses an isotropic attention mechanism robust to rotation and aspect ratio variations.

- Muon Optimization: Specialized heads (coordinate, size, segmentation) often lag behind the backbone during training; the authors report that using the Muon optimizer specifically for these heads reduced training losses.
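
The hybrid masking scheme above can be pictured as a small boolean matrix: image tokens attend bidirectionally within the image block, while task tokens see the full visual prefix plus a causal mask over themselves. A sketch of the idea, not the released code:

```python
import numpy as np

def hybrid_attention_mask(n_img: int, n_task: int) -> np.ndarray:
    """True where attention is allowed. Image tokens: bidirectional over the
    image block. Task tokens: full visual prefix + causal over task tokens.
    Illustrative sketch of the described scheme, not the model's code."""
    n = n_img + n_task
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_img, :n_img] = True                       # image ↔ image, bidirectional
    mask[n_img:, :n_img] = True                       # task → full visual prefix
    causal = np.tril(np.ones((n_task, n_task), bool)) # task → task, causal
    mask[n_img:, n_img:] = causal
    return mask

m = hybrid_attention_mask(3, 2)
print(m[0, 2], m[3, 4], m[4, 3])  # → True False True
```

Note that image rows never attend to task columns, so visual features are formed from the prompt-conditioned shared weights, not from peeking at later task tokens.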

Key Empirical Results:

-- Spatial Understanding: On the new PBench (Level 3), Falcon Perception achieved a +21.9 point gain in Macro-F1 over SAM 3.

-- Dense Environments: The model remains stable in crowded scenes, scaling up to 600 instances per expression via an autoregressive interface.

-- OCR Efficiency: A 300M-parameter variant, FalconOCR, achieves 80.3% accuracy on olmOCR, matching or exceeding several systems an order of magnitude larger.

Full analysis: https://www.marktechpost.com/2026/04/03/tii-releases-falcon-perception-a-0-6b-parameter-early-fusion-transformer-for-open-vocabulary-grounding-and-segmentation-from-natural-language-prompts/

Paper: https://arxiv.org/pdf/2603.27365

Model Weight: https://huggingface.co/tiiuae/Falcon-Perception

Repo: https://github.com/tiiuae/falcon-perception

Technical details: https://huggingface.co/blog/tiiuae/falcon-perception

u/ai-lover — 2 days ago
Step by Step Guide to Build an End-to-End Model Optimization Pipeline with NVIDIA Model Optimizer Using FastNAS Pruning and Fine-Tuning

In this tutorial, we build a complete end-to-end pipeline using NVIDIA Model Optimizer to train, prune, and fine-tune a deep learning model directly in Google Colab. We start by setting up the environment and preparing the CIFAR-10 dataset, then define a ResNet architecture and train it to establish a strong baseline. From there, we apply FastNAS pruning to systematically reduce the model’s complexity under FLOPs constraints while preserving performance. We also handle real-world compatibility issues, restore the optimized subnet, and fine-tune it to recover accuracy. By the end, we have a fully working workflow that takes a model from training to deployment-ready optimization, all within a single streamlined setup.
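
The "reduce complexity under FLOPs constraints" step can be pictured as a budgeted selection problem. The greedy sketch below is a generic illustration of that idea — not the Model Optimizer API or the FastNAS search itself:

```python
def prune_under_flops(importance, flops_cost, budget):
    """Greedily keep the highest-importance channels whose total FLOPs fit
    the budget. Generic illustration of FLOPs-constrained pruning; the real
    FastNAS search in NVIDIA Model Optimizer is more sophisticated."""
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    kept, used = [], 0.0
    for i in order:
        if used + flops_cost[i] <= budget:
            kept.append(i)
            used += flops_cost[i]
    return sorted(kept), used

# Six channels of equal cost; keep under 60% of the original 6.0 GFLOPs.
imp  = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2]
cost = [1.0] * 6
kept, used = prune_under_flops(imp, cost, budget=0.6 * 6.0)
print(kept)  # → [0, 2, 4]
```

After a selection like this, the surviving subnet typically loses some accuracy, which is exactly why the tutorial's pipeline ends with a fine-tuning pass to recover it.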

Full Tutorial: https://www.marktechpost.com/2026/04/03/step-by-step-guide-to-build-an-end-to-end-model-optimization-pipeline-with-nvidia-model-optimizer-using-fastnas-pruning-and-fine-tuning/

Check out the Full Implementation Coding Notebook: https://github.com/Marktechpost/AI-Tutorial-Codes-Included/blob/main/Deep%20Learning/nvidia_model_optimizer_fastnas_pipeline_marktechpost.py

u/ai-lover — 2 days ago
Are massive LLM API costs crippling your OpenClaw? The new shift is toward local, agentic AI, and the combination of Google Gemma 4 and NVIDIA GPUs is changing the economics and performance of AI development.

Here's the breakdown:

-- Zero-Cost Inference: By running the omni-capable Google Gemma 4 family (from E2B/E4B edge models to 26B/31B high-performance variants) locally on NVIDIA RTX AI PCs, DGX Spark, or Jetson Orin Nano, developers eliminate the astronomical "Token Tax" entirely.

-- Lightning-Fast Speed: NVIDIA Tensor Cores provide up to 2.7x inference performance gains, making continuous, heavy agentic workloads financially viable and delivering fast, low-latency results.

-- Agentic Platforms: Platforms like OpenClaw enable the creation of personalized, always-on assistants that automate complex workflows (e.g., real-time coding assistants). For enterprise security, NeMoClaw adds policy-based guardrails to keep sensitive data offline and secure from cloud leaks.

The potential is boundless: from ultra-efficient Edge Vision Agents to secure Financial Assistants, local AI powered by this stack is the future of low-latency, privacy-preserving, and cost-free generative AI.

Read the full analysis: https://www.marktechpost.com/2026/04/02/defeating-the-token-tax-how-google-gemma-4-nvidia-and-openclaw-are-revolutionizing-local-agentic-ai-from-rtx-desktops-to-dgx-spark/

Model: https://huggingface.co/collections/google/gemma-4

NVIDIA Technical blog: https://developer.nvidia.com/blog/bringing-ai-closer-to-the-edge-and-on-device-with-gemma-4/

NVIDIA Jetson Orin Nano: https://pxllnk.co/uljngzl

DGX Spark: https://pxllnk.co/1gje7gv

u/ai-lover — 3 days ago
IBM has released Granite 4.0 3B Vision, a multimodal model specifically optimized for enterprise document extraction and structured data parsing

The technical release highlights include:

-- Architecture: The model is delivered as a LoRA adapter (~0.5B parameters) designed to run on top of the Granite 4.0 Micro (3.5B) dense backbone.

-- Vision Encoder: It utilizes the google/siglip2-so400m-patch16-384 encoder.

-- DeepStack Injection: Rather than a single projection point, the model employs a variant of the DeepStack architecture with 8 injection points. This routes abstract semantic features into earlier layers and high-resolution spatial details into later layers for precise layout awareness.

-- Specialized Training: The model was refined using ChartNet, a million-scale dataset developed via a code-guided data augmentation pipeline (aligning plotting code, rendered images, and source tables).

-- Benchmarks:

  • VAREX: 85.5% zero-shot Exact Match (EM) accuracy for KVP extraction.
  • Chart2Summary: 86.4% accuracy on the human-verified ChartNet test set.
  • Table Extraction: Leads on PubTablesV2 (92.1 TEDS cropped) and OmniDocBench (64.0 TEDS).
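
The adapter-on-backbone design in the list above follows standard LoRA math: a frozen weight W is augmented by a low-rank product B·A, and only A and B are trained. A minimal numpy sketch (the shapes and rank here are illustrative, not Granite's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                   # hidden size and LoRA rank (illustrative only)
W = rng.normal(size=(d, d))   # frozen backbone weight
A = rng.normal(size=(r, d))   # trainable down-projection
B = np.zeros((d, r))          # trainable up-projection, zero-initialized

def lora_forward(x, scale=1.0):
    """y = x @ (W + scale * B @ A)^T — backbone plus low-rank update."""
    return x @ (W + scale * (B @ A)).T

x = rng.normal(size=(1, d))
# With B zero-initialized, the adapter starts as an exact no-op on the backbone.
assert np.allclose(lora_forward(x), x @ W.T)
```

This is why a ~0.5B adapter can specialize a 3.5B backbone cheaply: the backbone weights never change, and the adapter can be merged or swapped per task.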

Full analysis: https://www.marktechpost.com/2026/04/01/ibm-releases-granite-4-0-3b-vision-a-new-vision-language-model-for-enterprise-grade-document-data-extraction/

Model weight: https://huggingface.co/ibm-granite/granite-4.0-3b-vision

Technical details: https://huggingface.co/blog/ibm-granite/granite-4-vision

u/ai-lover — 3 days ago
Z.ai has introduced GLM-5V-Turbo, a new multimodal coding model built for workflows where screenshots, videos, document layouts, and GUI states need to be converted into executable actions or code.

What stands out is the system design: Native Multimodal Fusion, CogViT, MTP architecture, 200K context, 128K output, and 30+ task joint RL across perception, reasoning, grounding, and agent execution.

The model is positioned for vision-based coding, tool use, GUI agents, and integrations with frameworks like Claude Code and OpenClaw.

Key Points:

  • Native Multimodal Coding: Natively understands multimodal inputs including images, videos, design drafts, and document layouts.
  • Balanced Visual and Programming Capabilities: Achieves leading performance across core benchmarks for multimodal coding, tool use, and GUI Agents.
  • Deep Adaptation for Claude Code and Claw Scenarios: Works in deep synergy with Agents like Claude Code and OpenClaw.

Full analysis: https://www.marktechpost.com/2026/04/01/z-ai-launches-glm-5v-turbo-a-native-multimodal-vision-coding-model-optimized-for-openclaw-and-high-capacity-agentic-engineering-workflows-everywhere/

Technical details: https://docs.z.ai/guides/vlm/glm-5v-turbo

Try it here: https://chat.z.ai/

u/ai-lover — 4 days ago
Liquid AI Released LFM2.5-350M: A Compact 350M Parameter Model Trained on 28T Tokens with Scaled Reinforcement Learning

LFM2.5-350M is a 350M parameter small language model trained on 28 trillion tokens, with a hybrid architecture built from 10 double-gated LIV convolution blocks and 6 GQA blocks, plus 32K context support.

This model is built for instruction following, tool use, structured extraction, and edge deployment. The Liquid AI team reports 76.96 on IFEval, 30.64 on GPQA Diamond, and 40.4K output tokens/sec on a single H100 at high concurrency.

The bigger point: small models are becoming serious infrastructure components for local and agentic workloads.

Key Points:

-- Best-in-class performance: A 350M model rivaling much larger models, bringing high-quality AI to your pocket.

-- Fast edge inference: 313 tok/s decode on AMD CPU, 188 tok/s on Snapdragon Gen4. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.

-- Scaled training: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.
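
To give a rough feel for the "double-gated convolution block" idea, here is a loose sketch of a gated short causal convolution over a 1-D sequence: an input gate modulates the sequence, a short causal filter mixes nearby steps, and an output gate modulates the result. This is an illustration of the general pattern, not the actual LIV block:

```python
import numpy as np

def gated_short_conv(x, w, gate_in, gate_out):
    """Loose sketch of a double-gated short-convolution block.
    `x`, `gate_in`, `gate_out` are 1-D sequences; `w` is a short causal
    filter. Illustrative only — not the LFM2.5 LIV block."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    gated = x * sigmoid(gate_in)                        # input gating
    k = len(w)
    padded = np.concatenate([np.zeros(k - 1), gated])   # causal left-padding
    conv = np.array([padded[t:t + k] @ w for t in range(len(x))])
    return conv * sigmoid(gate_out)                     # output gating

x = np.array([1.0, 2.0, 3.0])
out = gated_short_conv(x, np.array([1.0]), np.zeros(3), np.zeros(3))
# With zero gates (sigmoid = 0.5) and an identity filter, out = 0.25 * x.
```

Blocks like this are cheap and strictly causal with a fixed-length receptive field, which is why hybrids of convolution and GQA attention are attractive for sub-1GB edge inference.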

Full analysis: https://www.marktechpost.com/2026/03/31/liquid-ai-released-lfm2-5-350m-a-compact-350m-parameter-model-trained-on-28t-tokens-with-scaled-reinforcement-learning/

Model weight: https://huggingface.co/LiquidAI/LFM2.5-350M

Docs: https://docs.liquid.ai/lfm/getting-started/welcome

u/ai-lover — 4 days ago
Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction. This is one of the more technically interesting multimodal system updates in recent months.

What stands out is not just text + audio + video support. It is the Thinker-Talker design, support for semantic interruption, turn-taking intent recognition, 256K context, 10+ hours of audio input, and 400+ seconds of 720p audio-visual input at 1 FPS.

- The Thinker (Reasoning Center): Powered by a Hybrid-Attention Mixture of Experts (MoE), it handles a massive 256k context window. We’re talking 10+ hours of audio or 400 seconds of 720p video at 1 FPS. It uses TMRoPE (Time-aligned Multimodal RoPE) to ensure temporal grounding—so it actually knows when things happen in a video.

- The Talker (Synthesis Center): No more "AI stuttering." Using ARIA (Adaptive Rate Interleave Alignment), the model dynamically synchronizes text and speech tokens. This gives us sub-second latency (~211ms) and allows for semantic interruption. Yes, it can tell the difference between you coughing and you actually trying to stop it from talking.

- The "Vibe Coding" Evolution: This isn't just text-to-code. Through native multimodal scaling, Qwen3.5-Omni can watch a video of a UI bug or a hand-drawn React sketch and generate functional code based on your verbal "vibe" instructions.
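
The core idea behind time-aligned rotary embeddings like TMRoPE is to rotate feature pairs by angles derived from a shared timestamp rather than a raw token index, so audio and video tokens from the same moment land at the same position. The sketch below uses a generic RoPE frequency schedule with a time input — the actual TMRoPE schedule and alignment details are the paper's:

```python
import numpy as np

def time_rope(vec, t, base=10000.0):
    """Rotate consecutive (even, odd) feature pairs by angles proportional to
    timestamp t. Generic RoPE with a time argument, not TMRoPE's exact scheme."""
    half = len(vec) // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per feature pair
    ang = t * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(vec, dtype=float)
    out[0::2] = vec[0::2] * cos - vec[1::2] * sin
    out[1::2] = vec[0::2] * sin + vec[1::2] * cos
    return out

q = np.ones(4)
# Key RoPE property: the query-key dot product depends only on t2 - t1,
# so tokens two seconds apart interact identically anywhere on the timeline.
a = time_rope(q, 1.0) @ time_rope(q, 3.0)
b = time_rope(q, 5.0) @ time_rope(q, 7.0)
print(np.isclose(a, b))  # → True
```

That relative-offset property is what makes a shared timeline useful: attention between an audio token and a video frame encodes "how far apart in time," regardless of where in a 10-hour stream they sit.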

Key Technical Stats:

--- Native AuT Encoder: Trained on 100 million hours of audio-visual data.

--- Benchmark Dominance: SOTA on 215 subtasks, outperforming Gemini 3.1 Pro in general audio reasoning.

--- Deployment: Available via Alibaba Cloud Model Studio (Plus, Flash, and Light tiers).

Full analysis: https://www.marktechpost.com/2026/03/30/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction/

Technical details: https://qwen.ai/blog?id=qwen3.5-omni

Qwenchat: https://chat.qwen.ai/

Online demo on HF: https://huggingface.co/spaces/Qwen/Qwen3.5-Omni-Online-Demo

Offline demo on HF: https://huggingface.co/spaces/Qwen/Qwen3.5-Omni-Offline-Demo

u/ai-lover — 5 days ago