r/reinforcementlearning

Multi-armed Bandits

Hi all, I wanted to get some insights on a problem that I'm trying to model as a bandit. I'm fairly new to the subject, so if I'm saying nonsensical things, please explain. The idea is that pulling an arm gets you a reward, but that reward depends on factors that change over time, so pulling the same arm again won't give the same reward. I tried epsilon-greedy, and the results sort of make sense. But I'm unsure whether UCB or Thompson sampling with a Gaussian model would be appropriate, because there is no need to keep pulling an arm whose reward is low after it has been tried only a few times. Given the reward design, a low reward already indicates that the arm need not be pulled. Such arms may only need to be visited occasionally (as in epsilon-greedy). So would this behavior only be a cold-start problem, and would Thompson sampling eventually learn not to pick the arm? And how soon would "eventually" be? I would appreciate any insights, and I can clarify more if needed. Thanks!
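
One way to get a feel for "how soon" is to simulate it. The sketch below is a minimal Gaussian Thompson sampler with a forgetting factor, so old observations fade and drifting rewards can be tracked. The class name, the `discount`, `noise_var` and `prior_var` parameters, and the toy reward means are all assumptions for illustration, not part of the problem described above.

```python
import math
import random


class DiscountedGaussianTS:
    """Gaussian Thompson sampling with exponentially discounted statistics.

    Assumes a N(0, prior_var) prior per arm and a known reward noise variance.
    """

    def __init__(self, n_arms, discount=0.99, noise_var=1.0, prior_var=10.0):
        self.discount = discount
        self.noise_var = noise_var
        self.prior_var = prior_var
        self.counts = [0.0] * n_arms  # discounted pull counts
        self.sums = [0.0] * n_arms    # discounted reward sums

    def select(self, rng):
        samples = []
        for n, s in zip(self.counts, self.sums):
            # Conjugate Gaussian posterior over the arm's mean reward
            precision = 1.0 / self.prior_var + n / self.noise_var
            mean = (s / self.noise_var) / precision
            samples.append(rng.gauss(mean, math.sqrt(1.0 / precision)))
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Fade every arm, so rarely visited arms slowly regain uncertainty
        self.counts = [self.discount * n for n in self.counts]
        self.sums = [self.discount * s for s in self.sums]
        self.counts[arm] += 1.0
        self.sums[arm] += reward


def simulate(steps=2000, seed=0):
    rng = random.Random(seed)
    true_means = [0.0, 0.5, 2.0]  # toy values, purely illustrative
    agent = DiscountedGaussianTS(n_arms=3)
    pulls = [0, 0, 0]
    for _ in range(steps):
        arm = agent.select(rng)
        agent.update(arm, rng.gauss(true_means[arm], 1.0))
        pulls[arm] += 1
    return pulls


pulls = simulate()
print(pulls)
```

With `discount=1.0` this reduces to ordinary stationary Thompson sampling, where a clearly bad arm is abandoned after a handful of pulls. A discount below 1 trades some of that for occasional revisits, which is roughly the "occasionally visited" behavior described above.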

reddit.com
u/Leather_Amount_2268 — 15 hours ago

Agent Systems - Discussion

What do y'all think of the new "agentic" era, where you pay $200 to Anthropic to automate a simple task? I really like the idea of automation with reasoning models, but it seems that now everyone can build one. I don't feel comfortable in the current market; it feels like a dystopia.

As a reinforcement learning enthusiast in this sub, do you think this is the lowest moment of humanity? (I do.)
How long do you think this "era" is going to last? Is it forever?

I am really sad about 2026, honestly. I just think of the line from "The Incredibles":

And when everyone is super...  no one will be!

u/Volta-5 — 17 hours ago

Isaac Lab GPU recommendation

Hey guys, I’m new to this whole subject. As the title says, I need help upgrading my GPU.

I’m working on my capstone mechanical engineering project, a quadrupedal robot. I decided a few weeks ago that it needed to be trained using Isaac Lab. Currently I have Isaac Sim 6 and Isaac Lab 3 in a container on my laptop with a 2070.

I’m switching to a desktop, but which do you guys think is the better GPU for this software: a 3060 12GB or a 3080 10GB?

u/EstateMinimum — 20 hours ago
▲ 8 r/reinforcementlearning+1 crossposts

I built a backprop-free RL agent using Hebbian plasticity + Predictive Coding: it nearly matches standard deep RL on Pong (57% vs. 59%)

Neuroscience question that motivated this: can the kinds of learning rules we actually see in the brain (Hebbian plasticity, predictive coding, distributional dopamine signals) be sufficient for a real control task?

I tested this on Pong with a fully backprop-free agent:

  • Predictive Coding (Rao & Ballard 1999) for visual feature learning
  • Distributional Hebbian plasticity for value estimation, inspired by Dabney et al. 2020 (the finding that dopamine neurons encode a full distribution over future reward, not just a scalar)

Results: BioAgent reaches 57% vs. PPO's 59%. Close, but self-play training exposed a hard problem: Hebbian rules that adapt fast also forget fast under non-stationary opponent dynamics. The plasticity–stability dilemma shows up immediately.

The dopamine-inspired distributional encoding helped stability compared to a scalar baseline, which I found interesting because it suggests the distributional coding might have a functional role beyond just representing uncertainty.
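
For readers unfamiliar with the Dabney et al. 2020 idea, here is a minimal sketch of a distributional value code learned with a purely local rule. It is not the code from the linked repo: the function name, the `taus`, the learning rate and the toy bimodal reward are all assumptions. Each unit scales positive and negative prediction errors differently, so the population spreads out over the reward distribution instead of collapsing to one scalar mean.

```python
import random


def expectile_update(values, taus, reward, lr=0.02):
    """One local update per unit: positive and negative prediction errors
    are scaled asymmetrically (tau vs. 1 - tau), so each unit settles on a
    different expectile of the reward distribution."""
    new_values = []
    for v, tau in zip(values, taus):
        delta = reward - v                       # unit-specific prediction error
        scale = tau if delta > 0 else 1.0 - tau  # optimistic vs. pessimistic gain
        new_values.append(v + lr * scale * delta)
    return new_values


rng = random.Random(0)
taus = [0.1, 0.3, 0.5, 0.7, 0.9]                 # pessimistic -> optimistic units
values = [0.0] * len(taus)
for _ in range(20000):
    reward = 1.0 if rng.random() < 0.5 else 0.0  # toy bimodal reward
    values = expectile_update(values, taus, reward)

print([round(v, 2) for v in values])
```

No gradient flows between units here; each one only needs its own prediction error, which is what makes this kind of rule compatible with a backprop-free setup.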

Code: github.com/nilsleut/Biologically-Plausible-RL-Plays-Pong

Curious what people think about the plasticity–stability angle: Is there a biological mechanism for stabilising Hebbian rules under non-stationarity that I'm missing?

u/ConfusionSpiritual19 — 22 hours ago
▲ 7 r/reinforcementlearning+4 crossposts

self-promotion thread

I’m working on a small open repo focused on physics-informed AI for manufacturing.

The goal is not to release a production model, but to create lightweight templates for deciding whether a manufacturing workflow is actually AI-ready: clear inputs/outputs, controllable variables, feedback loops, sparse-data constraints, and where physics priors may help.

Would appreciate feedback from people working on ML for physical systems, scientific ML, or industrial AI.

Repo: https://github.com/programmablemanufacturing/programmable-manufacturing-lab

▲ 17 r/reinforcementlearning+1 crossposts

Remote MuJoCo / Robotics RL opportunity — contractor role

I recently joined Alignerr for a different technical role and noticed they’re looking for people with hands-on MuJoCo / robotics simulation / reinforcement learning experience.

The role seems best suited for people who have worked with MuJoCo, MJCF/XML, Gymnasium/dm_control, reward shaping, PPO/SAC/TD3, physics debugging, and robot control.

It’s remote contractor work. I don’t want to oversell it because project availability can vary, but the listed rate is high and it may be worth checking out if you already have this background.

I have a referral link, but only reach out if you genuinely have MuJoCo/RL experience — this probably isn’t a beginner-friendly role.

u/Asimpleyoungkid — 1 day ago

Looking for an RL study/project accountability partner

Hey folks,

I'm in the midst of some interview prep and learning RL somewhat from scratch (right now working through Spinning Up, trying to code/derive some algos myself, and building a few example projects). I've found that having accountability is really helpful for making sure progress is made.

Anyone in the same boat who wants an accountability partner? I imagine daily/regular check-ins, progress on learning/projects (aka a mini "build in public"), feedback on each other's plans, and even some collaboration.

If so, DM me. Thanks!

u/temp12345124124 — 3 days ago

When Chaos Wins: noisy net eval with noise off gave wildly inconsistent results. Turning it back on fixed everything.

Running a Rainbow DQN ablation on Snake (C51 + dueling + noisy nets). When I evaluated checkpoints with noise off (mean weights, sigma zeroed out, the standard approach), the scores were all over the place. Some checkpoints averaged 78, others averaged 18. Training curve at those same points was perfectly stable.

First instinct was a bug. Checked everything. It wasn't.

The worst case was at ep450K. Deterministic eval produced a bimodal distribution: ~25% of episodes scored near zero, ~75% scored above 80. The average was 59 but that number is meaningless with two separate peaks and nothing in between.

What's happening: the mean-weight policy has traps. Game states where Q-values for two actions are nearly identical. Without noise, the agent picks the same action every time. If it's the wrong one, it loops and dies. 25% of starting states consistently hit these traps.

Same checkpoint, same seeds, noise turned back on: bimodal failure mode vanished entirely. p25 jumped from 2 to 59. Average went from 59 to 73. Std dropped from 42 to 26. This held at every checkpoint from ep50K through ep450K. Stochastic eval beat deterministic eval across the board.
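
A minimal sketch of the kind of per-checkpoint summary that makes a bimodal failure mode visible where a bare mean hides it. The function name, the `near_zero` threshold and the synthetic scores are assumptions for illustration; they are not the actual eval data.

```python
import statistics


def summarize_scores(scores, near_zero=5.0):
    """Report distribution statistics for one checkpoint, not just the mean."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),
        "p25": q1,
        "median": median,
        "p75": q3,
        # Share of episodes that died almost immediately
        "frac_near_zero": sum(s <= near_zero for s in scores) / len(scores),
    }


# Synthetic bimodal example (made-up numbers): 25% near zero, 75% at 80
scores = [0.0] * 25 + [80.0] * 75
summary = summarize_scores(scores)
print(summary)
```

Comparing `p25` and `frac_near_zero` between noise-on and noise-off runs on the same seeds is a quick way to see whether a change in the mean comes from the whole distribution shifting or from one mode disappearing.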

The noise isn't residual exploration overhead. The agent learned a policy where the sigma values are functional. They provide just enough Q-value perturbation to prevent degenerate action loops. Zero them out and you get a policy that's strictly worse than what the agent actually learned.
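
To make the tie-breaking effect concrete, here is a toy sketch. It uses independent Gaussian noise on the weights and no bias term, whereas real NoisyNet layers typically use factorized noise and noisy biases, so treat the class and the numbers as assumptions. The state is chosen so that the mean weights give two actions the same Q-value.

```python
import random


class NoisyLinear:
    """Toy noisy layer: effective weight = mu + sigma * eps, eps ~ N(0, 1)."""

    def __init__(self, mu, sigma):
        self.mu = mu        # mean weights, one row per action
        self.sigma = sigma  # learned noise scales, same shape as mu

    def forward(self, x, rng=None):
        """Passing rng=None zeroes the noise (deterministic eval)."""
        q_values = []
        for mu_row, sigma_row in zip(self.mu, self.sigma):
            q = 0.0
            for m, s, xi in zip(mu_row, sigma_row, x):
                eps = rng.gauss(0.0, 1.0) if rng is not None else 0.0
                q += (m + s * eps) * xi
            q_values.append(q)
        return q_values


def greedy(q_values):
    # max() returns the first maximum, so exact ties always resolve to the
    # lowest action index
    return max(range(len(q_values)), key=q_values.__getitem__)


layer = NoisyLinear(mu=[[1.0, 0.5], [1.0, 0.5]],
                    sigma=[[0.1, 0.1], [0.1, 0.1]])
state = [1.0, 2.0]  # both actions have mean Q-value 2.0
rng = random.Random(0)

deterministic_actions = {greedy(layer.forward(state)) for _ in range(200)}
noisy_actions = {greedy(layer.forward(state, rng)) for _ in range(200)}
print(deterministic_actions, noisy_actions)
```

With noise off, the tie always resolves the same way, which is the loop-forever situation. With noise on, both actions get picked, so the agent can escape the state.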

Snake makes this especially acute because a single wrong turn at length 100+ is immediately fatal. The deterministic traps are lethal in a way they wouldn't be in more forgiving environments.

One caveat: at one very late checkpoint where sigma had grown extremely large, stochastic eval finally dropped below deterministic. There's a productive zone for noise magnitude, and past it the noise becomes destructive. So it's not "always evaluate with noise." It's "don't assume deterministic eval is automatically the ground truth."

Has anyone else seen this kind of eval divergence with noisy nets? Curious whether it's specific to tight spatial environments like Snake or shows up more broadly.

u/statphantom — 3 days ago
▲ 78 r/reinforcementlearning+2 crossposts

github: https://github.com/amathislab/musclemimic

MuscleMimic is a JAX-based motion imitation learning research benchmark specifically designed for biomechanically accurate muscle-actuated models. It focuses on advancing research in muscle-driven locomotion and manipulation through high-performance neural policy training. 

u/CharlieLee666 — 5 days ago

How should I plan my learning path for reinforcement learning courses?

Hi everyone, I have a question about planning my reinforcement learning studies.

I'm currently a sophomore majoring in a non-CS field. My math background includes calculus, probability and statistics, linear algebra, and some mathematical analysis. I want to start learning reinforcement learning, but according to many recommendations, it seems I may also need additional math courses such as ODEs, real analysis, stochastic processes, etc.

Is that really necessary at my current stage? Or would it be better to learn those topics along the way?

I'd also appreciate any suggestions about how to study reinforcement learning itself (courses, prerequisites, learning path, etc.). So far, the only programming language I’m comfortable with is Python.

u/AddressFancy3675 — 3 days ago
▲ 11 r/reinforcementlearning+4 crossposts

ML with Finance

Hi, I am an MTech student in computer science. I want to work on machine learning in the finance domain. Can you suggest some research topics I could work on for my final-year thesis? During my MTech, my main focus has been machine learning and deep learning. I am also interested in finance, and I did a project along the lines of https://github.com/Zdong104/FNSPID_Financial_News_Dataset with market regimes. Now I am looking for a solid research topic for my final year. Any suggestions?

u/Gullible_Space_4070 — 4 days ago

Teaching Humans using Expert RL Policies

RL is powerful enough to train superhuman policies, especially in video games. But is there any research on how to leverage RL's policy/value networks to improve human training speed? How can we apply behavioral cloning to humans?

Past research has shown that simply providing a human with optimal moves doesn't improve their pattern recognition or performance, it only increases their reliance on the feedback, making them worse.

Humans use some form of RL to learn motor skills and are more sample-efficient than algorithms. So, using guidance from expert policies, we can teach humans to learn along optimal trajectories, reducing time wasted in exploration.

Surely, with the help of value predictions, one can determine whether an action was suboptimal, helping solve the credit assignment problem. But what are the optimal ways to signal that to a human (e.g., provide a number on the screen, display red/green colors, or perhaps electrocute them)?
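
As a rough sketch of the red/green idea, the gap between the expert policy's best action value and the value of the action the human took can be turned into a coarse signal. The function name and the `tolerance` threshold are made up for illustration, and the Q-values are assumed to come from some trained expert network.

```python
def feedback_signal(q_values, chosen_action, tolerance=0.1):
    """Map the value gap between the best action and the human's action
    to a coarse colour signal. `tolerance` is a hypothetical threshold."""
    gap = max(q_values) - q_values[chosen_action]
    colour = "green" if gap <= tolerance else "red"
    return colour, gap


q_values = [0.2, 1.0, 0.75]  # toy expert Q-values for one state
print(feedback_signal(q_values, chosen_action=1))
print(feedback_signal(q_values, chosen_action=2))
```

Showing only the colour rather than the optimal move keeps the human choosing for themselves, which may matter given the reliance effect mentioned above.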

u/MaxedUPtrevor — 4 days ago

Is RL post-training in 'imagined environments' a path to continual RL? Trying to understand this deeper

I've been reading more about training in imagined environments, especially the work of the Dreamer series and RialTo, and I'm curious about how this could apply to CL.

Take an example of a robot deployed in a home that notices it has a high failure rate when picking up a specific object (let's say cans in a kitchen). It then builds a world model of the kitchen from its deployment data, generates can-grasping rollouts within it, RL post-trains in the imagined env, and then deploys the new policy.

This feels like continual learning to me? But formal continual learning seems to be more about task sequences (learn A, then learn B, then measure forgetting on A), and the example I'm describing doesn't fit into that. I'm not sure if what I'm describing is deployment-time adaptation, imagined replay for CL, self-improvement loops, or some mix.

Two things I'd like takes on:

  1. Is anyone updating the world model itself continually from deployment data, not just the policy? Most of what I've read keeps the world model frozen post-training.
  2. What breaks first when you actually try the closed loop (deploy → world model update → imagined rollouts → policy update → deploy)? My guess is world model drift compounds but haven't seen it characterized.

Curious what others think.

u/No_Bat_7448 — 6 days ago
▲ 525 r/reinforcementlearning+1 crossposts

Bimo’s walking model now runs natively on a Raspberry Pi Pico at 5ms inference time!

This is Bimo walking completely standalone: no data cable, no external compute, just a battery and an RP2040 (custom board) running the walking policy natively at ~5.2ms inference time.

The main walking model trains on thousands of parallel environments in Isaac Lab. That policy gets distilled down to a tiny student network and compiled directly into the MCU firmware.

Here's the pipeline:

  1. Train a standard 256×128×64 teacher model in Isaac Lab (~5min on an RTX 4080)
  2. Distill it into a 64×32 student network (~30s, yep, I was surprised too)
  3. Export as pure C using onnx2c
  4. Compile into the RP2040 firmware via Arduino IDE
  5. Inference runs at 5.0-5.2ms, comfortably within the 50ms control loop
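
To put the teacher and student sizes in perspective, here is a back-of-the-envelope sketch. The observation and action dimensions are hypothetical (Bimo's real ones are not stated here), and the layers are assumed to be plain fully connected layers with biases stored as float32.

```python
def mlp_param_count(obs_dim, hidden_sizes, act_dim):
    """Weights plus biases of a fully connected MLP."""
    sizes = [obs_dim, *hidden_sizes, act_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


OBS_DIM, ACT_DIM = 32, 8  # hypothetical, not Bimo's real dimensions

teacher = mlp_param_count(OBS_DIM, [256, 128, 64], ACT_DIM)
student = mlp_param_count(OBS_DIM, [64, 32], ACT_DIM)
student_bytes = student * 4     # float32 weights
loop_fraction = 5.2 / 50.0      # worst-case inference time / control period

print(teacher, student, student_bytes, loop_fraction)
```

Under these assumed dimensions the student is roughly an order of magnitude smaller than the teacher, and inference uses about a tenth of the 50ms control period.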

The full distillation pipeline, the standalone MCU inference code, and the Bimo API ported to ROS2 nodes are all coming in the next update (v1.1). ROS2 was a direct request from the last Reddit post, so that's in.

Has anyone else run RL locomotion policies natively on an MCU? How small have you made the student network before significantly degrading performance?

If you want to follow the development, join the Discord server, all updates go there first. Code update to v1.1 will be available on GitHub soon.

u/mishaurus — 9 days ago

Why do people seldom use GPU-based simulator benchmarks in online RL algorithm papers?

Well-known benchmarks (dm-control, og-bench, humanoid-bench, etc.) are based on CPU simulators, and they are extremely slow.

To publish a paper with a novel RL algorithm, we need to run multiple seeds (at least 5) on each benchmark, and we also have to do some ablations. I think hyperparameter tuning and ablation tests take too long on CPU-based simulator benchmarks.

But recent GPU-based simulator benchmarks (mujoco-mjx, isaac gym, isaac lab, mujoco-playground) make training much faster. These alternatives are good for testing algorithms and tuning hyperparameters, but I couldn't find recent online RL algorithm papers (e.g., DIME, https://arxiv.org/abs/2502.02316) that use these benchmarks.

u/Vegetable_Pirate_263 — 6 days ago
▲ 9 r/reinforcementlearning+1 crossposts

From Fusion 360 to IsaacLab: training a custom robot with reinforcement learning

Hi everyone,

I recently worked on a small project where I designed a custom robot in Fusion 360 and trained it in IsaacLab using reinforcement learning.

USDZ File

CAD File

The robot is a wheeled biped-style platform. After creating the CAD model, I converted it into a simulation-ready asset, set up the joints, and used it for stabilization and jump-recovery tasks in IsaacLab.

What I found most interesting was how much the physical design affects the learning process. Things like joint placement, link length, wheel contact, collision shapes, inertia, and actuator settings all had a noticeable impact on whether the robot could learn stable behavior.

The first task was basic stabilization, where the robot learns to maintain its posture. I also tested a jump-and-stabilize task, where the robot needs to recover after a more dynamic motion.

This made me realize that building a robot for RL is not just about making a nice-looking CAD model. The morphology, physics properties, and simulation setup are all part of the learning problem.

The workflow was roughly:

Fusion 360 → asset preparation → joint setup → IsaacLab training → policy evaluation

I’m planning to extend this robot to more tasks, including wheeled balance control, push recovery, locomotion, turning, navigation, and object interaction.

I wrote a longer post with more details about the design process and what I learned from training it in IsaacLab.

Stabilize Task

Jump & Stabilize Task

u/Ok-Video-2620 — 7 days ago

Currently experimenting with exploration policies for deep RL on Super Mario Bros - Agent beats all levels I threw at it

I've been playing with deep reinforcement learning for a while. I originally started with a simple DQN, added all improvements from the Rainbow paper, and finally changed C51 for a quantile regression (and plan to swap it for an Implicit Quantile Network).

After implementing C51 (which was my first time with distributional RL) I started playing with policies that take advantage of the learned distributions : By independently taking N samples from each action-value distribution, scoring actions by averaging the samples, and picking the greedy action with respect to these scores, I was able to make the agent learn faster than similar agents using only NoisyNets or an epsilon-greedy policy (I'm still using NoisyNet, this is done on top of it). In the limiting cases, N=1 is just Thompson Sampling and N=+Infinity is just a plain greedy policy.

Finding an optimal value for N proved to be a challenge, so I decided to pick a random value for it at the start of each episode (N = 2**rng.uniform(8,12) for a QR-DQN with 32 quantiles/action works well in my experiments), which led to even better results.
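
A minimal sketch of the sampling policy described above, assuming a QR-DQN-style output where each action has a list of equally weighted quantile values, so a uniform draw over the quantiles is a sample from that action's distribution. The function names and the toy quantiles are illustrative, not the actual training code.

```python
import random


def sample_mean_scores(quantiles, n_samples, rng):
    """Score each action by the mean of n_samples independent draws
    from its quantile set."""
    scores = []
    for action_quantiles in quantiles:
        draws = rng.choices(action_quantiles, k=n_samples)
        scores.append(sum(draws) / n_samples)
    return scores


def select_action(quantiles, n_samples, rng):
    """Greedy action with respect to the sampled scores."""
    scores = sample_mean_scores(quantiles, n_samples, rng)
    return max(range(len(scores)), key=scores.__getitem__)


def episode_n(rng, low=8, high=12):
    """Per-episode N, drawn as 2**uniform(low, high)."""
    return int(2 ** rng.uniform(low, high))


rng = random.Random(0)
# Toy state: action 0 is risky (mean 1.0), action 1 is safe (mean 1.5)
quantiles = [[0.0, 0.0, 0.0, 4.0], [1.5, 1.5, 1.5, 1.5]]

thompson_like = {select_action(quantiles, 1, rng) for _ in range(400)}
near_greedy = {select_action(quantiles, 4096, rng) for _ in range(20)}
n_this_episode = episode_n(rng)
print(thompson_like, near_greedy, n_this_episode)
```

With N=1 the risky action is still chosen whenever its single draw lands on the high quantile. With N in the thousands the sample means are close to the true means and the choice is effectively greedy.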

I later found out about DLTV which made the agent discover new behaviors, but performed worse than previous experiments overall. Inspired by it, I tried something I did not find in previous works and got the best results out of all my previous experiments :

At each time step, compute an exploration_score as the ratio of "intra-action variance" over "inter-action variance" (rendered latex equation). I then take N/exploration_score samples from each distribution, and pick an action as described above. (more details at the end of this post)
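
Since the equation did not survive in this copy of the post, the sketch below is only one plausible reading of the ratio: "intra-action variance" taken as the average variance of each action's quantiles, and "inter-action variance" as the variance of the per-action means. The function names, the `eps` guard and the `min_samples` floor are assumptions.

```python
import statistics


def exploration_score(quantiles, eps=1e-8):
    """Ratio of intra-action variance to inter-action variance
    (one possible definition, see the assumptions above)."""
    means = [statistics.fmean(q) for q in quantiles]
    intra = statistics.fmean([statistics.pvariance(q) for q in quantiles])
    inter = statistics.pvariance(means)
    return intra / (inter + eps)  # eps avoids dividing by zero


def samples_for_step(base_n, score, min_samples=1, eps=1e-8):
    """Number of samples per distribution: N / exploration_score."""
    return max(min_samples, int(base_n / max(score, eps)))


# Toy state: per-action spread 1.0, spread between action means 4.0
score = exploration_score([[0.0, 2.0], [4.0, 6.0]])
print(score, samples_for_step(1024, score))
```

Under this reading, a state where the actions are clearly separated gives a small score and many samples (close to greedy), while a state where the distributions overlap heavily gives a large score and few samples (close to Thompson sampling).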

For anyone reading this, I have a few questions :

  1. Are you aware of any previous work I missed that tries similar exploration policies with distributional RL (interpolating between Thompson sampling and the greedy policy)
  2. Most papers I found about learning from multiple exploration policies seem to be in the context of multi-actor parallelization. Is there any novelty in randomizing the policy parameters at the start of each episode, especially in the single-actor case ?
  3. Is any part of what I'm doing worth the time it would take to quantitatively evaluate it ? I've been doing it mainly for learning and fun and have only qualitatively evaluated it so far. However, if there's a chance I can contribute to the field, I'll gladly make some time to compare it to published papers on ALE.

=======================

I actually track a moving average and standard deviation of the exploration score, which lets me shift/rescale its values to a target average and standard deviation, and divide N by the shifted/rescaled value. I initially started with a target average of 1 and standard deviation of 1 as well (which gave good results), then tried randomizing these parameters at the start of each episode as well. This led to a lot more diversity in the policies and even better results.

Since this worked so well, I additionally randomized the noise strength in the NoisyNet layers.

Overall, this made the agent a lot more robust to deviating from what it considers to be the optimal trajectory, and allowed it to learn complex behaviors previous iterations were never able to learn (e.g. taking a few steps back to gain momentum, waiting for good cycles, or dodging hammer bros)

=======================

For anyone interested, I made a live stream of the training in progress with graphs and some more details on the experiments I'm running. The current training run was started 8 days ago, and the agent is able to finish all stages (it's not finishing them all every try though)

=======================

Edit : formatting

=======================

Edit 2 : More details :

Available actions : The agent does not have access to the up and down buttons, the available actions only use left, right, A and B.

Adding the down button would double the total number of actions (because down can be pressed on top of all available actions).

Reward function : It mainly consists of reward(t) = max(0, x(t) - previous_best_x) + a larger reward for beating a stage. I had to tweak the scaling of both components.

I initially had penalties for time and death, but one made the agent suicidal in front of hard-to-overcome obstacles, while the other made it fear them too much and hug the left side of the screen. Removing both proved to increase the performance.

One trick that seems to help with most '*-3' levels (which have a lot of void to fall into) was to hold the reward while the vertical velocity of Mario is negative (meaning it is falling). Without this trick, the agent would sometimes get stuck learning to jump the farthest it can into the void.
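
A rough sketch of the reward shaping described above. The scales, the bonus value and the exact handling of the falling case are assumptions: here "holding" is read as paying no progress reward and not advancing the best x while Mario is falling, so the progress is paid out once the fall ends.

```python
def shaped_reward(x, best_x, vertical_velocity, stage_cleared,
                  progress_scale=1.0, clear_bonus=100.0):
    """Return (reward, new_best_x) for one time step.

    progress_scale and clear_bonus are hypothetical values.
    """
    reward = clear_bonus if stage_cleared else 0.0
    if vertical_velocity < 0:
        # Falling: hold the progress reward until the fall is over
        return reward, best_x
    progress = max(0.0, x - best_x)
    return reward + progress_scale * progress, max(best_x, x)


print(shaped_reward(10.0, 5.0, 0.0, False))    # new ground covered
print(shaped_reward(12.0, 10.0, -1.0, False))  # falling, reward held
print(shaped_reward(12.0, 10.0, 0.0, False))   # landed, progress paid out
```

If Mario falls into the void, the episode ends before the held progress is ever paid, which removes the incentive to jump as far as possible into a pit.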

Stage scheduling : Each episode is one attempt on one level. At the start of each episode, a stage is randomly picked with probability proportional to 1/(number of times the stage was beaten) among the unlocked stages. Each stage is unlocked after the previous one has been beaten 30 times, with only 1-1 unlocked at the start of the training.
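
The scheduling rule can be sketched as follows. The function names are illustrative, and the `+ 1` in the weight is an assumption added so that a stage that has never been beaten does not cause a division by zero; the post does not say how that case is handled.

```python
import random

UNLOCK_THRESHOLD = 30  # wins on the previous stage needed to unlock the next


def unlocked_stages(stage_order, times_beaten):
    """Stages available for sampling, in order, starting with the first."""
    unlocked = [stage_order[0]]
    for previous, stage in zip(stage_order, stage_order[1:]):
        if times_beaten.get(previous, 0) < UNLOCK_THRESHOLD:
            break
        unlocked.append(stage)
    return unlocked


def pick_stage(stage_order, times_beaten, rng):
    """Sample an unlocked stage, favouring rarely beaten ones."""
    stages = unlocked_stages(stage_order, times_beaten)
    # Assumption: 1 / (1 + wins) instead of 1 / wins, to handle zero wins
    weights = [1.0 / (1 + times_beaten.get(s, 0)) for s in stages]
    return rng.choices(stages, weights=weights, k=1)[0]


stage_order = ["1-1", "1-2", "1-3"]
times_beaten = {"1-1": 30, "1-2": 3}
rng = random.Random(0)
print(unlocked_stages(stage_order, times_beaten))
print(pick_stage(stage_order, times_beaten, rng))
```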

Available stages : The first iterations of the agent were unable to learn maze castles (4-3, 7-3 and 8-4), so I removed them all. The reward function will give rewards for the first path the agent tries, then the agent will be teleported back by the game and no reward is received until it finds the right path and gets past the point where the game teleported it back. I plan to test newer (better) versions of the agent on these stages only and see if mazes can be re-added to the pool.

I've also removed underwater stages (2-2 and 7-2). The agent can learn them fine, but the game dynamics are really different from all other stages and they're really boring to watch. Since I already removed a bunch of stages, I figured I could remove these as well but I may re-add them with mazes because beating every level is cooler than beating a cherry-picked selection.

Since 8-4 is the only stage that requires going down a pipe, I considered it was not worth it to add the down action and will likely never re-add it to the pool, which would unfortunately be really anti-climactic...

Replay buffer warm-up : After initially using the standard approach of filling the buffer with transitions sampled from a random policy before training the neural net, I came up with a "soft warm-up" scheme in which the first gradient updates happen after only 2000 transitions, but initially happen every few thousand transitions and gradually become more frequent until the replay buffer is full. Together with my custom exploration policy, this works very well : the agent very quickly starts behaving similarly to a "right + random button" policy before learning to actually jump and run.

Custom n-step bootstrapping : When I first implemented n-step bootstrap targets, I used n=3 from the Rainbow paper, noting the same instabilities as the paper did for higher n values. I then found the Retrace(\lambda) paper which seems to successfully address this by increasing n until the online network disagrees with the action choice from a stored transition. This makes n larger where the replay buffer data is on-policy, and smaller when it becomes off-policy. Since my GPU is already maxed and the training is already slow (20.8t/s when real-time is 20t/s) I could not afford the additional computations (building a training sample (s(t), a(t), sum(r(t+0..n)), s(t+n)) needs up to n_max transitions to go through the online network).

I'm trying to achieve similar sample efficiency gains by using cheaper alternatives as proxies for "how off-policy is a given transition" : I'm using the number of times a transition has been sampled, with n = int(max(n_min, n_max * k**times_sampled)) ; 0<k<1. The currently running experiment uses n_max=14, n_min=1 and k=1/1.3. I'm pretty sure it helps early in the training, and it does not collapse like a constant n=14 does.
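
The decay schedule is short enough to write out directly. The function name is illustrative; the formula and the defaults are the ones quoted above.

```python
def n_for_transition(times_sampled, n_max=14, n_min=1, k=1 / 1.3):
    """Bootstrap horizon for a transition, shrinking each time it is replayed."""
    return int(max(n_min, n_max * k ** times_sampled))


schedule = [n_for_transition(t) for t in range(12)]
print(schedule)
```

A fresh transition uses the full n_max, and after roughly ten replays the horizon has shrunk to plain 1-step targets.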

Stream setup : As I said, this is something I do for my own fun, and I really wanted to be able to see the agent learn in real time. The code runs a separate process, to which frames from training episodes are sent in a queue. The process then sends the frames as raw RGB24 to a local UDP socket, to which GStreamer connects and encodes the stream. With a simple MediaMTX configuration, I can manage the GStreamer process and have the stream available through WebRTC on my LAN.

Then I figured someone else might have fun watching this, so I added a line to my MediaMTX config to send the stream to Twitch and YouTube. The overlay is a headless browser displaying custom HTML/JS (using d3.js for the graphs) piping raw frames to ffmpeg. GStreamer handles compositing the two streams together into the side-by-side view.

u/pcouy — 9 days ago

Help with reinforcement learning Pick & Place

I am currently trying to get into reinforcement learning. About two months ago I managed to build a curriculum that teaches my UR10e robot to reach a target to within about 6cm.

Ever since then I have attempted to teach it to pick and place, i.e. start at the home position, move towards the block, grasp the block, and move it above a threshold or to the target.

In those two months I haven't really made any progress, and none of my attempted improvements have produced results.

I am wondering if someone with more success could review my code for anything I could change because I have been stumped on this and have no clue what to try next.

Or give me a working example similar to my own, or tips on changes, any advice honestly.

What's the issue? If I limit learning to stage 0 (reach a point 20cm above the block), it reaches a 100% success ratio in about 1000-2000 episodes, but when I load the saved model and inspect the results, it reaches the target only about 30% of the time (success means within 6cm of the target; failures are a bit farther, up to 13cm away). I honestly don't know why.

If I then add stage 1, it falls apart: after 1000 episodes it reaches 20% success, then falls to 3% and stays at 3-10%.

Stage 2 wasn't even tested much because I struggle with stage 0 and stage 1 as is.

ur10e robot arm, 2f85 gripper, Stable baselines 3, gymnasium-robotics, mujoco, SAC+HER curriculum, 1000-2000 episodes with 1000 timesteps each

I have already tried increasing training to something like 10k+ episodes, but it just gets stuck at 2k episodes and falls to 0%.

https://github.com/OverlordDestro/ur10e_HER_SAC_SB3_GYM

u/Lord_Destro — 7 days ago