u/statphantom — reddlx

Over 1,000 Unique Downloads - Thank You!

Three months ago I posted here that Stat's Cuphead Randomiser was finally live, and today it just hit over 1,000 unique downloads. When I released it I honestly had no idea how it would be received, so to see it reach this milestone is genuinely overwhelming. What makes me even happier is that in three months of being out in the wild, not a single bug has been reported. For a mod of this complexity that hooks deep into the game's systems, that's something I'm incredibly proud of.

A huge shoutout to u/GioTGM for suggesting I upload to GameBanana as well as Nexus Mods - more than half of all downloads have come from there, so that advice made a real difference. And thank you to everyone who downloaded it, tried it, left a comment, or just viewed the original post. The support from this community is what pushed me to finish and polish it to the level it's at today.

Original release post for anyone who missed it: https://www.reddit.com/r/Cuphead/comments/1rfx4ke/randomiser_released/



u/statphantom — 3 days ago

▲ 6 r/reinforcementlearning

When Chaos Wins: noisy net eval with noise off gave wildly inconsistent results. Turning it back on fixed everything.

Running a Rainbow DQN ablation on Snake (C51 + dueling + noisy nets). When I evaluated checkpoints with noise off (mean weights, sigma zeroed out, the standard approach), the scores were all over the place. Some checkpoints averaged 78, others averaged 18. Training curve at those same points was perfectly stable.

First instinct was a bug. Checked everything. It wasn't.

The worst case was at ep450K. Deterministic eval produced a bimodal distribution: ~25% of episodes scored near zero, ~75% scored above 80. The average was 59 but that number is meaningless with two separate peaks and nothing in between.
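To make the "meaningless average" point concrete, here is a minimal sketch using synthetic scores. The numbers are made up purely to mimic the shape described above (a quarter of episodes near zero, the rest above 80) and are not the actual eval data.

```python
import statistics

# Synthetic scores, NOT real eval results: chosen only to mimic the
# described shape (about 25% near zero, about 75% above 80).
scores = [2] * 25 + [85] * 75

mean = statistics.mean(scores)
std = statistics.pstdev(scores)
p25, p50, p75 = statistics.quantiles(scores, n=4)

# How many episodes actually land within 10 points of the mean?
near_mean = sum(1 for s in scores if abs(s - mean) <= 10)

print(f"mean={mean:.1f} std={std:.1f} p25={p25} p50={p50} p75={p75}")
print(f"episodes within 10 points of the mean: {near_mean}")
```

No episode in this toy sample scores anywhere near the mean, which is why percentiles (or a histogram) describe a bimodal eval better than a single average.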

What's happening: the mean-weight policy has traps. Game states where Q-values for two actions are nearly identical. Without noise, the agent picks the same action every time. If it's the wrong one, it loops and dies. 25% of starting states consistently hit these traps.

Same checkpoint, same seeds, noise turned back on: the bimodal failure mode vanished entirely. p25 jumped from 2 to 59. Average went from 59 to 73. Std dropped from 42 to 26. This held at every checkpoint from ep50K through ep450K. Stochastic eval beat deterministic eval across the board.

The noise isn't residual exploration overhead. The agent learned a policy where the sigma values are functional. They provide just enough Q-value perturbation to prevent degenerate action loops. Zero them out and you get a policy that's strictly worse than what the agent actually learned.
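A toy sketch of the tie-breaking effect, under loud assumptions: a single noisy linear layer with independent Gaussian noise (real NoisyNets typically use factorised noise), and hypothetical weights and state picked so that two actions are almost tied under the mean weights. None of these values come from the actual agent.

```python
import random

def q_values(x, mu, sigma, rng=None):
    """Q-values from one noisy linear layer with weights w = mu + sigma * eps.

    With rng=None the noise is switched off and only the mean weights are used.
    """
    qs = []
    for mu_row, sigma_row in zip(mu, sigma):
        q = 0.0
        for xi, m, s in zip(x, mu_row, sigma_row):
            eps = rng.gauss(0.0, 1.0) if rng is not None else 0.0
            q += (m + s * eps) * xi
        qs.append(q)
    return qs

def greedy(qs):
    """Index of the highest Q-value (first index wins on exact ties)."""
    return max(range(len(qs)), key=lambda a: qs[a])

# Hypothetical state and weights: the two actions are nearly tied
# under the mean weights (Q is about 0.500 vs 0.495).
x = [1.0, 0.5]
mu = [[0.40, 0.20], [0.39, 0.21]]
sigma = [[0.05, 0.05], [0.05, 0.05]]

# Noise off: the same state always maps to the same action.
det_actions = {greedy(q_values(x, mu, sigma)) for _ in range(100)}

# Noise on: the near-tie is broken differently from visit to visit.
rng = random.Random(0)
noisy_actions = [greedy(q_values(x, mu, sigma, rng)) for _ in range(1000)]

print("deterministic actions seen:", det_actions)
print("share of action 1 with noise:", sum(noisy_actions) / len(noisy_actions))
```

If the deterministic choice happens to lead back into the same state, the noise-off policy repeats it forever, while the noisy policy eventually takes the other branch.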

Snake makes this especially acute because a single wrong turn at length 100+ is immediately fatal. The deterministic traps are lethal in a way they wouldn't be in more forgiving environments.

One caveat: at one very late checkpoint where sigma had grown extremely large, stochastic eval finally dropped below deterministic. There's a productive zone for noise magnitude, and past it the noise becomes destructive. So it's not "always evaluate with noise." It's "don't assume deterministic eval is automatically the ground truth."

Has anyone else seen this kind of eval divergence with noisy nets? Curious whether it's specific to tight spatial environments like Snake or shows up more broadly.


u/statphantom — 3 days ago

▲ 8 r/reinforcementlearning

Removing PER from Rainbow DQN improved performance on Snake. New record of 153 on 20×20 grid.

Greetings all! I'm running a systematic Rainbow DQN ablation on Snake (20×20 grid), adding one component at a time. The most surprising result so far: removing Prioritised Experience Replay (PER) from full Rainbow didn't just match performance, it set a new record.

Full Rainbow (with PER): record 134

C51 without PER (everything else identical): record ~~153~~ 156

Controlled eval at ep50K (20,000 episodes, deterministic, same seeds): C51 without PER outperformed full Rainbow across every percentile. avg +45%, p50 +35%, p90 +39%. Zero overlap between segment distributions.

Tested across 5 seeds. Individual seeds are noisy with occasional flips, but the mean across all 5 favours removing PER.

What I think is the reason: Snake is a dense-reward task. Food is frequent, TD errors are relatively uniform across the buffer, and 2048 parallel environments already ensure replay diversity. PER's priority mechanism has nothing meaningful to prioritise. Meanwhile the IS weight correction still suppresses gradients. You pay the overhead without the benefit.
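For readers who want the two quantities spelled out, here is a minimal sketch of the standard proportional PER formulas: sampling probability P(i) = p_i^alpha / sum_k p_k^alpha and importance-sampling weight w_i = (N * P(i))^(-beta), normalised by the largest weight. The alpha, beta, and eps values and the TD errors below are assumptions for illustration, not the settings from this run.

```python
def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """Proportional PER: sampling probabilities and normalised IS weights.

    alpha, beta and eps are illustrative defaults, not this project's settings.
    """
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    probs = [p / total for p in priorities]
    n = len(td_errors)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    weights = [w / max_w for w in raw]
    return probs, weights

# Hypothetical TD errors: one buffer with uniform errors, one with an outlier.
uniform = [0.5] * 8
skewed = [0.1] * 7 + [5.0]

u_probs, u_weights = per_probs_and_weights(uniform)
s_probs, s_weights = per_probs_and_weights(skewed)

# Average gradient scale actually applied when sampling by priority.
s_expected_weight = sum(p * w for p, w in zip(s_probs, s_weights))

print("uniform probs:", [round(p, 3) for p in u_probs])
print("skewed probs: ", [round(p, 3) for p in s_probs])
print("skewed weights:", [round(w, 3) for w in s_weights])
print("expected weight under skewed sampling:", round(s_expected_weight, 3))
```

With uniform TD errors the sampling distribution collapses to plain uniform replay, so prioritisation has nothing to reorder. Whenever some transitions do get sampled more often, their updates are scaled down by a weight below 1.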

This is consistent with the context of Hessel et al.'s original result. Their finding that PER was a top-2 Rainbow component was measured on Atari, which is sparse-reward with high TD error variance. Snake is roughly the opposite. Pan et al. and Ivgi et al. have independently documented similar PER underperformance on dense-reward tasks.

Previous best published peer-reviewed result on 20×20 Snake was 62 (Sebastianelli et al., 2021). The 153 is 2.5× that.

Has anyone else observed PER underperforming on dense-reward tasks? Curious whether this generalises beyond Snake. I'm planning to test on Tetris next.


u/statphantom — 11 days ago

▲ 4 r/pytorch

I needed to unpack bit-packed uint8 tensors on GPU for a replay buffer in a reinforcement learning project. Naturally I reached for torch.unpackbits to match NumPy's np.unpackbits.

It doesn't exist. Like, at all. Calling torch.unpackbits raises AttributeError. There's been an open feature request on GitHub since 2020 (issue #32867), still not implemented.

So I went looking for community solutions and found this bitmask approach:

# x: 1D uint8 tensor of packed bytes; mask holds the bit values 1, 2, 4, ..., 128
mask = 2 ** torch.arange(8, dtype=torch.uint8, device=x.device)
unpacked = (x.unsqueeze(-1) & mask).bool().int().flip(dims=[1])

This works. It preserves the original bit values, converts to binary via .bool().int(), and flips the bit order to match MSB-first convention. Four operations, correct output. But it only handles 1D input and breaks on batched (B, packed_size) tensors, which is exactly what I needed for sampling from a replay buffer.

I also don't need to preserve the original mask values, I just need 0s and 1s. I thought I could do better, and I wouldn't be a programmer if I didn't try for no other reason except... I wanted to?

Here is the solution I came up with:

shifts   = torch.arange(7, -1, -1, device=packed.device, dtype=torch.uint8)
unpacked = ((packed.unsqueeze(-1) >> shifts) & 1).reshape(B, -1)[:, :n_elems]

Two operations. Each packed byte is broadcast against shift values [7, 6, 5, 4, 3, 2, 1, 0]. Right-shifting moves each bit into the LSB position, bitwise & with 1 isolates it. Already MSB-first because the shifts descend, so no .flip(). No .bool().int() because >> shift & 1 always produces 0 or 1 directly. Handles batched input out of the box.

Half the operations, no intermediate bool/int tensors allocated in VRAM, and works on (B, packed_size) without modification. Will removing two ops make a difference? Probably not, but I saw the opportunity and took it.
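For anyone who wants to sanity-check the bit order without a GPU, here is a plain-Python mirror of the shift-and-mask idea. It is only a sketch: pack_bits and unpack_bits are hypothetical helper names, it works on lists of ints rather than tensors, and it assumes MSB-first packing with zero padding in the last byte.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, MSB first, zero-padding the last byte."""
    padded = bits + [0] * (-len(bits) % 8)
    out = []
    for i in range(0, len(padded), 8):
        byte = 0
        for b in padded[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return out

def unpack_bits(packed, n_elems):
    """Shift each byte by 7..0 and mask with 1, then drop the padding bits."""
    shifts = range(7, -1, -1)
    bits = [(byte >> s) & 1 for byte in packed for s in shifts]
    return bits[:n_elems]

# Hypothetical 10-element binary state, packed into 2 bytes.
state = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
packed = pack_bits(state)
restored = unpack_bits(packed, len(state))

print(packed)
print(restored == state)
```

The descending shifts are what give MSB-first output directly, and the final slice plays the same role as [:, :n_elems] in the tensor version: it removes the padding added to fill the last byte.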

My use case was a bit-packed replay buffer for deep RL where binary game states are packed at 1 bit per element for a 6.4x memory reduction vs uint8. Sampling from GPU-resident packed storage needs unpacking on every training step, so fewer allocations do matter at scale.

Every search result I found for this problem gives the bitmask version. Figured I'd share since it took me a while to find any solution at all.


u/statphantom — 17 days ago