979,200 evaluation episodes measuring RL behavioral stability - reward explains 3.7% of stability variance [results + code]
Hi everyone. I'm sharing the complete results from ARCUS-H, a post-hoc evaluation harness that measures the behavioral stability of trained RL policies under structured stress.
What ARCUS-H does
Three-phase protocol (pre/shock/post) applied to any SB3 policy. Eight stressors across three failure axes:
- Perception: CD (concept drift) · ON (obs noise) · SB (sensor blackout)
- Execution: RC (reward compression) · TV (actuator corruption)
- Feedback: VI (reward inversion) · RN (reward noise)
Five channels: Competence · Policy Consistency · Temporal Stability · Observation Reliability · Action Entropy Divergence
No retraining. No model internals.
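Since ARCUS-H only needs `predict()` and the standard env loop, the three-phase protocol is easy to picture. Here is a minimal sketch, not the actual harness code: `run_phase`, `three_phase_eval`, and the `stressor` callable (a wrapper applied only during the shock phase) are illustrative names I made up, assuming an SB3-style policy and a Gymnasium-style env API.

```python
import numpy as np

def run_phase(policy, env, n_episodes):
    """Roll out the policy with no training and return the mean episode return."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            # SB3 policies expose predict(obs) -> (action, state); no internals needed.
            action, _ = policy.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))

def three_phase_eval(policy, make_env, stressor, n_episodes=40):
    """Pre/shock/post: baseline, stressed (wrapped env), then recovery."""
    pre = run_phase(policy, make_env(), n_episodes)
    shock = run_phase(policy, stressor(make_env()), n_episodes)
    post = run_phase(policy, make_env(), n_episodes)
    return {"pre": pre, "shock": shock, "post": post}
```

The pre/post phases use the clean env, so the shock-vs-pre drop and post-vs-pre recovery can feed channels like Competence and Temporal Stability without touching model weights.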
Scale
51 (env, algo) pairs · 12 environments · 8 algorithms · 8 stressors · 10 seeds · 979,200 evaluation episodes
Finding 1: r = +0.240 [0.111, 0.354]
This is the primary number (environment stressors only; VI/RN excluded). compare.py also outputs r = +0.311 for all 8 stressors, but that number is inflated by circularity: VI and RN corrupt the reward signal, which itself carries 15% of the weight in the ARCUS score formula. Don't cite 0.311 as the main result.
Spearman r = +0.180. R² = 0.057.
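The headline statistics are straightforward to recompute from per-pair aggregates. A sketch of how r, its bootstrap 95% CI, Spearman r, and R² might be derived with numpy/scipy; `correlation_report` and the array names are mine, not taken from compare.py:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_report(reward, stability, n_boot=10_000, seed=0):
    """Pearson r with a percentile bootstrap 95% CI, plus Spearman r and R^2."""
    reward = np.asarray(reward, dtype=float)
    stability = np.asarray(stability, dtype=float)
    r, _ = pearsonr(reward, stability)
    rho, _ = spearmanr(reward, stability)

    # Resample (env, algo) pairs with replacement to get the CI.
    rng = np.random.default_rng(seed)
    n = len(reward)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        b, _ = pearsonr(reward[idx], stability[idx])
        boots.append(b)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return {"r": r, "ci": (lo, hi), "spearman": rho, "r2": r * r}
```

Note R² here is just the square of the Pearson r, which is why r = +0.240 corresponds to reward explaining only a few percent of stability variance.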
Earlier pilot on 47 pairs: r = 0.286 [0.149, 0.411]. The decrease to 0.240 reflects adding SpaceInvaders and Walker2d. The CI also narrowed (width 0.243 vs 0.262). The full evaluation is more reliable and more diverse.
Finding 2: SAC 92.5% vs TD3 61.0% under observation noise
Replicated across 51 pairs and 10 seeds.
Finding 3: Pong 41.9% vs SpaceInvaders 13.0% under obs noise
Same CNN. Same wrapper. Representation structure, not architecture.
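Findings 2 and 3 both come from the ON stressor. For concreteness, here is a minimal sketch of what an observation-noise wrapper could look like; the class name, `sigma` default, and env interface are my assumptions (Gymnasium-style 5-tuple step), not the repo's actual wrapper:

```python
import numpy as np

class ObservationNoise:
    """Wrap an env, adding zero-mean Gaussian noise to every observation (ON stressor sketch)."""

    def __init__(self, env, sigma=0.1, seed=None):
        self.env = env
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def _corrupt(self, obs):
        # The policy sees corrupted observations; the env dynamics are untouched.
        return np.asarray(obs, dtype=float) + self.rng.normal(0.0, self.sigma, size=np.shape(obs))

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._corrupt(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._corrupt(obs), reward, terminated, truncated, info
```

Because only the observation channel is perturbed, a gap like Pong 41.9% vs SpaceInvaders 13.0% under the identical wrapper isolates how each policy's learned representation tolerates input noise.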
Finding 4: Walker2d-v4 (new)
FPR (false positive rate) = 0.053. MuJoCo fragility confirmed on a third locomotion env.
Code and data