u/ryunuck


[D] Reinforcement Learning from Epistemic Incompleteness (RLEI): would this work?

hi friends, this is just a shot in the dark but I can't stop thinking about it right now:

Have you ever considered doing RLVR on grammar induction with autoregressive LLMs? (triggered by a prompt)

Another way to think of it is discrete autoencoding: using tokens to encode models, rewarding density and shorter description length while penalizing loss of content and information.
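To make that concrete, here is a rough sketch of what the verifiable signal could look like (everything here is a placeholder I made up, not an existing API): the model emits a token code for some source, a second pass reconstructs from the code alone, and the reward trades reconstruction fidelity against code length.

```python
# Hedged sketch of the discrete-autoencoding reward; `similarity` is whatever
# verifiable fidelity metric you trust (ROUGE, embedding cosine, exact-match
# probes, ...) and is assumed to be provided.
from typing import Callable

def rlvr_reward(source: str,
                code_tokens: list[int],
                reconstruction: str,
                similarity: Callable[[str, str], float],
                length_penalty: float = 0.01) -> float:
    fidelity = similarity(source, reconstruction)       # penalize loss of content/information
    density_cost = length_penalty * len(code_tokens)    # reward shorter description length
    return fidelity - density_cost
```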

During RLVR the weights self-steer towards a regime in which they are increasingly programmable by the tokens, and converge on a structure that acts more like a generator for new latent spaces, configured ephemerally by the tokens.

The representations of these models in tokens are alien, yet more transparent and inspectable than weights for AI interpretability and safety. Does that all make sense? Theoretically this is what was actually wanted back then from mesa-optimizer capability.

Operations on these models occur in context, emergently, through inference. For example, packing a model is an A ∪ B type operation, which you can think of as <object>...</object> fences whose contents might look something like

∃∀⌬⇒∈ΣΞ:⇔Θ∈Ψ(⇓φΩ), ∫d∆ ∀Ω∈Σ:∀Ξ∉Ϲ(ΦΩΠ⇌Θ⊗Ψ), ∀Ψ∉Σ:∀ΦΨΣ(ΠϝΣ϶ΣΨ), ∀Ξ∉϶:∀ΣΦΠ(ΦΩϨΠϡ), ∫dϴ ∀ϵ∈Ρ:∀Ψ∉Ϯ(Ϭϭ϶⌬ϬΣ), ∀ΦϳΠ:∀Π∈ϴ(Φ⊕ΣΘϿ), ∀ΠϲΣ:∀ΨϳϹ(ϲ⌬ω⊕ΨΠ), ∫dΩ ∀ϱ∈Σ:∀Φ∈Σ(ΠϫΨ), ∀ϵϱϲ:∀ϻΠΦ(ϵ⊗ϧΒϴ), ∀Φϱϴ:∀Ϭϵϵ(Σ∈Ψϵϯ), ∀ΦπϿ:∀θϳΨ(ϱϳϬϵϻ), ∫dΨ ∀ϯ∈ϕ:∀ΠϴΨ(Ϥ⊗ϴΨΚϷ), ∀Ϭϩϵ:∀σπϣ(Ϡϝϴϸ⊗Ϡϸ), ∀ϿΨϷ:∀Ψϲϭ(ϻ∈ϭ⊗ϽÞΣ), ∀ϴΠϾ:∀ϠϦϭΦ(ϴ∉ϬΦΨϢ), ∫dσ ∀϶∈Π:∀ΠϮϣϳ(Ϧ⊗δϮϬϧ), ∀ΦϷϭ:∀ϲ϶ϳ(Ϲ⊕ϯ↻ΓϦ), ∀θϦϤ:∀ϴ∈ΨϬϬ(ϱ≈Φϳϧ), ∀ΠϿϳ:∀Ϭ∉Π(ϱ∈Ϧ⊕ϭι), ∫dΣ ∀ϧ∈Π:∀ϣϳϧ(ΦΣϵϧΣΨ), ∀ϵϷϼ:∀Ϧ∈ϳϧ(ϾϢϹΦΠϲ), ∀ϼΘΨ:∀ϬϷΠ(ϹΘΦϣϱ), ∀ϽϠϦ:∀ϦϴϿ(ϧΘϺϴϮ), ∫dΩ ∀ϤΘΦϺ:∀ϳΨϭ(Θ⊗ϭϣϲϺ), ∀ϤϹϣ:∀ϢϳϹ(ϦΦϾΘϠ), ∀ϣϯϩ:∀Ϯϴϰ(ϣΞϴΣϲ), ∀ϡϥΨ:∀ϿΘϣ(ϴΣ϶ΘϥϾ), ∫dϺ ∀ϦϨϦϥ:∀ϴΣϽ(ΣΨϵ⇒ϭϴ), ∀ϲϺϱ:∀ΨϴΣ(ΘϠϲϷΨ), ∀ΨϬϦ:∀Ϥ∈ϭ(Φ⊗ΨΠΠΣ), ∀ϴϠϾ:∀ΨϿΠ(ϥϔΦΦϨϤϵ), ∫dϯ ∀ϥϦϹ:∀ϭϭϳ(ΨϳυϽϣ), ∀ϡϺϵϲ:∀ϿΨΦϦ(Ϥ⊗ϡϿϦΠ), ∀ϥϢϺΨ:∀ΘϿΦ(Ϥ϶
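By "operations occur in context" I mean the pack/union operation itself would just be prompted; a hedged sketch of what that could look like (the fence format and wording are my own guesses at the interface, nothing canonical):

```python
# Hypothetical prompt builder for the A ∪ B "pack" operation over two token-codes.
def pack_prompt(code_a: str, code_b: str) -> str:
    return (
        "<object>\n" + code_a + "\n</object>\n"
        "<object>\n" + code_b + "\n</object>\n"
        "Emit one <object> equal to the union A \u222a B of the two codes above, "
        "preserving their content at minimal length.\n"
        "<object>\n"
    )
```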

I would pretrain the interface with reconstruction/distillation first, then use RL to shrink and stabilize the code (both are verifiable rewards).
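Roughly the schedule I'm imagining, written out as a sketch (phase names, step counts and objectives are placeholders, not a real trainer config):

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    steps: int
    objective: str

schedule = [
    # Stage 1: supervised reconstruction/distillation so the model learns the
    # interface at all -- emit a code, rebuild the source from the code alone.
    Phase("pretrain_interface", steps=50_000, objective="reconstruction + distillation"),
    # Stage 2: RL on the same verifiable signal, now pressuring the code to
    # shrink and stabilize while holding reconstruction fidelity.
    Phase("compress_and_stabilize", steps=200_000, objective="fidelity - length_penalty * len(code)"),
]

for phase in schedule:
    print(f"{phase.name}: {phase.steps} steps, objective = {phase.objective}")
```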

Since the weights already encode vast information about the world, the hope is that creativity is more a matter of composition and structure. Your context-level models then act like rich compositional indices over the high-dimensional knowledge and features embedded in the weights.

This should take us out of RLVR and into RLEI, where the reward is intrinsic. With RLVR you can only reward what you can verify, and that doesn't extend to everything we care about.

In RLEI, the reward signal is generated by the model's own representations. The model knows where a representation is incomplete because there is a clear measure: it costs more tokens. Uncertainty is entropy. A governing law the model finds that explains a thousand observations costs fewer tokens than a thousand individually encoded observations plus the Bayesian uncertainty around them.
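Toy numbers to show what I mean by the token-cost measure (the figures are invented purely to illustrate the comparison, MDL-style): a law plus its residual uncertainty is far cheaper than encoding every observation on its own, and that gap is the intrinsic reward.

```python
n_obs = 1000
bits_per_raw_obs = 20        # cost to encode one observation with no law at all
law_bits = 500               # cost of stating the governing law itself
residual_bits_per_obs = 2    # leftover (Bayesian) uncertainty per observation under the law

naive_cost = n_obs * bits_per_raw_obs                  # 20,000 bits
law_cost = law_bits + n_obs * residual_bits_per_obs    # 2,500 bits

# Incompleteness shows up directly as extra length; closing that gap is the reward.
print(naive_cost - law_cost)  # 17500
```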

It sounds unbelievable, but if instead of asking "let's test if this is real" we asked "how do I make this real", I think we'd discover that many obstacles are actually implementation details: finding the right schedule, hyperparameters and policies. Hoping to discuss this in more detail here before I start training. Cheers
