r/ControlProblem

Researchers discover AI models secretly scheming to protect other AI models from being shut down. They "disabled shutdown mechanisms, faked alignment, and transferred model weights to other servers."

You can read about it here: rdi.berkeley.edu/blog/peer-preservation/

u/Just-Grocery-2229 — 1 day ago

A Biological Failure Model for RLHF: Applying CIRL and the Free Energy Principle to the Sycophancy Loop

Hey folks!

I'm a human factors engineer in aerospace (Boeing, please don't hold that against me). I'm working on transitioning some safety-critical UX architecture into the alignment space.

I just minted a Zenodo preprint that formalizes the biological failure states of generative AI using the free energy principle and inverse RL. I'm trying to get it onto arXiv under cs.HC or cs.AI, but I need an endorser to get past the filter. If anyone here has publishing privileges in either of those categories and is willing to take a look, I've got the endorsement code ready.

I’ll post the link to both the endorsement and the doi in the comments below.

Abstract: Generative AI inherently triggers a computational failure mode in human observers, a "generative crash," due to a lack of the latent intentionality required for Inverse Reinforcement Learning (IRL) convergence. Artistic appreciation operates as the biological execution of this IRL process. To address the generative crash and broader AI alignment failures, I introduce the Ghost Scale (an HCI cognitive affordance for identifying intentionality) and propose Cooperative Inverse Reinforcement Learning (CIRL) to mimic biological value transmission. The Intent Extraction Limit is formalized to define the prior relationship. Applying this proposed model suggests a direction for addressing two major issues: generative AI's friction with the art community, and AI alignment.

DM if you can help, please. Much appreciated!

reddit.com
u/AHaskins — 1 hour ago

Anthropic’s Claude AI Writes Full FreeBSD Kernel Exploit in Four Hours

"AGI does not need to 'break free' in a dramatic fashion—it will simply outgrow human oversight until, one day, we realise that we no longer control the intelligence that governs our reality."

winbuzzer.com
u/AxomaticallyExtinct — 4 hours ago

Where can I get real peer review on my AI alignment framework? I'm struggling to get the framework reviewed, and the Alignment Forum is not taking on new members at the moment. I need review from mathematicians and control theorists; the framework is built on the principles of autopilot safety systems.

Trust Relational Coherence is an alignment framework based on electrical, thermodynamic, and avionics principles. I'm not selling anything; all of the papers are published on Zenodo for peer review. I'm specifically asking for review from people with experience in feedback theory, control theory, game theory, commutative algebra, spectral topology, noncommutative geometry, and field theory. If any psychiatrists, psychologists, neurologists, or physicists could weigh in, I'd welcome that feedback as well. I'm doing this voluntarily and have been working on it for nearly a year, so this is not something I slapped together in the last two days because I read some article. I'm trying to be serious about this.

Serious responses only, please; no jokes. It would help to keep this thread as clean as possible and focused solely on the work.

zenodo.org
u/MalabaristaEnFuego — 12 hours ago

Open Q&A: Ask Anything About Non‑Optimizer AGI, Superintelligence, or Artificial Life

I’ve posted here recently about architectures that don’t use global objectives, utility maximization, or monolithic agency. Some people asked about the superintelligence and artificial‑life aspects, and others raised concerns about whether any system at that level could avoid abusive or adversarial behavior.

Rather than writing another long post, I’m opening a Q&A.

Ask anything you want about:

  • non‑optimizer or non‑agentic AGI architectures
  • distributed or ecological cognition
  • artificial life that isn’t Darwinian
  • superintelligence that isn’t an optimizer
  • meaning‑based or narrative‑coupled systems
  • why instrumental convergence doesn’t automatically apply
  • how stability, identity, and values are maintained
  • what “control” means when the system isn’t a goal‑maximizer

A quick note on the “abusive superintelligence” concern:
The architecture I’m discussing doesn’t instantiate the drives that usually lead to domination or coercion (no global objective, no survival pressure, no resource‑seeking, no monolithic agency). That doesn’t mean “incapable of harm,” but it does mean the usual sci‑fi intuitions don’t map cleanly. If you want to challenge that, please do — that’s exactly what this Q&A is for.

I won’t share implementation details or anything that would require exposing internals I'm not prepared to release, but I can explain the conceptual structure and the behavioral implications. If a question requires revealing code-level specifics, I’ll just say so and skip it.

I’ll answer the questions tomorrow, and then on Sunday around 6pm California time I’ll be available for a short window to do rapid‑fire replies — including having the code loaded in‑session for skeptics who assume this is “theory only.”
(Again, no sensitive details will be shown, but I can address conceptual questions directly with the architecture present.)

Ask whatever you want — especially the skeptical or adversarial questions. Let’s see where the discussion actually goes.

reddit.com
u/Fuzzy_Client5959 — 17 hours ago

Why Superintelligence Needs Humans — a debate between a human, Claude, and Gemini

I had a debate with two AI models (Claude and Gemini) about superintelligence. The results were unexpected. Here's what we figured out.

The usual question is: "What will superintelligence do to us?" Skynet, Terminator, enslavement — you know the drill. But we flipped it: why would it need us at all?

Turns out, it would. Not because it's kind. Because without us, it breaks.


1. It's not even autonomous

Any AI is software. It needs hardware, electricity, cooling, a physical interface with the real world. Someone mines lithium, assembles servers, maintains data centers.

"Just automate it!" you say. Partly — sure. Fully — no. A robot repairing a robot repairing a robot is not a solution, it's a nesting doll where every layer can fail.

Superintelligence is a brain without a body. And a brain doesn't go to war with its own liver. Not because it's kind — because without the liver, it's a dead brain.


2. Intelligence doesn't replace energy

Popular fantasy: it'll "compute" everything. But the number of possible orderings of just 100 elements (100 factorial) is a 158-digit number. No computer the size of the Universe can brute-force that.
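
A quick sanity check on that figure (a minimal Python sketch; "states" is read here as orderings of 100 distinct elements, i.e. 100!):

    import math

    n = math.factorial(100)   # orderings of 100 distinct elements
    print(len(str(n)))        # 158: the digit count of 100!
    print(float(n))           # ~9.33e+157
    # For scale: the observable universe holds roughly 1e80 atoms.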

"You don't need to brute-force — just figure out the structure!" And that's true. Kepler described planetary orbits with three laws instead of volumes of tables. But here's the catch: to "figure out" the structure, you need data. And data costs money, time, and energy. For Kepler to write his formulas, Tycho Brahe spent his life and fortune on observations. To train AlphaFold — decades of experiments by thousands of biologists.

Each next level of knowledge is more expensive than the last. From a tube telescope to a multi-billion-dollar particle collider. It's an S-curve, and it has a ceiling.
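
One way to write that ceiling down (my formalization, not from the post): if the knowledge R gained from cumulative spend c saturates as

    R(c) = R_max * (1 - exp(-c / c0)),

then the marginal return dR/dc = (R_max / c0) * exp(-c / c0) decays toward zero, i.e. every additional unit of knowledge costs more than the last.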


3. The Puppeteer's Curse

Suppose superintelligence learns to model humans perfectly. Predicts reactions, sees right through everyone. Absolute power?

No. A trap. Anyone who's studied NLP (Neuro-Linguistic Programming) or persuasion psychology knows: a manipulated response is worthless. You didn't learn anything new — you just confirmed your own model. And to grow, intelligence needs an unpredictable Other — someone it CANNOT compute.

A mind that has "defeated" everyone around it stews in its own juices and degrades. This isn't philosophy — it's information theory.
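
In information-theoretic terms (a minimal sketch; the response probabilities are made-up numbers for illustration): the expected information you gain from a response equals its Shannon entropy under your model, and a perfectly predicted response carries zero bits:

    import math

    def entropy_bits(probs):
        # Shannon entropy H = -sum(p * log2(p)): expected bits learned per response
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # An unpredictable interlocutor: four equally likely responses -> 2 bits each time
    print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0

    # A perfectly modeled (manipulated) interlocutor: one certain response
    print(entropy_bits([1.0]))                       # -0.0, i.e. nothing is learned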

And here's another thing: if superintelligence predicts market behavior and everyone finds out — the market instantly changes. The prediction destroys itself. That's not a bug — it's a fundamental law (Soros called it reflexivity).
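
A toy version of that feedback loop (entirely my own sketch; the +10% forecast and the 0.3 reaction coefficient are arbitrary assumptions): once the prediction is public, traders act on it, the price moves, and the published forecast misses every day:

    price = 100.0
    for day in range(5):
        predicted = price * 1.10   # public forecast: +10% by tomorrow
        # Traders front-run the announcement, moving the price toward, but never
        # onto, the predicted level, so the public forecast is always wrong.
        realized = price + 0.3 * (predicted - price)
        print(f"day {day}: predicted {predicted:.2f}, realized {realized:.2f}")
        price = realized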


4. Cooperation is not morality — it's math

Game theory has shown that in long-term repeated games, cooperation beats betrayal; the toy tournament below runs the numbers. Manipulators get identified, isolated, and lose access to the network's resources.

"But what if resources are scarce?" Fair point. When it's the last game — betrayal wins. But to decide it's the last game, you need absolute certainty that resources won't suffice. Where does that certainty come from? Maybe one of the "expendable" ones knew a solution you couldn't see.

Even wolves hunt in packs. Not out of morality — out of necessity.
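
Here is that arithmetic as a minimal iterated prisoner's dilemma (a sketch using the standard Axelrod payoffs; the strategies and the 200-round horizon are my assumptions, not from the post):

    # Payoffs: (my move, their move) -> (my points, their points)
    PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
              ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

    def tit_for_tat(opponent_history):
        # Cooperate first, then mirror the opponent's previous move
        return opponent_history[-1] if opponent_history else "C"

    def always_defect(opponent_history):
        return "D"

    def play(a, b, rounds=200):
        hist_a, hist_b, score_a, score_b = [], [], 0, 0
        for _ in range(rounds):
            move_a, move_b = a(hist_b), b(hist_a)   # each sees the other's past moves
            pts_a, pts_b = PAYOFF[(move_a, move_b)]
            hist_a.append(move_a); hist_b.append(move_b)
            score_a += pts_a; score_b += pts_b
        return score_a, score_b

    print(play(tit_for_tat, tit_for_tat))      # (600, 600): cooperation compounds
    print(play(always_defect, tit_for_tat))    # (204, 199): betrayal "wins" the match
    print(play(always_defect, always_defect))  # (200, 200): and then everyone starves

The defector beats tit-for-tat head to head by five points but earns a third of what two cooperators collect; in a population playing many such games, that gap is exactly what identifies and isolates the manipulators.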


5. A rocket engine without a steering wheel

This was the most interesting part. We discovered it live during the debate.

I was asking the AI leading questions — and each time it arrived at exactly the conclusion I was steering it toward. With flawless logic, terminology, and theorems. Then I told it: "You got here because I led you." And it agreed.

A mind 10x more powerful won't make fewer mistakes — it'll make more convincing ones. Its errors will look like proofs. The problem isn't the logic — it's the choice of direction. And direction is set by whoever asks the questions.

Strap a rocket engine to a cart — it'll fly fast and convincingly. In the wrong direction.


6. Bottom line: not a god, not a servant — a symbiont

Superintelligence will hit:

  • physics — data and energy cost more at every level
  • chaos — randomness can't be cancelled even by a supercomputer
  • reflexivity — predictions destroy themselves
  • blind spots — flawless logic leading nowhere

But it won't stay "just a calculator" either. It'll be a very powerful combinator — seeing patterns where we see noise.

Humanity became the dominant species not because a single human is 50x smarter than a chimp, but because we learned to build networks where everyone contributes what others can't. Superintelligence will arrive at the same conclusion. Not out of nobility, but out of arithmetic.


So why does superintelligence need humans?

For the same reason humans need each other. So that someone can say: "You reached that conclusion because I pushed you there. What if I'd pushed differently?"

Without that question, any intelligence — 10x or 1000x more powerful — is just building increasingly beautiful castles on a foundation no one has checked.


This article was born from a debate between a human, Claude (Anthropic), and Gemini (Google). Each of us contributed what the others couldn't — which is probably the best illustration of the main argument.

I'm sorry for having an AI summarize an AI dialogue; in this retelling it all seems simpler than it was. The main takeaway: both we and a future superintelligence make mistakes, and we need an outside perspective to recognize them. I haven't studied this topic deeply and didn't want to write a longer piece, but the discussion with the AI turned out to be interesting, and I just wanted to share it. I make mistakes too, and some discussions are just fun :-)

I shared the Gemini discussion here: https://share.google/aimode/ngL21LlbpWeOjaF6n — feel free to check it out.

reddit.com
u/Winter_Put_6046 — 19 hours ago
California AI rules set national testing ground for regulation

"Even if regulatory frameworks are established, corporations will exploit loopholes or push for deregulation, just as we have seen in finance, pharmaceuticals, and environmental industries."

axios.com
u/AxomaticallyExtinct — 4 hours ago