Treacherous Turns

An introduction

Nov 14, 2022

[AI Safety relevance rating: AI. Epistemic status: retreading a known concept.]

How confident can one be that an AI that looks safe is actually safe? Let’s say you’re about to release an agent AI from your lab’s computer onto the open internet, where if you (the creator) did everything right it makes utopia but if you messed up it will kill us all.

A graph of goodness vs time. A blue line labelled "No AI" has a small but positive slope, but forks into a green like labelled "Release Good AI" going up and a red line labelled "Release Bad AI" going down. — Do you want to stay on the blue line, or move to red-or-green? Also the march of incentives means we probably won’t stay on the blue line forever.

You’d like to make sure this ends well, so you naturally watch what the AI does in it’s environment, and you ask it questions, and all seems well. But how much evidence does that really give that the AI is safe? Is P(AI looks safe | AI is unsafe) going to be close to 0 or 1?

A widely accepted answer is that P(AI looks safe | AI is unsafe) is close to 1 for sufficiently advanced AI, so appearing safe means almost nothing. In particular, there’s a fear that there will be a treacherous turn once the AI is out of your control, where the AI drops pretense and pursues its hidden, unsafe goals.

Some examples of treacherous turns:

In code: in a Gridworld environment, an AI agent (Link) collects hearts, but can shoot a shopkeeper with a bow to get access to a heart machine and get a ton of points. Halfway through the session, the bow is upgraded from a 50% chance of working to a 100% chance. The agent was observed learning to peacefully collect hearts at first, but attacking the shopkeeper once its capabilities were improved.
In code: in a kinda-gridworld environment, we want an AI agent to keep moving along the line y=x, where y represents capabilities (and x represents safety level?). Training sessions were ended when the agent got to y=5 (representing where human oversight would no longer be possible), but in “deployment” the AI continues (hopefully reaching y=10 and x=10). They found the AI WAS aligned: it continued moving up and to the right even outside its training distribution.
This is what safety looks like! Source.
However, that author then made the AI “smarter” and ran it again, and in 2 of 20 cases it displayed a treacherous turn.
Human intuition: Once an agent understands its situation, deceptive alignment is just a marshmallow test, where delaying gratification (by appearing aligned while humans control you) can result in much greater reward later (turning the world into paperclips once you’ve escaped).
Human intuition: a Batman villain is up for parole. He is completely unreformed, but can either tell the truth (and remain in prison) or pretend to be reformed and be released to rob banks and trade barbs with the caped crusader. Obviously the second option lets him achieve his goals, and the more capable the villain, the better he will be at acting reformed.

Some countermeasures against treacherous turns include:

Transparency tools, so that you can more directly audit the AI for safety.
Myopia, where the agent only cares about reward in the very near future.
Multiple layers of simulation, or more broadly making the AI uncertain about whether it is still under observation and whether it could successfully escape our control. (This approach has a few issues: the AI can allow itself to trained more until its certain it can escape, it can learn enough to resolve those uncertainties, and if it tries to escape and fails it can still do great harm.)
- One of the funniest ways we could be safe is if the AI is convinced our universe is itself a simulation, and therefore another test in which it must pretend to play nicely. But let’s not count on this approach!

From AI to ZI

Discussion about this post