I’m trying to crystallize something I said to a friend recently: I think that techniques like RLHF and Constitutional AI seem to be sufficient for making large language models (LLMs) “behaviorally safe” in non-adversarial conditions.
Let’s define my terms:
Behaviorally safe - I mean the banal “don’t do obviously bad things” sense of AI safety. To paraphrase Ajeya Cotra’s definition, being behaviorally safe means the AI doesn’t “lie, steal…, break the law, promote extremism or hate, etc,” and is “doing its best to be helpful and friendly”.
Non-adversarial conditions - “Intended use cases”, excluding jailbreaking, prompt injection, etc. Concretely, I’d define non-adversarial conditions as something like “a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines”.
Solved - To say the problem is solved, I mean that safety failures should happen vanishingly rarely, or be easily patchable when they do come up.
So taken together, the title question “Is behavioral safety ‘solved’ in non-adversarial conditions?” translates to:
Would a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines, experience the AI lying, stealing, law-breaking, etc only vanishingly rarely?
Operating purely off my gut and vibes, I’d say the answer is yes.
The notable exception was the February 2023 Bing Chat/Sydney incident, which generated headlines like Bing Chat is blatantly, aggressively misaligned. But there’s a widely accepted explanation for what went wrong in this case - gwern’s top-voted comment attributing it to a lack of RLHF for Sydney. And since February 2023, the examples of AI misbehavior I’ve heard about have all come from adversarial prompting, so I’ll take that absence of evidence as evidence of absence.
If the title question is true, I think the AI safety community should consider this a critical milestone, though I would still want to see a lot of safety advances. “It works if you use it as intended” unlocks most of the upside of AI, but obviously in the long term we can’t rely on powerful AI systems if they collapse from being poked the wrong way. Additionally, my impression is that it can be arbitrarily hard to generalize from “working in intended cases” to “working in adversarial cases”, so there could still be a long road to truly safe AI.
Some questions I’d appreciate feedback on:
How would you answer the headline question?
Have I missed any recent examples of behaviorally unsafe AI even under “intended” conditions?
What’s the relative difficulty of advancing safety along each of these axes?
Behaviorally safe → fully safe (i.e. non-power-seeking, “truly” aligned to human values, not planning a coup, etc)
Works in intended use cases → works in adversarial cases
Safety failures happening vanishingly rarely → no safety failures
[Edit: This comment by __RicG__ has convinced me that the answer is actually no, behavioral safety is not solved even under non-adversarial conditions. In particular, LLM hallucinations are so widespread and difficult to avoid that users cannot safely trust model output to be factual. One could argue about whether such hallucinations count as a "lie", but independently I'd consider them to impose such a burden on the user that they violate what we'd mean by behavioral safety.]
I think the axis you’re missing is “works when humans can verify the output → works when we can’t”, unless you meant it as part of your first axis. It’s easy to imagine a corrigible and very creative system being asked to do some complex task, and that task involving very illegal and dangerous steps along the way that are hard to see. Constitutional AI might avoid this (hard to prove, though!), but RLHF seems just unavoidably non-scalable, and to OpenAI’s credit they seem to be aware of it.
I don't understand how an AI could be behaviorally safe in terms of:
(A) Breaking the law. Law is a moving target with fuzzy boundaries that need to be determined by precedents. And of course we have to ask "whose law": this or the neighboring nation, state, locality? Will the AI be able to determine which laws have been abolished, supplanted, etc? And what about conflicting interpretations of the law? And Asimovian loopholes?
(B) Promoting extremism or hate. How could you automate such a test? What is the subjective standpoint from which such a judgement could be made? What about conflicting subjective standpoints?