I think the axis you’re missing is “works when humans can verify the output --> works when we can’t,” unless you meant it as part of your first axis. It’s easy to imagine a corrigible and very creative system being asked to do some complex task, and that task involving very illegal and dangerous steps along the way that are hard to see. Constitutional AI might avoid this (hard to prove though!) but RLHF seems just unavoidably non-scalable, and to OpenAI’s credit they seem to be aware of it.

I don't understand how an AI could be behaviorally safe in terms of:

(A) Breaking the law. Law is a moving target with fuzzy boundaries that need to be determined by precedents. And of course we have to ask "whose law": this or the neighboring nation, state, locality? Will the AI be able to determine which laws have been abolished, supplanted, etc? And what about conflicting interpretations of the law? And Asimovian loopholes?

(B) Promoting extremism or hate. How could you automate such a test? What is the subjective standpoint from which such a judgement could be made? What about conflicting subjective standpoints?
