I’m trying to crystallize something I said to a friend recently: techniques like RLHF and Constitutional AI seem sufficient for making large language models (LLMs) “behaviorally safe” in non-adversarial conditions.
I think the axis you’re missing is “works when humans can verify the output --> works when we can’t,” unless you meant it as part of your first axis. It’s easy to imagine a corrigible and very creative system being asked to do some complex task, and that task involving very illegal and dangerous steps along the way that are hard to see. Constitutional AI might avoid this (hard to prove though!) but RLHF seems just unavoidably non-scalable, and to OpenAI’s credit they seem to be aware of it.
Good point! I'd make that a fourth axis: "how long can the AI work independently on a task without drifting into unsafe behavior". To the extent behavior safety is solved now, it's only under close human feedback, so we have relatively little evidence about more autonomous AI.
I don't understand how an AI could be behaviorally safe in terms of:
(A) Breaking the law. Law is a moving target with fuzzy boundaries that need to be determined by precedents. And of course we have to ask "whose law": this nation's, or the neighboring nation's, state's, or locality's? Will the AI be able to determine which laws have been abolished, supplanted, etc.? And what about conflicting interpretations of the law? And Asimovian loopholes?
(B) Promoting extremism or hate. How could you automate such a test? What is the subjective standpoint from which such a judgement could be made? What about conflicting subjective standpoints?
Part of what I'm trying to get at here is the common-sense meaning of "it doesn't break the law" and "it doesn't spread extremism or hate". For instance, I just asked ChatGPT how to hotwire a car, and it refused to help with that. I don't mean "full compliance with every law everywhere", but simply "avoid recommending or aiding obviously criminal behavior". Resolving all the subtleties of existing laws (and deciding whether to defy unjust laws) is another area where AIs might advance in the future.
And there are times when it is perfectly legal to hotwire a car: for example, your own car, or a mechanic hotwiring one during repairs. Those are common sense too.
You can't just sweep things under the rug of "common sense". "Common sense is the collection of prejudices acquired by age eighteen." Your prejudices might differ from mine based on your politics, religion, sex, etc. So whose common sense are you talking about?