3 Comments

Also, in the Zelda example in the post, if we interpret the actions in the way you say (i.e. killing the shopkeeper is evil) then the AI is actually doing the opposite of betraying us when it makes its "treacherous" turn, right? Because around episode 3,000 it suddenly starts using the benevolent strategy, in addition to the violent one which it was using all along. (I realize this doesn't matter for the overall argument.)

Nice post! Do you think this is a correct interpretation? "Qualitatively different strategies emerge at different levels of capability. Just because the strategies that emerge at lower capability levels don't disempower us doesn't mean the ones that emerge at higher capability levels won't."

I think that's a correct and important insight about AI, just not the one I was trying to make in this post! I'd summarize this post as "the strategies that emerge are independent of our moral interpretation of the situation, unless we encode our morality in the reward function".
