Thanks for the comment. It definitely pointed out some things that weren’t clear in my post and head. Some comments:
1. I think your section on psychologizing is fairly accurate. I previously didn’t spend a lot of time thinking about how my research would reduce the risks I care about and my theories of change were pretty vague. I plan to change that now.
2. I am aware of other failure modes such as fast takeoffs, capability gains during deployment, getting what we measure, etc. However, I feel like all of these scenarios get much worse/harder to handle when deception is at play, e.g. fast takeoffs are worse when they go unnoticed, and getting what we measure likely leads to worse outcomes if it is hidden. I would really think of these failure modes as orthogonal to deception, e.g. getting what we measure could happen in a deceptive or non-deceptive way. But I’m not sure this is the correct framing.
3. It is correct that my definition of deception is inconsistent throughout the article. Thanks for pointing this out. I think it is somewhere between “It’s bad if something happens in powerful AIs that we don’t understand” and “It’s bad if there is an active adversary trying to deceive us”. I’ll need to think about this for longer.
4. Unknown unknowns are a problem. I think my claim as presented in the post is stronger than I originally intended. However, I think the usefulness of foundational research such as yours comes to a large extent from the fact that it increases our understanding of AI systems in general, which then allows us to prevent failure modes (many of which relate to deception).
I’ll try to update the post to reflect some of the discussion and my uncertainty better. Thanks for the feedback.