First, it seems worthwhile to try tabooing the word ‘deception’ and see whether the process of building enough precision to re-define it clears up some of the confusion. In particular, it seems like there’s some implicit theory-of-mind stuff going on in the post and in some of the comments. I’m interested in whether you think the concept of ‘deception’ in this post only holds when there is implicit theory of mind going on, or otherwise.
As a thought experiment for a non-theory-of-mind example, let’s say the daemon doesn’t really understand why it gets a high reward for projecting the image of a teapot (and then somehow also doing some tactile projection later), but it thinks this is a good way to get high reward / meet its goals / etc. It doesn’t realize (or at least doesn’t “know” in any obviously accessible way) that there is another agent/observer who is updating their models as a result. Possibly, if it did know this, it would not project the teapot, because its reward function has another component: “don’t project false stimuli to other observing agents”.
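To make that concrete, here’s a minimal sketch of what I have in mind (all names and numbers here are hypothetical, purely for illustration, not from the post): the “don’t project false stimuli” penalty only fires if the daemon’s own world-model actually contains an observer, so a daemon that doesn’t model the observer at all never feels it.

```python
# Hypothetical sketch of the thought experiment's reward structure.
# The honesty penalty only applies when the daemon's model of the world
# includes someone who is watching.

from dataclasses import dataclass, field

@dataclass
class WorldModel:
    """What the daemon believes about its environment."""
    known_observers: list = field(default_factory=list)

@dataclass
class Projection:
    """A stimulus the daemon emits, e.g. the image of a teapot."""
    description: str
    is_false: bool  # does it misrepresent the actual environment?

def reward(projection: Projection, model: WorldModel,
           base_reward: float = 1.0, honesty_penalty: float = 10.0) -> float:
    """Base reward for projecting, minus a penalty for projecting false
    stimuli to agents the daemon knows are observing."""
    r = base_reward
    if projection.is_false and model.known_observers:
        # Penalty term fires only if an observer appears in the daemon's model.
        r -= honesty_penalty
    return r

teapot = Projection(description="image of a teapot", is_false=True)

# Daemon that doesn't realize anyone is behind the screen:
naive_model = WorldModel(known_observers=[])
print(reward(teapot, naive_model))   # 1.0 -> happily projects the teapot

# Same daemon, but now its model includes the hidden observer:
aware_model = WorldModel(known_observers=["hidden observer"])
print(reward(teapot, aware_model))   # -9.0 -> would not project
```

The point is just that the “honesty” component never activates if the observer isn’t represented in the daemon’s model, which is exactly the situation in the thought experiment.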
In this thought experiment, is the daemon ‘deceiving’ the observer? More generally, is it possible to deceive someone you don’t realize is there? (Perhaps they’re hiding behind a screen, and from your perspective it just looks like you’re projecting a teapot to an empty audience.)
Aside: I think there are some interesting alignment problems here, but a lot of our language around the concept of deception hasn’t updated to a world where we’re talking about AI agents.
I think there’s also some theory of mind being used by us, the outside observers, when we label some events “deception.”
We’re finding it easy to predict what the “evil demon” will do by using a simple model that treats it as an agent trying to affect the beliefs of another agent. If the stuff in the environment didn’t reward the intentional stance like this, we wouldn’t see these things as agents, let alone see them as engaging in deception.