By virtue of being generally intelligent, our AI is aiming to understand the world very well. There are certain parts of the world that we do not want our AI to model; specifically, we don’t want it to think the true fact that deceiving humans is often useful.
Plan 1: Have a detector for when the AI thinks deceptive thoughts, and shut down those thoughts.
Fails because your AI will end up learning the structure of deceptive thoughts without actually thinking them: a large amount of optimization pressure is being applied to solving the problem, and deception is often the best way to solve it.
Plan 2 (your comment): Have a meta-level detector that constantly asks whether a given cognitive process is deceptive, or likely to lead to deceptive thoughts, and that tries really hard to answer correctly.
Fails because you don’t have a clean meta-level at which to apply this detector. The same cognitive level that is responsible for finding deceptive strategies is also a core part of being generally intelligent; the deception fact is a fact about the world, the same world your AI is aiming to understand well. The search process that finds deceptive strategies is the same search process that learns biology. So, to the extent that you just block the deceptive-strategy thoughts, you’re back where you started: your model is doing a whole bunch of learning about the world, but there’s a huge hole in its understanding (and permitted search) around “what’s going on in human minds,” because you blocked off that search space.
Maybe this kinda works, but I doubt it. At this point we’ve pretty much bumped into STEM AI.
Unfortunately, I don’t think I engaged with the best version of the thing you’re suggesting in your comment.