I like this comment! I’m sorta treating it like a game-tree exercise, hope that’s okay.
It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn’t get classified as deception but achieves similar results, it’ll go for it. But this is unlikely in practice because it won’t be spending cognition trying to find such plans, because doing so would trigger the aversion.
I don’t think I agree. I think that your system is very likely going to be applying some form of “rigorously search the solution space for things that work to solve this problem” for all the problems it encounters. If it’s not doing that, then I’m not sure it counts as a powerful intelligence. If it is doing that, it is going to frequently bump into [what I’m understanding to be a key problem discussed in this post] the fact about the world that the best way to solve many of the problems is to deceive the human operator.
Said another way, doing the generally intelligent thing of searching solution space for ways to solve problems is going to bump into deceptive thoughts. The meta-level classifier is trying to prevent cognitive plans that lead to deceptive thoughts, but the cognitive plans that work in the real world (e.g., that allow you to create new scientific fields) are plans which wouldn’t pass this check.
If your meta-level classifier does not allow such cognition, then your system will be quite incapable. The obvious next solution is to set your classification bar for “cognitive plan likely to lead to deceptive thoughts” a bit lower, zooming in to maybe “rigorously search the solution space for things that work to solve this problem; and involve reasoning about humans”. This has performance problems, of course, preventing you from e.g., using your AI for most alignment research. This is also just the same problem we were dealing with originally, where now we’re worried about our AI engaging in search-over-plan strategies which aren’t directly about human minds but somehow still model that part of reality in an effective (i.e., deceptive) way.
I think Steven Byrnes made my point but better: The intuition I was trying to get at is that it’s possible to have an intelligent system which is applying its intelligence to avoid deception, as well as applying intelligence to achieve local goals. So it wouldn’t be fair to characterize it as “rigorously search the solution space for things that work to solve this problem, but ignore solutions that classify as deception” but rather as “rigorously search the solution space for things that work to solve this problem without being deceptive.” This system would be very well aware of the true fact that deception is useful for achieving local goals; however, its global goals would penalize deception, and so deception is not useful for achieving its global goals. It might have a deception classifier which can be gamed, but ‘gaming the deception classifier’ would trigger the classifier, and so the system would be actively applying its intelligence to reduce the probability that it ends up gaming the deception classifier: it would be thinking about ways to improve the classifier, it would be cautious about strategies (incl. super-rigorous searches through solution space) that seem likely to game the classifier, etc.
Analogy (maybe not even an analogy): Suppose you have some humans who are NOT consequentialists. They are deontologists; they think that there are certain rules they just shouldn’t break, full stop, except in crazy circumstances maybe. They are running a business. Someone proposes the plan: “Aha, these pesky rules, how about we reframe what we are doing as a path through some space of nodes, and then brute-force search through the possible paths, and we commit beforehand to hiring contractors to carry out whatever steps this search turns up. That way we aren’t going to do anything immoral, all we are doing is subcontracting out to this search process + contractor setup.” Someone else: “Hmm, but isn’t that just a way to get around our constraints? Seems bad to me. We shouldn’t do that unless we have a way to also verify that the node-path doesn’t involve asking the contractor to break the rules.”
Thanks for clarifying!
I expect such a system would run into subsystem alignment problems. Getting such a system also seems about as hard as designing a corrigible system, insofar as “don’t be deceptive” is analogous to “be neutral about humans pressing stop button.”
To be clear, I’m not sure this is possible; it may be fundamentally confused.