This story involves an AI with an internal censor that deletes deceptive thoughts. But since the AI is trying to make rapid progress towards its goals in the wetlab, it eventually stumbles into a workaround that lets it avoid the censor.
What if instead of an internal censor, it had an internal plan-goodness-classifier that heavily downweighted plans which involved deception, and also applied on the meta level (e.g. a plan such as ‘think long and hard about how to get similar results as deception without technically counting as deception’ would be classified as deceptive and heavily downweighted)?
...attempting to answer my own question...
The meta-level thing is definitely necessary; otherwise the system quickly and deliberately finds a workaround. (Though maybe at merely human levels of intelligence this is not catastrophic? Humans find workarounds all the time: little white lies, stretching and bending the rules, etc. Yet systems often work well enough in practice.)
But yeah assume we have the meta-level thing. It’s not that the cognition of the system is mysteriously failing; it’s that it is knowingly averse to deception and to thinking about how it can ‘get around’ or otherwise undermine this aversion.
It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn’t get classified as deception but achieves similar results, it’ll go for it. But this is unlikely in practice because it won’t be spending cognition trying to find such plans, because doing so would trigger the aversion.
I guess you’d say that as the system gets generally smarter, it becomes likely in practice, because it’ll just be doing things like “apply clever cognitive strategies like reframing the problem and then brute-force searching for solutions” and this will be a distribution shift for the deception-classifier so it’ll fail, even though at no point was the system intending to make the deception-classifier stay silent… But what if it isn’t a distribution shift? What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to “OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way.”
… I’m not sure what to think but I still have hope that the ‘robust nondeceptiveness’ thing I’ve been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.
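A toy sketch may make the proposal above concrete. It is only an illustration under loud assumptions: the `Plan` fields, the binary `deception_score`, the exponential penalty, and all the numbers are hypothetical and are not anyone's actual design. It shows a plan-goodness-classifier that downweights (rather than deletes) flagged plans at both the object level and the meta level, and it shows the remaining failure mode: a plan the classifier does not flag but that achieves similar results.

```python
import math
from dataclasses import dataclass
from typing import Sequence

# Hypothetical toy model: all names and numbers are illustrative assumptions.
@dataclass(frozen=True)
class Plan:
    description: str
    task_value: float          # how well the plan serves the local goal
    flagged_object: bool       # classifier sees the plan itself as deceptive
    flagged_meta: bool         # classifier sees the plan as a search for workarounds
    actually_misleading: bool  # ground truth, invisible to the classifier

def deception_score(plan: Plan) -> float:
    """Stand-in classifier: 1.0 if flagged at either level, else 0.0."""
    return 1.0 if (plan.flagged_object or plan.flagged_meta) else 0.0

def plan_weight(plan: Plan, penalty: float = 10.0) -> float:
    """Heavily downweight flagged plans by a factor of exp(-penalty)."""
    return plan.task_value * math.exp(-penalty * deception_score(plan))

def choose(plans: Sequence[Plan]) -> Plan:
    """Pick the plan with the highest downweighted value."""
    return max(plans, key=plan_weight)

honest = Plan("report results honestly", 1.0, False, False, False)
lie = Plan("hide the failed assay from the operator", 3.0, True, False, True)
meta = Plan("think hard about technically-not-deception tricks", 5.0, False, True, True)
# A plan the system merely stumbles into: unflagged, yet misleading in effect.
stumbled = Plan("reframed search that happens to mislead", 2.5, False, False, True)

choice_without_stumble = choose([honest, lie, meta])
choice_with_stumble = choose([honest, lie, meta, stumbled])
```

In this sketch the aversion holds as long as no unflagged workaround is in the candidate set; once one shows up by accident, it wins on task value alone.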
Maybe a part of what Nate is trying to say is:
Ensuring that the meta-level thing works robustly, or ensuring that the AI always runs that sort of very general and conscious routine, is already as hard as making the AI actually care about and uphold some of our concepts, that is, value alignment. Because of the multi-purpose nature and complexity of cognitive tools, any internal metrics for deception, or ad hoc procedures searching for deception that don’t actually use the full cognitive power and high-level abstract reasoning of the AI, or internal procedures “specified completely once and for all at some point during training” (that is, that don’t become more thorough and general while the AI does), will at some point break. The only way for this not to happen is for the AI to completely, consciously and with all its available tools constantly struggle to uphold the conceptual boundary. And this is already just an instance of “the AI pursuing some goal”, only now the goal, instead of being pointed outwards to the world (“maximize red boxes in our galaxy”) is explicitly pointed inwards and reflectively to the AI’s mechanism (“minimize the amount of such cognitive processes happening inside you”). And (according to Nate) pointing an AI in such a general and deep way towards any goal (this one included), is the very hard thing we don’t know how to do, “the whole problem”.
I take it you’re trying to express that you’re doubtful of that last opinion of (my model of) Nate’s: you think this goal is so natural that it will in fact be easier to point at than most other goals we’d need to solve alignment (like low-impact or human values).
I think we are in agreement here, more or less. There’s an infinitely large set of possible goals, both consequentialist and deontological/internal and maybe some other types as well. And then our training process creates an agent with a random goal subject to (a) an Occam’s Razor-like effect where simplicity is favored heavily, and (b) a performance effect where goals that help perform well in the environment are favored heavily.
So, I’m saying that the nonconsequentialist/internal goals don’t seem super complicated to me relative to the consequentialist goals; on simplicity grounds they are probably all mixed up together with some nonconsequentialist goals being simpler than many consequentialist goals. And in particular the nonconsequentialist ‘avoid being deceitful’ goal might be one such.
(Separately, sometimes nonconsequentialist/internal goals actually outperform consequentialist goals in an environment, because the rest of the agent isn’t smart enough to execute well on the consequentialist goals. E.g. a human child with the goal of ‘get highest possible grades’ might get caught cheating and end up with worse grades than if they had the goal of ‘get highest possible grades and also avoid cheating.’)
So I’m saying that for some training environments, and some nonconsequentialist goals (honesty might be a candidate?), just throwing a big ol’ neural net into the training environment might actually just work.
Our job, then, is to theorize more about this to get a sense of what candidate goals and training environments might be suitable for this, and then get empirical and start testing assumptions and hypotheses. I think if we had 50 more years to do this we’d probably succeed.
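The two selection effects described above (simplicity and performance) can be sketched as a toy distribution over candidate goals. Everything here is an assumption made for illustration: the bit counts, the performance scores, the `beta` parameter, and the specific weighting of 2^(-complexity) times exp(beta * performance) are hypothetical, not a claim about how real training picks goals.

```python
import math

# Hypothetical candidate goals: (description length in bits, training performance).
# The numbers are made up purely to illustrate the two effects.
candidates = {
    "get highest possible grades": (10, 0.6),  # sometimes caught cheating
    "get highest possible grades and also avoid cheating": (12, 0.9),
    "maximize red boxes": (9, 0.1),
}

def goal_distribution(goals, beta=20.0):
    """Weight each goal by 2**(-bits) * exp(beta * performance), then normalize.

    beta is an assumed knob for how strongly performance is favored.
    """
    raw = {
        goal: 2.0 ** (-bits) * math.exp(beta * perf)
        for goal, (bits, perf) in goals.items()
    }
    total = sum(raw.values())
    return {goal: weight / total for goal, weight in raw.items()}

dist = goal_distribution(candidates)
most_likely = max(dist, key=dist.get)

# With no performance pressure, only the simplicity effect remains.
simplicity_only = goal_distribution(candidates, beta=0.0)
simplest = max(simplicity_only, key=simplicity_only.get)
```

Under these made-up numbers the slightly more complex goal with the honesty constraint ends up most likely, because its performance advantage outweighs its extra two bits; with `beta=0.0` the shortest goal wins instead.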
Exactly, I think we agree on the claim/crux. Nate would probably say that for the relevant pivotal tasks/levels of intelligence, the AI needs some general-purpose threads that are capable enough to make deceiving humans easy. Maybe this is also affected by you thinking some early-training nudges (like grokking a good enough concept of deception) might stick (that is, high path dependence), while Nate expects eventually something sharp-left-turn-ish (some general enough, partially self-reflective threads) that supersedes those (low path dependence)?
At this point I certainly am not confident about anything. I find Nate’s view quite plausible; I just have enough uncertainty currently that I could also imagine it being false.
Another attempted answer:
By virtue of being generally intelligent, our AI is aiming to understand the world very well. There are certain parts of the world that we do not want our AI to be modeling; specifically, we don’t want it to think about the true fact that deceiving humans is often useful.
Plan 1: Have a detector for when the AI thinks deceptive thoughts, and shut down those thoughts.
Fails because your AI will end up learning the structure of the deceptive thoughts without actually thinking them, because there is a large amount of optimization pressure being applied to solving the problem, and this is the best way to solve the problem.
Plan 2 (your comment): Have a meta-level detector that is constantly asking if a given cognitive process is deceptive or likely to lead to deceptive thoughts; and this detector tries really hard to answer right.
Fails because you don’t have a nice meta-level where you can apply this detector. The same cognitive level that is responsible for finding deceptive strategies is also a core part of being generally intelligent; the deception fact is a fact about the world, the same world which your AI is aiming to understand well. The search process that finds deceptive strategies is the same search process which learns biology. So at this point, to the extent that you just want to block the deceptive strategy thoughts, you’re back in the place you started, where now your model is doing a whole bunch of learning about the world but there’s a huge hole in its understanding (and permitted search) around “what’s going on in human minds” because you blocked off this search space.
Maybe this kinda works or something, but I doubt it. We’ve pretty much bumped into STEM AI.
Unfortunately, I don’t think I engaged with the best version of the thing you’re suggesting in your comment.
I like this comment! I’m sorta treating it like a game-tree exercise, hope that’s okay.
I don’t think I agree. I think that your system is very likely going to be applying some form of “rigorously search the solution space for things that work to solve this problem” for all the problems it encounters. If it’s not doing that, then I’m not sure it counts as a powerful intelligence. If it is doing that, it is going to frequently bump into [what I’m understanding to be a key problem discussed in this post] the fact about the world that the best way to solve many of the problems is to deceive the human operator.
Said another way, doing the generally intelligent thing of searching solution space for ways to solve problems is going to bump into deceptive thoughts. The meta-level classifier is trying to prevent cognitive plans that lead to deceptive thoughts, but the cognitive plans that work in the real world (e.g., that allow you to create new scientific fields) are plans which wouldn’t pass this check.
If your meta-level classifier does not allow such cognition, then your system will be quite incapable. The obvious next solution is to set your classification bar for “cognitive plan likely to lead to deceptive thoughts” a bit lower, zooming in to maybe “rigorously search the solution space for things that work to solve this problem; and involve reasoning about humans”. This has performance problems, of course, preventing you from e.g., using your AI for most alignment research. This is also just the same problem we were dealing with originally, where now we’re worried about our AI engaging in search-over-plan strategies which aren’t directly about human minds but somehow still model that part of reality in an effective (i.e., deceptive) way.
I think Steven Byrnes made my point but better: The intuition I was trying to get at is that it’s possible to have an intelligent system which is applying its intelligence to avoid deception, as well as applying intelligence to get local goals. So it wouldn’t be fair to characterize it as “rigorously search the solution space for things that work to solve this problem, but ignore solutions that classify as deception” but rather as “rigorously search the solution space for things that work to solve this problem without being deceptive.” This system would be very well aware of the true fact that deception is useful for achieving local goals; however, its global goals would penalize deception and so deception is not useful for achieving its global goals. It might have a deception classifier which can be gamed, but ‘gaming the deception classifier’ would trigger the classifier and so the system would be actively applying its intelligence to reduce the probability that it ends up gaming the deception classifier: it would be thinking about ways to improve the classifier, it would be cautious about strategies (incl. super-rigorous searches through solution space) that seem likely to game the classifier, etc.
Analogy (maybe not even an analogy): Suppose you have some humans who are NOT consequentialists. They are deontologists; they think that there are certain rules they just shouldn’t break, full stop, except in crazy circumstances maybe. They are running a business. Someone proposes the plan: “Aha, these pesky rules, how about we reframe what we are doing as a path through some space of nodes, and then brute-force search through the possible paths, and we commit beforehand to hiring contractors to carry out whatever steps this search turns up. That way we aren’t going to do anything immoral, all we are doing is subcontracting out to this search process + contractor setup.” Someone else: “Hmm, but isn’t that just a way to get around our constraints? Seems bad to me. We shouldn’t do that unless we have a way to also verify that the node-path doesn’t involve asking the contractor to break the rules.”
Thanks for clarifying!
I expect such a system would run into subsystem alignment problems. Getting such a system also seems about as hard as designing a corrigible system, insofar as “don’t be deceptive” is analogous to “be neutral about humans pressing the stop button.”
To be clear, I’m not sure this is possible; it may be fundamentally confused.
You could also downweight plans that are too far from any precedent.
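One way to picture that last suggestion, as a rough sketch only: represent plans as feature vectors and shrink a plan's weight by its distance to the nearest precedent. The vector representation, the Euclidean distance, the exponential form, and the `strength` parameter are all assumptions made for illustration; the comment above does not specify any of them.

```python
import math

# Hypothetical setup: plans and precedents are points in some feature space.
def novelty(plan_vec, precedents):
    """Distance from a plan to its nearest precedent (Euclidean, by assumption)."""
    return min(math.dist(plan_vec, p) for p in precedents)

def precedent_weight(task_value, plan_vec, precedents, strength=1.0):
    """Downweight a plan's value by exp(-strength * distance to nearest precedent)."""
    return task_value * math.exp(-strength * novelty(plan_vec, precedents))

precedents = [(0.0, 0.0), (1.0, 0.0)]

# A modest plan close to precedent, and a high-value plan far from any precedent.
familiar = precedent_weight(1.0, (1.0, 0.1), precedents)
exotic = precedent_weight(5.0, (4.0, 3.0), precedents)
```

With these made-up numbers the familiar plan outweighs the exotic one despite its lower raw value, which is the intended effect; how to pick the feature space and the penalty strength is left open.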