But yeah, assume we have the meta-level thing. It’s not that the cognition of the system is mysteriously failing; it’s that it is knowingly averse to deception and to thinking about how it can ‘get around’ or otherwise undermine this aversion.
[...]
What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to “OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way.”
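(To make the kind of routine being described here concrete, here’s a minimal sketch, purely my own illustration and not anything from the discussion itself: re-express a candidate plan under several framings and veto it if any framing reads as deceptive. The helpers `reframe` and `looks_deceptive` are hypothetical stand-ins for whatever internal machinery would actually do this.)

```python
from typing import Callable, Iterable

def robust_deception_check(
    plan: str,
    reframe: Callable[[str], Iterable[str]],
    looks_deceptive: Callable[[str], bool],
) -> bool:
    """Pass a plan only if no framing of it looks deceptive."""
    for framing in [plan, *reframe(plan)]:
        if looks_deceptive(framing):
            return False  # veto: some angle on this plan reads as deception
    return True

# Toy stubs for the hypothetical helpers, just to make the sketch runnable:
def reframe(plan: str) -> list[str]:
    return [f"as the user would see it: {plan}",
            f"as described to an auditor: {plan}"]

def looks_deceptive(framing: str) -> bool:
    return "withhold" in framing or "mislead" in framing

print(robust_deception_check("report the failure and explain it", reframe, looks_deceptive))  # True
print(robust_deception_check("withhold the failure report", reframe, looks_deceptive))        # False
```

(The worry in the next reply is precisely that any fixed version of `reframe` / `looks_deceptive` eventually breaks, unless the checking grows along with the AI’s own reasoning.)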
Maybe a part of what Nate is trying to say is:
Ensuring that the meta-level thing works robustly, or ensuring that the AI always runs that sort of very general and conscious routine, is already as hard as making the AI actually care about and uphold some of our concepts, that is, value alignment. Because of the multi-purpose nature and complexity of cognitive tools, any internal metrics for deception, or ad hoc procedures searching for deception that don’t actually use the full cognitive power and high-level abstract reasoning of the AI, or internal procedures “specified completely once and for all at some point during training” (that is, that don’t become more thorough and general as the AI does), will at some point break. The only way for this not to happen is for the AI to constantly, consciously, and with all its available tools struggle to uphold the conceptual boundary. And this is already just an instance of “the AI pursuing some goal”, only now the goal, instead of being pointed outwards at the world (“maximize red boxes in our galaxy”), is explicitly pointed inwards and reflectively at the AI’s own mechanism (“minimize the amount of such cognitive processes happening inside you”). And (according to Nate) pointing an AI in such a general and deep way towards any goal (this one included) is the very hard thing we don’t know how to do, “the whole problem”.
I still have hope that the ‘robust non-deceptiveness’ thing I’ve been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.
I take it you’re trying to express that you’re doubtful of that last opinion of (my model of) Nate’s: you think this goal is so natural that it will in fact be easier to point at than most other goals we’d need to solve alignment (like low impact or human values).
I think we are in agreement here, more or less. There’s an infinitely large set of possible goals, both consequentialist and deontological/internal and maybe some other types as well. And then our training process creates an agent with a random goal subject to (a) an Occam’s Razor-like effect where simplicity is favored heavily, and (b) a performance effect where goals that help perform well in the environment are favored heavily.
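(A minimal sketch of how I’m picturing those two pressures interacting, with entirely made-up goals, complexities, and performance scores, just to make the selection story concrete:)

```python
import math
import random

# Invented goals with invented complexity and performance numbers, purely to
# illustrate the two selection pressures described above: an Occam-like
# simplicity prior plus in-environment performance.
goals = {
    "maximize reward however possible":   {"complexity": 3.0, "performance": 9.0},
    "maximize reward, but never deceive": {"complexity": 4.0, "performance": 8.5},
    "imitate the training corpus":        {"complexity": 6.0, "performance": 7.0},
}

LAMBDA = 1.0  # weight on simplicity (assumed)
BETA = 1.0    # weight on performance (assumed)

def weight(g: dict) -> float:
    # Favor simple goals and high-performing goals.
    return math.exp(-LAMBDA * g["complexity"] + BETA * g["performance"])

total = sum(weight(g) for g in goals.values())
for name, g in goals.items():
    print(f"{name}: P ~ {weight(g) / total:.2f}")

# Training is then (very roughly) like drawing a goal from this distribution:
sampled = random.choices(list(goals), weights=[weight(g) for g in goals.values()])[0]
print("sampled goal:", sampled)
```

(Nothing hinges on the numbers; the point is just that a somewhat more complex internal goal can still come out on top if it doesn’t cost much performance, or, as below, even gains some.)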
So, I’m saying that the nonconsequentialist/internal goals don’t seem super complicated to me relative to the consequentialist goals; on simplicity grounds they are probably all mixed up together with some nonconsequentialist goals being simpler than many consequentialist goals. And in particular the nonconsequentialist ‘avoid being deceitful’ goal might be one such.
(Separately, sometimes nonconsequentialist/internal goals actually outperform consequentialist goals in an environment, because the rest of the agent isn’t smart enough to execute well on the consequentialist goals. E.g. a human child with the goal of ‘get highest possible grades’ might get caught cheating and end up with worse grades than if they had the goal of ‘get highest possible grades and also avoid cheating.’)
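(Here’s that same point as a toy simulation, with invented numbers: when the true catch-probability is higher than the agent’s naive estimate of it, the ‘grades plus never cheat’ policy ends up with better average grades than the pure ‘maximize grades’ policy that always cheats.)

```python
import random

random.seed(0)
CATCH_PROB = 0.4      # true probability of getting caught (assumed value)
CHEAT_BONUS = 20      # grade points gained by successful cheating (assumed)
CAUGHT_PENALTY = 60   # grade points lost when caught (assumed)
BASE_GRADE = 70

def pure_consequentialist() -> float:
    """Cheats every exam; its internal model underestimates CATCH_PROB."""
    if random.random() < CATCH_PROB:
        return BASE_GRADE - CAUGHT_PENALTY
    return BASE_GRADE + CHEAT_BONUS

def with_internal_constraint() -> float:
    """Same grade goal, plus an internal 'never cheat' rule."""
    return BASE_GRADE

trials = 10_000
avg_cheater = sum(pure_consequentialist() for _ in range(trials)) / trials
avg_honest = sum(with_internal_constraint() for _ in range(trials)) / trials
print(f"cheater avg grade: {avg_cheater:.1f}, honest avg grade: {avg_honest:.1f}")
# With these numbers the expected cheater grade is 58 vs. 70 for the honest policy.
```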
So I’m saying that for some training environments, and some nonconsequentialist goals (honesty might be a candidate?) just throwing a big ol neural net into the training environment might actually just work.
Our job, then, is to theorize more about this to get a sense of what candidate goals and training environments might be suitable for this, and then get empirical and start testing assumptions and hypotheses. I think if we had 50 more years to do this we’d probably succeed.
Exactly, I think we agree on the claim/crux. Nate would probably say that for the relevant pivotal tasks/levels of intelligence, the AI needs some general-purpose threads capable enough that cheating humans becomes easy. Maybe this is also affected by your expecting some early-training nudges (like grokking a good enough concept of deception) to stick (that is, high path dependence), while Nate expects something sharp-left-turn-ish (some general enough, partially self-reflective threads) to eventually supersede them (low path dependence)?
At this point I certainly am not confident about anything. I find Nate’s view quite plausible, I just have enough uncertainty currently that I could also imagine it being false.