I think we are in agreement here, more or less. There’s an infinitely large set of possible goals, both consequentialist and deontological/internal and maybe some other types as well. And then our training process creates an agent with a random goal subject to (a) an Occam’s Razor-like effect where simplicity is favored heavily, and (b) a performance effect where goals that help perform well in the environment are favored heavily.
So, I’m saying that the nonconsequentialist/internal goals don’t seem super complicated to me relative to the consequentialist goals; ranked by simplicity, the two kinds are probably interleaved, with some nonconsequentialist goals being simpler than many consequentialist goals. And in particular the nonconsequentialist ‘avoid being deceitful’ goal might be one of those simpler ones.
(Separately, sometimes nonconsequentialist/internal goals actually outperform consequentialist goals in an environment, because the rest of the agent isn’t smart enough to execute well on the consequentialist goals. E.g. a human child with the goal of ‘get highest possible grades’ might get caught cheating and end up with worse grades than if they had the goal of ‘get highest possible grades and also avoid cheating.’)
So I’m saying that for some training environments and some nonconsequentialist goals (honesty might be a candidate?), just throwing a big ol’ neural net into the training environment might actually work.
Our job, then, is to theorize more about this to get a sense of what candidate goals and training environments might be suitable for this, and then get empirical and start testing assumptions and hypotheses. I think if we had 50 more years to do this we’d probably succeed.
Exactly, I think we agree on the claim/crux. Nate would probably say that for the relevant pivotal tasks/levels of intelligence, the AI needs some general-purpose threads that are capable enough that cheating humans would be easy. Maybe this is also affected by you thinking some early-training nudges (like grokking a good enough concept of deception) might stick (that is, high path dependence), while Nate expects that eventually something sharp left turn-ish (some general enough, partially self-reflective threads) will supersede those nudges (low path dependence)?
At this point I certainly am not confident about anything. I find Nate’s view quite plausible; I just have enough uncertainty currently that I could also imagine it being false.