The limbic system that controls motivations such as the sex drive is much older than the relatively new neocortex that’s responsible for human intelligence.
My guess is that the limbic system evolved by trial and error over millions of years. If this is what happened, maybe we should seek out iterative methods for aligning AI systems, such as repeatedly testing and refining the motivations of sub-AGI systems.
But as Eliezer Yudkowsky says in his AGI Ruin post, you can't iteratively develop an AGI that is operating at dangerous levels of capability if each mistake kills you. Therefore, we might need to extrapolate the motivations of sub-AGI systems to superhuman systems, or solve the problem in advance with a theoretical approach.
Sure, the limbic system evolved over millions of years, but that doesn't mean we need to evolve it as well; we could just study it and reimplement it directly without (much) iteration. I am not necessarily saying this is a good approach to alignment; I personally would also prefer a more theoretically grounded one. But I think it is an interesting existence proof that it is possible in practice to imprint fairly robust drives into agents through a very low-bandwidth channel, even after a lot of experience and without much RL.