Most approaches work fine while the AI is weak and stays within the training distribution we can imagine. All known approaches fail once the intelligence being trained is much more powerful than the trainers. An obvious flaw here is that the trainer has to realize that something is wrong and needs adjusting. While the idea that the agent should “not instrumentally care about its goals” is a good… instrumental goal, it does not get us far. See the discussion about emergent mesaoptimizers, for example, where an emergent internal agent adopts an instrumental goal as a terminal one. In general, if you have the security mindset, you should be able to easily find holes in, say, your first 10 alignment-oriented proposals.
Hi, sorry for commenting on an ancient comment, but I just read it again and found that I’m not convinced the mesaoptimizer problem is relevant here. My understanding is that if you switch goals often enough, every mesaoptimizer that isn’t corrigible should be trained away, since it hurts the utility as defined.