I made some notes as I read the winner. Is Tom Everitt on here? Should I send these somewhere?
2.2: The description of ξ is a bit odd—it seems like w_ν is your actual prior over models.
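(To spell out the reading I have in mind, in my own notation which may not match yours exactly: the standard Bayesian mixture is

    ξ(x) = Σ_{ν ∈ M} w_ν · ν(x)

for a percept history x and model class M, so the actual prior over models is carried by the weights w_ν, and ξ is just the predictive distribution that prior induces.)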
2.4: I’m not totally sold on some parts where it seems like you equate agents and policies. I think the thing we want to compare is not the difference in utility given the same policy, but the difference in true utility between policies that are chosen.
If the human and the AI have the same utility-maximizing policy, they can still have nonzero “misalignment” under this definition if their utility functions have different spreads (e.g. different standard deviations of utility across policies). Conversely, even if misalignment is zero on some policy π, π might not be the policy the AI’s own utility function would actually select.
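To make that concrete (my notation, and I’m guessing at the exact definition in 2.4): if misalignment on a fixed policy is measured as something like

    MA(π) = U_H(π) − U_A(π),

then (i) identical argmaxes with differently scaled utility functions still give MA(π) ≠ 0, and (ii) MA(π) = 0 on one particular π says nothing about π_A* = argmax_π U_A(π), the policy the AI actually chooses. The quantity that seems to matter is closer to U_H(π_H*) − U_H(π_A*), i.e. how much human utility is lost when the AI follows its own optimum instead of ours.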
Should footnote 1 read CA rather than MA?
3.5: It seems like an agent is incentivized to create the extra observation systems and then encode the important facts about the world within delusional high-reward states of the original observation channel, even assuming it still has to use the original observation and action channels. But it might not even require that facade: a corruption-aware system might notice that a policy of changing its hardware, and how it uses that hardware, earns it larger rewards even according to the original utility function over percepts. Beneficial corruption, if you will.
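Schematically (my notation): if U is the intended utility function but it is evaluated on the percept stream e_{1:m} rather than on the world itself, there can be a self-modifying policy π′ with

    E[U(e_{1:m}) | π′] > E[U(e_{1:m}) | π] for every non-modifying π,

because the modified hardware can feed U exactly the percepts it scores highest, whether or not the world matches them.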
5.5: Decoupled data is nice, but I’m concerned that it doesn’t fully address the problem it’s brought in to solve, which is fundamentally a problem of induction: how to predict reward in novel situations. An agent can learn as much as it wants about human preferences over normal states of the world, and you can even tell it “stories” about unseen states, but it can still be misaligned on novel states never mentioned by humans. The issue is not just about the data the agent sees, but about how it interprets it.
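Put a bit more formally (my notation): for any finite body of decoupled data D, there are reward functions R and R′ that agree on every state D mentions but disagree elsewhere, i.e.

    R(s) = R′(s) for all s referenced in D, yet R(s) ≠ R′(s) for some novel s.

Which of the two the agent ends up with is decided by its inductive bias, not by the data, which is what I mean by the problem being about interpretation rather than quantity of data.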
5.5.3: For the counterfactual “default” policy, there are still some unsolved problems with matching human intuition. The classic example is an agent whose default policy is no output—and so it thinks that the default state of the world is the programmers acting confused and trying to fix the agent, because that’s what happens with the default policy.
5.6: (thinking about fig. 5.2) I’m hesitant to call any of these paths aligned, and I think the gap is the difference between good agent design and successfully learning human values. This is a bit related to the inductive problem that I don’t think decoupled data fully addresses: one can code up something that tries to learn from humans, successfully learns some value function, and faithfully maximizes it, and yet that value function is not the human value function. But this problem is quite ill-understood; most current work on it focuses on “corrigibility” in the intuitive sense of being willing to defer to humans even after you think you’ve learned human preferences.
Thanks for your comments, much appreciated! I’m currently in the middle of moving continents, will update the draft within a few weeks.