Thanks for the comment, Michael. Firstly, I just wanted to clarify the framing of this literature review: when considering the strengths and weaknesses of each threat model, we did so in light of what we were aiming to do (generate and prioritise alignment research projects), rather than as an all-things-considered direct critique of each work (I think that is best done by commenting directly on those articles). I'll add a clarification of that at the top. Now to your comments:
To your 1st point: I think the lack of specific assumptions about the AGI development model is both a strength and a weakness. We mention the weakness because it makes it harder to generate and prioritise research projects. It could be helpful to say more explicitly, and earlier in the article, what kind of systems you're considering: perhaps pointing to the closest current prosaic system, or explaining why current systems are nothing like what you imagine the AGI development model to be.
On your 2nd point: What I meant was more “what about goal misgeneralization? Wouldn't that mean the agent is likely not to be wireheading, but to be pursuing some other goal instead?” You hint at this at the end of the section on supervised learning, but that was in the context of whether a supervised learner would develop a misgeneralized long-term goal, and you settled on being agnostic there.
On your 3rd point: It could have been interesting to read arguments for why it would need all available energy to secure its computer, rather than satisficing at some level; or some detail on the steps by which it would build the technology to gather that energy, or how it would convert it into defence.
On the 2nd point, the whole discussion of mu^prox vs. mu^dist is fundamentally about goal (mis)generalization. My position is that for a very advanced agent, point estimates of the goal (i.e. certainty that some given account of the goal is correct) would probably significantly limit performance in many contexts. This is captured by Assumptions 2 and 3. An advanced agent is likely to entertain multiple models of what its current understanding of its goal in a familiar context implies about its goal in a novel context. Full conviction in mu^dist does indeed imply non-wireheading behavior, and I wouldn't even call it misgeneralization; I think that would be a perfectly valid interpretation of past rewards. That's why I spend so much time discussing the relative credence in those models.
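To make the role of relative credence concrete, here is a minimal toy sketch (not the formalism from the paper, and with entirely made-up payoff numbers): an expected-return maximiser that averages over mu^prox ("reward is whatever the sensor reports") and mu^dist ("reward tracks the actual state of the world") only prefers tampering with its reward signal once its credence in mu^prox is high enough, while full conviction in mu^dist implies non-wireheading behavior.

```python
# Toy illustration (hypothetical payoffs, not the paper's formalism): how an
# agent's relative credence in two reward models, mu_prox and mu_dist, can
# determine whether wireheading looks attractive.

def expected_return(action: str, credence_prox: float) -> float:
    """Expected return of an action, averaged over the two reward models."""
    # Assumed returns under each model:
    #   - Tampering with the reward signal scores highly under mu_prox but
    #     poorly under mu_dist (the tampered signal no longer tracks the goal).
    #   - Pursuing the intended task scores moderately well under both.
    returns = {
        "wirehead":    {"mu_prox": 1.0, "mu_dist": 0.0},
        "pursue_task": {"mu_prox": 0.7, "mu_dist": 0.7},
    }
    r = returns[action]
    return credence_prox * r["mu_prox"] + (1 - credence_prox) * r["mu_dist"]

def chosen_action(credence_prox: float) -> str:
    """The action an expected-return maximiser picks at this credence."""
    return max(["wirehead", "pursue_task"],
               key=lambda a: expected_return(a, credence_prox))

if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(f"credence in mu_prox = {p:.1f} -> {chosen_action(p)}")
    # With these made-up numbers the agent only wireheads once its credence in
    # mu_prox exceeds 0.7; at full conviction in mu_dist it never wireheads.
```

The point of the sketch is just that the behavior flips at some credence threshold, which is why the relative credence in the two models, rather than either model on its own, is what the discussion turns on.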