Thanks! This article was primarily addressing some specific concerns around alignment, such as Goodharting and models going out of distribution. The various people worried about those specific issues somehow never engaged with it, and the people with different concerns about alignment unsurprisingly weren’t interested.
This was basically me persuading myself these things weren’t important concerns. My thinking on alignment has moved on a bit since then, but I stand by the basic point: an AGI can only FOOM if it can do STEM, and a number of the alignment failure modes that people on LessWrong have spent a lot of time worrying about have the property that anything suffering from them clearly can’t do STEM reliably, and therefore isn’t an x-risk. Examples include Goodharting your optimizer, using models outside their region of validity, and being unable to cope with ontology paradigm shifts. Some people need to stop treating agents created by RL as their baseline x-risk and start thinking instead about agents that can do STEM well. That additional assumption tells you more about the agent, making it less abstract and giving you more information with which to reason about its behavior.
More generally, for an agent that’s an AGI value learner, if you’re concerned about an alignment failure mode that seems obvious to you, you then need to explain why it plausibly wouldn’t also be obvious to or fixable by the agent before it left the basin of attraction of value learning.