I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).
I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem.
I agree Boltzmann rationality (over the action space of, say, “muscle movements”) is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including “things that humans say”, and the human can just tell you that hyperslavery is really bad. Obviously you can’t trust everything that humans say, but it seems plausible that if we spent a bunch of time figuring out a good observation model, that would then lead to okay outcomes.
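As a toy illustration of why the extra information source matters, here is a minimal sketch of a Bayesian update that first uses only a Boltzmann-rational model of behaviour and then adds a noisy model of what the human says. Everything in it is an assumption made up for illustration: the two reward hypotheses, the action names, the rationality parameter `beta`, and the 0.9 / 0.1 speech likelihoods. It is not meant as a real observation model, just a picture of how speech evidence can outweigh misleading behavioural evidence.

```python
import math

def boltzmann_likelihood(action, rewards, beta=1.0):
    """P(action | rewards) proportional to exp(beta * reward of action)."""
    normaliser = sum(math.exp(beta * r) for r in rewards.values())
    return math.exp(beta * rewards[action]) / normaliser

def bayes_update(prior, likelihoods):
    """Multiply prior by likelihood per hypothesis, then renormalise."""
    unnormalised = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}

# Hypothetical reward hypotheses over two hypothetical actions.
HYPOTHESES = {
    "situation_is_bad": {"comply": 0.0, "resist": 1.0},
    "situation_is_fine": {"comply": 1.0, "resist": 0.0},
}
prior = {"situation_is_bad": 0.5, "situation_is_fine": 0.5}

# Evidence 1: the human is observed to comply (perhaps because they are
# coerced, which this naive behaviour model does not capture).
behaviour_lik = {
    h: boltzmann_likelihood("comply", rewards, beta=1.0)
    for h, rewards in HYPOTHESES.items()
}
after_behaviour = bayes_update(prior, behaviour_lik)

# Evidence 2: the human says "this is really bad". Assumed likelihoods of
# hearing that statement under each hypothesis (speech is noisy, not trusted
# fully).
speech_lik = {"situation_is_bad": 0.9, "situation_is_fine": 0.1}
after_both = bayes_update(after_behaviour, speech_lik)

print("behaviour only:", after_behaviour)
print("behaviour + speech:", after_both)
```

With these made-up numbers, behaviour alone pushes the posterior towards "situation_is_fine", and adding the statement pushes it back towards "situation_is_bad". How well this works in practice depends entirely on how good the speech likelihoods are, which is the "good observation model" problem.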
(Ideally you’d figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of “getting a good observation model” while you still have the ability to turn off the model. It’s hard to say exactly what that would look like since I don’t have a great sense of how you get AGI capabilities under the non-ML story.)