Yeah, I agree it is doable in some environments, but if I imagine a world with AGI in it that’s good at aggregating human preferences, I’d be pretty shocked if this happened via the AGI asking humans to provide actions into some system that both the human and AGI understand and then sometimes overriding those actions in order to provide the right incentives to the human.
It’s more plausible if this happens just at training, but I expect that we’ll want our AI systems to be learning and aggregating our values all the time, not just during the initial training.