Does the notion of “low-impact” break down, though, if humans are eventually going to use the results from these experiments to build high-impact AI?
I think the notion doesn’t break down. The low-impact AI hasn’t changed human attainable utilities by the end of the experiments. If we eventually build a high-impact AI, that seems “on us.” The low-impact AI itself hasn’t done something bad to us. I therefore think the concept I spelled out still works in this situation.
As I mentioned in the other comment, I don’t feel optimistic about actually designing these AIs via explicit low-impact objectives, though.
It seems like evaluating human AU depends on how the human is modeled. There's a "black box" sense in which you can substitute literally any policy for the human's when calculating AU for different objectives, and a "transparent box" sense in which you have to draw from a distribution of predicted human behaviors.
The former is closer to what I think you mean by “hasn’t changed the humans’ AU,” but I think it’s the latter that an AI cares about when evaluating the impact of its own actions.
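To make the distinction concrete, here is roughly what I have in mind (the notation is my own, and the exact formalization is an assumption on my part rather than something you've committed to). In the black-box sense, the human's slot can be filled by any policy at all:

$$\text{AU}_{\text{black}}(s, u) = \max_{\pi} \; \mathbb{E}\!\left[\sum_{t} \gamma^{t} u(s_t) \;\middle|\; s_0 = s, \pi \right]$$

In the transparent-box sense, the AI evaluates AU under its own distribution $P(\pi_H)$ over predicted human policies:

$$\text{AU}_{\text{trans}}(s, u) = \mathbb{E}_{\pi_H \sim P(\pi_H)}\!\left[\, \mathbb{E}\!\left[\sum_{t} \gamma^{t} u(s_t) \;\middle|\; s_0 = s, \pi_H \right]\right]$$

The first quantity is unaffected by anything that merely changes which policy the human will in fact follow, while the second can move whenever the AI's predictions about human behavior shift.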
I want to clarify something.
I’m discussing a philosophical framework for understanding low impact. I’m not prescribing how the AI actually accomplishes this.