It seems like one downside of impact in the AU sense is that in order to figure out whether an action has high impact, the AI needs to have a detailed understanding of human values and the ontology used by humans. (This is in contrast to the state-based measures of impact, where calculating the impact of a state change seems easier.) Without such an understanding, the AI seems to either do nothing (in order to prevent itself from causing bad kinds of high impact) or make a bunch of mistakes. So my feeling is that in order to actually implement an AI that does not cause bad kinds of high impact, we would need to make progress on value learning (but once we’ve made progress on value learning, it’s not clear to me what AU theory adds in terms of increased safety).
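To make the contrast in the comment above concrete, here is a minimal toy sketch (an editorial illustration, not something from the sequence or the AUP paper): a state-based impact measure only compares world states, while an AU-style measure of impact on humans needs some estimate of how well the human can achieve their goals, i.e. some amount of value learning. The state representation and the value model below are hypothetical stand-ins.

```python
# Toy sketch of the contrast above. All names and numbers are hypothetical.
from typing import Callable, Dict

State = Dict[str, int]  # toy world state: feature name -> value

def state_based_impact(before: State, after: State) -> int:
    """Count how many state features changed; no model of human values needed."""
    return sum(1 for k in before if before[k] != after.get(k))

def au_impact_on_human(before: State, after: State,
                       human_au: Callable[[State], float]) -> float:
    """Change in the human's attainable utility; requires a value model."""
    return abs(human_au(after) - human_au(before))

if __name__ == "__main__":
    before = {"vase_intact": 1, "door_open": 0}
    after = {"vase_intact": 0, "door_open": 1}
    print(state_based_impact(before, after))  # 2: two features changed
    # A made-up value model saying the human mostly cares about the vase:
    print(au_impact_on_human(before, after, lambda s: 5.0 * s["vase_intact"]))  # 5.0
```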
So my feeling is that in order to actually implement an AI that does not cause bad kinds of high impact, we would need to make progress on value learning
Optimizing for a ‘slightly off’ utility function might be catastrophic, and therefore the margin for error for value learning could be narrow. However, it seems plausible that if your impact measurement used slightly incorrect utility functions to define the auxiliary set, this would not cause a similar error. Thus, it seems intuitive to me that impact measures could work with less progress on value learning than a full solution would require.
From the AUP paper: “one of our key findings is that AUP tends to preserve the ability to optimize the correct reward function even when the correct reward function is not included in the auxiliary set.”
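Since this reply leans on the auxiliary set, a minimal sketch of an AUP-style penalty may help: the penalty averages the change in attainable utility, relative to doing nothing, over the auxiliary reward functions, so a few slightly-off auxiliary Q-functions plausibly still flag large-impact actions. The environment, actions, Q-functions, and constants below are toy stand-ins, not the paper’s implementation.

```python
# Minimal sketch of an AUP-style penalty, assuming the form in the AUP paper:
# average |Q_i(s, a) - Q_i(s, noop)| over an auxiliary set of reward functions,
# subtracted (scaled by lambda) from the primary task reward.
# Everything concrete below (states, actions, Q-values, noise) is a hypothetical toy.
import random
from typing import Callable, List

QFunction = Callable[[str, str], float]  # (state, action) -> attainable utility

def aup_penalty(q_aux: List[QFunction], state: str, action: str,
                noop: str = "noop") -> float:
    """Average change in attainable utility caused by `action` vs. doing nothing."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_aux) / len(q_aux)

def aup_reward(primary: float, penalty: float, lam: float = 1.0) -> float:
    """Primary task reward minus the scaled side-effect penalty."""
    return primary - lam * penalty

if __name__ == "__main__":
    random.seed(0)

    # 'Slightly off' auxiliary Q-functions: each is a noisy guess that smashing
    # the vase changes attainable utility a lot, while walking around does not.
    def make_noisy_q(base: float) -> QFunction:
        noise = random.uniform(-0.1, 0.1)
        return lambda s, a: (base + noise) * (2.0 if a == "smash_vase" else 1.0)

    q_aux = [make_noisy_q(b) for b in (0.5, 1.0, 1.5)]
    print(aup_penalty(q_aux, "start", "smash_vase"))   # roughly 1.0: penalized
    print(aup_penalty(q_aux, "start", "walk_around"))  # 0.0: not penalized
```

The point being illustrated: because the penalty only compares each action against inaction under each auxiliary Q-function, small errors in those functions shift the penalty a little, rather than redirecting optimization pressure the way an error in the primary utility function would.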
I appreciate this clarification, but when I wrote my comment I hadn’t read the original AUP post or the paper, since I assumed this sequence was supposed to explain AUP starting from scratch (so I didn’t yet have the idea of an auxiliary set).
It is meant to explain starting from scratch, so no worries! To clarify, although I agree with Matthew’s comment, I’ll later explain how value learning (or progress therein) is unnecessary for the approach I think is most promising.
It seems like one downside of impact in the AU sense
Even in worlds where we wanted to build a low impact agent that did something with the state, we’d still want to understand what people actually find impactful. (I don’t think we’re in such a world, though.)
in order to figure out whether an action has high impact, the AI needs to have a detailed understanding of human values and the ontology used by humans.
Let’s review what we want: we want an agent design that isn’t incentivized to catastrophically impact us. You’ve observed that directly inferring value-laden AU impact on humans seems pretty hard, so maybe we shouldn’t do that. What’s a better design? How can we reframe the problem so the solution is obvious?
Let me give you a nudge in the right direction (which will be covered starting two posts from now; that part of the sequence won’t be out for a while unfortunately):
Why are goal-directed AIs incentivized to catastrophically impact us—why is there selection pressure in this direction? Would they be incentivized to catastrophically impact Pebblehoarders?
it’s not clear to me what AU theory adds in terms of increased safety
AU theory is descriptive; it’s about why people find things impactful. We haven’t discussed what we should implement yet.