It seems like one downside of impact in the AU sense
Even in worlds where we wanted to build a low-impact agent that measured impact in terms of the world state, we’d still want to understand what people actually find impactful. (I don’t think we’re in such a world, though.)
in order to figure out whether an action has high impact, the AI needs to have a detailed understanding of human values and the ontology used by humans.
Let’s review what we want: an agent design that isn’t incentivized to catastrophically impact us. You’ve observed that directly inferring value-laden AU impact on humans seems pretty hard, so maybe we shouldn’t do that. What’s a better design? How can we reframe the problem so the solution is obvious?
Let me give you a nudge in the right direction (this will be covered starting two posts from now; unfortunately, that part of the sequence won’t be out for a while):
Why are goal-directed AIs incentivized to catastrophically impact us—why is there selection pressure in this direction? Would they be incentivized to catastrophically impact Pebblehoarders?
it’s not clear to me what AU theory adds in terms of increased safety
AU theory is descriptive; it’s about why people find things impactful. We haven’t discussed what we should implement yet.
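(For concreteness, here is one informal way to write down the descriptive AU idea; the notation $u_A$, $\pi$, and $\gamma$ is my own shorthand rather than anything official. The claim is just that an event feels impactful to an agent to the extent that it changes how well that agent can get what it wants.)

$$\mathrm{AU}_A(s) \;=\; \max_{\pi}\ \mathbb{E}\!\left[\sum_t \gamma^t\, u_A(s_t)\ \middle|\ s_0 = s,\ \pi\right], \qquad \mathrm{Impact}_A(e) \;\approx\; \bigl|\mathrm{AU}_A(\text{after } e) - \mathrm{AU}_A(\text{before } e)\bigr|.$$

None of this says anything yet about what the agent should compute; it’s a description of why events register as impactful to us.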