It seems like one downside of impact in the AU sense is that in order to figure out whether an action has high impact, the AI needs to have a detailed understanding of human values and the ontology used by humans. (This is in contrast to the state-based measures of impact, where calculating the impact of a state change seems easier.) Without such an understanding, the AI seems to either do nothing (in order to prevent itself from causing bad kinds of high impact) or make a bunch of mistakes. So my feeling is that in order to actually implement an AI that does not cause bad kinds of high impact, we would need to make progress on value learning (but once we’ve made progress on value learning, it’s not clear to me what AU theory adds in terms of increased safety).
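To make the contrast in the comment above concrete, here is a minimal toy sketch (an editorial illustration, not something from the sequence or the AUP paper): a state-based impact measure only compares world states, while an AU-style measure of impact on humans needs some estimate of how well the human can achieve their goals, i.e. some amount of value learning. The state representation and the value model below are hypothetical stand-ins.

```python
# Toy sketch of the contrast above. All names and numbers are hypothetical.
from typing import Callable, Dict

State = Dict[str, int]  # toy world state: feature name -> value

def state_based_impact(before: State, after: State) -> int:
    """Count how many state features changed; no model of human values needed."""
    return sum(1 for k in before if before[k] != after.get(k))

def au_impact_on_human(before: State, after: State,
                       human_au: Callable[[State], float]) -> float:
    """Change in the human's attainable utility; requires a value model."""
    return abs(human_au(after) - human_au(before))

if __name__ == "__main__":
    before = {"vase_intact": 1, "door_open": 0}
    after = {"vase_intact": 0, "door_open": 1}
    print(state_based_impact(before, after))  # 2: two features changed
    # A made-up value model saying the human mostly cares about the vase:
    print(au_impact_on_human(before, after, lambda s: 5.0 * s["vase_intact"]))  # 5.0
```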
So my feeling is that in order to actually implement an AI that does not cause bad kinds of high impact, we would need to make progress on value learning
Optimizing for a ‘slightly off’ utility function might be catastrophic, and therefore the margin for error for value learning could be narrow. However, it seems plausible that if your impact measurement used slightly incorrect utility functions to define the auxiliary set, this would not cause a similar error. Thus, it seems intuitive to me that impact measures could work with less progress on value learning than a full solution would require.
From the AUP paper: “one of our key findings is that AUP tends to preserve the ability to optimize the correct reward function even when the correct reward function is not included in the auxiliary set.”
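Since this reply leans on the auxiliary set, a minimal sketch of an AUP-style penalty may help: the penalty averages the change in attainable utility, relative to doing nothing, over the auxiliary reward functions, so a few slightly-off auxiliary Q-functions plausibly still flag large-impact actions. The environment, actions, Q-functions, and constants below are toy stand-ins, not the paper’s implementation.

```python
# Minimal sketch of an AUP-style penalty, assuming the form in the AUP paper:
# average |Q_i(s, a) - Q_i(s, noop)| over an auxiliary set of reward functions,
# subtracted (scaled by lambda) from the primary task reward.
# Everything concrete below (states, actions, Q-values, noise) is a hypothetical toy.
import random
from typing import Callable, List

QFunction = Callable[[str, str], float]  # (state, action) -> attainable utility

def aup_penalty(q_aux: List[QFunction], state: str, action: str,
                noop: str = "noop") -> float:
    """Average change in attainable utility caused by `action` vs. doing nothing."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_aux) / len(q_aux)

def aup_reward(primary: float, penalty: float, lam: float = 1.0) -> float:
    """Primary task reward minus the scaled side-effect penalty."""
    return primary - lam * penalty

if __name__ == "__main__":
    random.seed(0)

    # 'Slightly off' auxiliary Q-functions: each is a noisy guess that smashing
    # the vase changes attainable utility a lot, while walking around does not.
    def make_noisy_q(base: float) -> QFunction:
        noise = random.uniform(-0.1, 0.1)
        return lambda s, a: (base + noise) * (2.0 if a == "smash_vase" else 1.0)

    q_aux = [make_noisy_q(b) for b in (0.5, 1.0, 1.5)]
    print(aup_penalty(q_aux, "start", "smash_vase"))   # roughly 1.0: penalized
    print(aup_penalty(q_aux, "start", "walk_around"))  # 0.0: not penalized
```

The point being illustrated: because the penalty only compares each action against inaction under each auxiliary Q-function, small errors in those functions shift the penalty a little, rather than redirecting optimization pressure the way an error in the primary utility function would.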
I appreciate this clarification, but when I wrote my comment I hadn’t read the original AUP post or the paper, since I assumed this sequence was supposed to explain AUP starting from scratch (so I didn’t yet have the idea of an auxiliary set).
It is meant to explain starting from scratch, so no worries! To clarify, although I agree with Matthew’s comment, I’ll later explain how value learning (or progress therein) is unnecessary for the approach I think is most promising.
It seems like one downside of impact in the AU sense
Even in worlds where we wanted to build a low impact agent that did something with the state, we’d still want to understand what people actually find impactful. (I don’t think we’re in such a world, though.)
in order to figure out whether an action has high impact, the AI needs to have a detailed understanding of human values and the ontology used by humans.
Let’s review what we want: we want an agent design that isn’t incentivized to catastrophically impact us. You’ve observed that directly inferring value-laden AU impact on humans seems pretty hard, so maybe we shouldn’t do that. What’s a better design? How can we reframe the problem so the solution is obvious?
Let me give you a nudge in the right direction (which will be covered starting two posts from now; that part of the sequence won’t be out for a while unfortunately):
Why are goal-directed AIs incentivized to catastrophically impact us—why is there selection pressure in this direction? Would they be incentivized to catastrophically impact Pebblehoarders?
it’s not clear to me what AU theory adds in terms of increased safety
AU theory is descriptive; it’s about why people find things impactful. We haven’t discussed what we should implement yet.