TurnTrout comments on Attainable Utility Preservation: Scaling to Superhuman

TurnTrout 27 Feb 2020 17:44 UTC
LW: 4 AF: 3
AF

> Again, I worry that patches are based a lot on intuition.

If you want your math to abstractly describe reality in a meaningful sense, intuition has to enter somewhere (usually in how you formally define and operationalize the problem of interest). Therefore, I’m interpreting this as “I don’t see good principled intuitions behind the improvements”; please let me know if this is not what you meant.

I claim that, excepting the choice of denominator, all of the improvements follow directly from $\text{AUP}_{\text{conceptual}}$ (and actually, eq. 1 was the equation with arbitrary choices wrt the AGI case; I started with that because that’s how my published work formalizes the problem).

CCC says catastrophes are caused by power-seeking behavior from the agent. Agents are only incentivized to pursue power in order to better achieve their own goals. Therefore, the correct equation should look something like “do your primary goal but be penalized for becoming more able to achieve your primary goal”. In this light, penalizing $R$-AU is obviously better than using an auxiliary goal, penalizing decreases is obviously irrelevant, and penalizing immediate reward advantage is obviously irrelevant.

The denominator, on the other hand, is indeed the product of meditating on “What kind of elegant rescaling keeps making sense in all sorts of different situations, but also can’t be gamed to arbitrarily decrease the penalty?”.
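
As a rough illustration only, the shape described above (“do your primary goal but be penalized for becoming more able to achieve your primary goal”) could be sketched as follows. The notation is an assumption made for this sketch, not a reproduction of any numbered equation in the post: $Q^*_R$ stands for the attainable utility (optimal action-value) for the primary reward $R$, $\varnothing$ is a no-op action, $\lambda$ is the tradeoff parameter, and $Z(s)$ is a placeholder for whatever rescaling denominator is chosen.

$$R_{\text{AUP}}(s,a) \;:=\; R(s,a) \;-\; \lambda\,\frac{\max\!\big(Q^*_R(s,a) - Q^*_R(s,\varnothing),\; 0\big)}{Z(s)}$$

The $\max(\cdot, 0)$ reflects penalizing only increases in $R$-AU, not decreases. This sketch does not model the removal of immediate reward advantage from the penalty term, and it leaves $Z(s)$ unspecified, since the denominator is the one choice not claimed to follow directly from the conceptual argument.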
What links here?
- Rohin Shah's comment on rohinmshah’s Shortform by Rohin Shah (17 Mar 2020 1:20 UTC; 7 points)
- Charlie Steiner 27 Feb 2020 18:40 UTC
  LW: 4 AF: 2
  AF Parent
  Right. Some intuition is necessary. But a lot of these choices are ad hoc, by which I mean they aren’t strongly constrained by the result you want from them.
  For example, you have a linear penalty governed by this parameter lambda, but in principle it could have been any old function—the only strong constraint is that you want it to monotonically increase from a finite number to infinity. Now, maybe this is fine, or maybe not. But I basically don’t have much trust for meditation in this sort of case, and would rather see explicit constraints that rule out more of the available space.
  - TurnTrout 27 Feb 2020 21:06 UTC
    LW: 4 AF: 2
    AF Parent
    
    > I basically don’t have much trust for meditation in this sort of case
    
    I’m not asking you to trust in anything, which is why I emphasized that I want people to think more carefully about these choices. I do not think eq. 5 is AGI-safe. I do not think you should put it in an AGI. Do I think there’s a chance it might work? Yes. But we don’t work with “chances”, so it’s not ready.
    
    Anyways, if theorem 11 of the low-hanging fruit post is met, the tradeoff penalty works fine. I also formally explored the hard constraint case and discussed a few reasons why the tradeoff is preferable to the hard constraint. Therefore, I think that particular design choice is reasonably determined. Would you want to think about this more before actually running an AGI with that choice? Of course.
    
    To your broader point, I think there may be another implicit frame difference here. I’m talking about the diff of the progress, considering questions like “are we making a lot of progress? What’s the marginal benefit of more research like this? Are we getting good philosophical returns from this line of work?”, to which I think the answer is yes.
    
    On the other hand, you might be asking “are we there yet?”, and I think the answer to that is no. Notice how these answers don’t contradict each other.
    
    From the first frame, being skeptical because each part of the equation isn’t fully determined seems like an unreasonable demand for rigor. I wrote this sequence because it seemed that my original AUP post was pedagogically bad (I was already thinking about concepts like “overfitting the AU landscape” back in August 2018) and so very few people understood what I was arguing.
    
    I’d like to think that my interpretive labor has paid off: AUP isn’t a slapdash mixture of constraints which is too complicated to be obviously broken; it’s attempting to directly disincentivize catastrophes based on straightforward philosophical reasoning, relying on assumptions and conjectures which I’ve clearly stated. In many cases, I waited weeks so I could formalize my reasoning in the context of MDPs (e.g. why should you think of the AU landscape as a ‘dual’ to the world state? Because I proved it).
    
    There’s always another spot where I could make my claims more rigorous, where I could gather just a bit more evidence. But at some point I have to actually put the posts up, and I think I’ve provided some pretty good evidence in this sequence.
    
    From the second frame, being skeptical because each part of the equation isn’t fully determined is entirely appropriate and something I encourage.
    
    I think you’re writing from something closer to the second frame, but I don’t know for sure. For my part, this sequence has been arguing from the first frame: “towards a new impact measure”, and that’s why I’ve been providing pushback.