paulfchristiano comments on Clarifying “AI Alignment”

paulfchristiano 26 Nov 2018 22:00 UTC
LW: 2 AF: 1
> I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
10x worse was originally my estimate for cost-effectiveness, not for total value at risk.
People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
- Wei Dai 27 Nov 2018 0:46 UTC
  LW: 4 AF: 2
  
  > People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
  
  It’s not obvious that applies here. If people don’t care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people’s values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation from other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.
  
  As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figure out what their true or normative values are, or protect their values against manipulation) than to convince all the potential users to want that themselves.
  - paulfchristiano 28 Nov 2018 2:00 UTC
    LW: 2 AF: 1
    I agree that:
    If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
    A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).
    I think that both
    (a) Trying to have influence over aspects of value change that people don’t much care about, and
    (b) better understanding the important processes driving changes in values
    are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it’s worth being thoughtful about that.)
    (I don’t agree with the sign of the effect described in your comment, but I don’t think it’s an important point; it may just be a disagreement about what else we are holding equal, so it seems good to drop.)
    - Vladimir_Nesov 28 Nov 2018 4:08 UTC
      2 points
      
      > Trying to have influence over aspects of value change that people don’t much care about … [is] reasonable … to do to make the future better
      
      This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, one that hasn’t experienced any value change and that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) are what’s relevant.
      
      It’s agent tiling for AI+controller agents: any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, but still from the point of view of unchanged original values (to the extent that they are defined at all).