paulfchristiano comments on Clarifying “AI Alignment”

paulfchristiano 28 Nov 2018 2:00 UTC
LW: 2 AF: 1
AF
I agree that:
- If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
- A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).
I think that both
- (a) Trying to have influence over aspects of value change that people don’t much care about, and
- (b) better understanding the important processes driving changes in values
are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it’s worth being thoughtful about that.)
(I don’t agree with the sign of the effect described in your comment, but don’t think it’s an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)
- Vladimir_Nesov 28 Nov 2018 4:08 UTC
  2 points
  Parent
  
  Trying to have influence over aspects of value change that people don’t much care about … [is] reasonable … to do to make the future better
  
  This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn’t experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) is what’s relevant.
  
  It’s agent tiling for AI+controller agents, any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, but still from the point of view of unchanged original values (to the extent that they are defined at all).