I was talking specifically about algorithms that build a model of a human and then optimize over that model in order to do useful algorithmic work (e.g. modeling human translation quality and then choosing the optimal translation).
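For concreteness, here is a minimal sketch of that pattern in Python, using translation as the example; the names (e.g. predicted_human_quality) are illustrative stand-ins for a learned model of human judgments, not any particular system.

```python
# Sketch of "build a model of a human, then optimize over that model":
# a learned predictor of human translation-quality judgments is used to
# pick the best candidate. `predicted_human_quality` is an assumed,
# illustrative callable, not a real API.

from typing import Callable, List

def choose_translation(
    source: str,
    candidates: List[str],
    predicted_human_quality: Callable[[str, str], float],
) -> str:
    """Return the candidate that the learned quality model scores highest.

    The optimization pressure is applied to the model of the human,
    not to the human directly."""
    return max(candidates, key=lambda t: predicted_human_quality(source, t))
```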
i.e., how much external support / error correction do humans need, or more generally how benign of an environment do we have to place them in, before we can expect them to eventually figure out their “actual” values
I still don’t get your position on this point, but we seem to be going around a bit in circles. Probably the most useful thing would be responding to Jessica’s hypothetical about putting humanity in a box.
searching for fundamental obstructions to aligned AI
I am not just looking for an aligned AI; I am looking for a transformation from (arbitrary AI) --> (aligned AI). I think the thing to understand is what kinds of algorithms can plausibly give you AI systems without being repurposable into aligned ones (with O(1) work). I think this is the kind of problem for which you are either going to get a positive or a negative answer. That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it.
(This example with optimizing over human translations seems like it could well be an insurmountable obstruction, implying that my most ambitious goal is impossible.)
I don’t understand Paul and MIRI’s reasons for being as optimistic as they each are on this issue.
I believe that both of us think that what you perceive as a problem can be sidestepped (I think this is the same issue we are going in circles around).
It seems unlikely to me that alignment to complex human values comes for free.
The hope is to do a sublinear amount of additional work, not to get it for free.
It’s more of an argument for simultaneously trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed
It seems like we are roughly on the same page, but I am more optimistic about discovering either a positive or a negative answer, and so I think this approach is the highest-leveraged thing to work on and you don’t.
I think that cognitive or institutional enhancement is also a contender, as is getting our house in order, even if our only goal is dealing with AI risk.
I still don’t get your position on this point, but we seem to be going around a bit in circles.
Yes, my comment was aimed more at other people, who I’m hoping can provide their own views on these issues. (It’s kind of strange that more people haven’t commented on your ideas online. I’ve asked to be invited to any future MIRI workshops discussing them, in case most of the discussions are happening offline.)
I am not just looking for an aligned AI; I am looking for a transformation from (arbitrary AI) --> (aligned AI). I think the thing to understand is what kinds of algorithms can plausibly give you AI systems without being repurposable into aligned ones (with O(1) work).
Can you be more explicit and formal about what you’re looking for? Is it a transformation T, such that for any AI A, T(A) is an aligned AI as efficient as A, and applying T amounts to O(1) of work? (O(1) relative to what variable? The work that originally went into building A?)
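To make that reading explicit (just one candidate formalization, with aligned, cost, and work left as informal placeholders, and “as efficient as” read as “no more costly than”):

```latex
\exists\, T \;\; \forall\, A:\quad
\mathrm{aligned}\big(T(A)\big)
\;\wedge\; \mathrm{cost}\big(T(A)\big) \le \mathrm{cost}(A)
\;\wedge\; \mathrm{work}(T, A) = O(1)
```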
If that’s what you mean, then it seems obvious that T doesn’t exist, but I don’t know how else to interpret your statement.
That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it.
I don’t understand why this disjunction is true, which might be because of my confusion above. Also, even if you found a hard-to-align design and an argument for why we can’t align it, that doesn’t show that aligned AIs can’t be competitive with unaligned AIs (in order to convince others to coordinate, as Jessica wrote). The people who need convincing will just think there are almost certainly other ways to build a competitive aligned AI that don’t involve transforming the hard-to-align design.
Can you be more explicit and formal about what you’re looking for?
Consider some particular research program that might yield powerful AI systems, e.g. (search for better model classes for deep learning, search for improved optimization algorithms, deploy these algorithms on increasingly large hardware). For each such research program I would like to have some general recipe that takes as input the intermediate products of that program (i.e. the hardware and infrastructure, the model class, the optimization algorithms) and uses them to produce a benign AI which is competitive with the output of the research program. The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).
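As an interface sketch only (every name below is a placeholder for whatever a given research program actually produces, not a real library), the recipe would look roughly like this:

```python
# Interface sketch of the hoped-for "recipe": it consumes the intermediate
# products of a capabilities research program and returns a benign agent
# competitive with that program's intended output. All names here are
# illustrative assumptions.

from dataclasses import dataclass
from typing import Any

@dataclass
class IntermediateProducts:
    hardware_and_infrastructure: Any  # e.g. the training cluster
    model_class: Any                  # e.g. a family of deep learning models
    optimization_algorithm: Any       # e.g. the training procedure

@dataclass
class BenignAgent:
    policy: Any        # intended to be competitive with the program's output
    extra_cost: float  # the hope: grows sublinearly in the program's effort

def alignment_recipe(products: IntermediateProducts) -> BenignAgent:
    """One recipe per research program: reuse the program's own intermediate
    products to build a benign agent competitive with its output."""
    raise NotImplementedError("finding such a recipe is the research goal")
```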
I suspect this is possible for some research programs and not others. I expect there are some programs for which this goal is demonstrably hopeless. I think that those research programs need to be treated with care. Moreover, if I had a demonstration that a research program was dangerous in this way, I expect that I could convince people that it needs to be treated with care.
The people who need convincing will just think there are almost certainly other ways to build a competitive aligned AI that don’t involve transforming the hard-to-align design.
Yes, at best someone might agree that a particular research program is dangerous/problematic. That seems like enough though—hopefully they could either be convinced to pursue other research programs that aren’t problematic, or would continue with the problematic research program and could then agree that other measures are needed to avert the risk.
If an AI causes its human controller to converge to false philosophical conclusions (especially ones relevant to their values), either directly through its own actions or indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign. But given our current lack of metaphilosophical understanding, how do you hope to show that any particular AI (e.g., the output of a proposed transformation/recipe) won’t cause that? Or is the plan to accept a lower burden of proof, namely assume that the AI is benign as long as no one can show that it does cause its human controller to converge to false philosophical conclusions?
The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).
If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI (at sublinear additional cost). Similarly, even if no such recipe existed for project A or for project B individually, one might still exist for A+B. It seems like to make a meaningful statement you have to treat the entire world as one big research program. Do you agree?
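To spell the worry out in symbols (purely illustrative notation: c(P) is the cost of the best unaligned system buildable from project P’s intermediate products, c_benign(P) the cost of the best competitive benign system a recipe yields from those same products, and effort(P) the effort invested in P):

```latex
c_{\mathrm{benign}}(A) \le c(A) + o(\mathrm{effort}(A))
\;\wedge\;
c_{\mathrm{benign}}(B) \le c(B) + o(\mathrm{effort}(B))
\quad\not\Longrightarrow\quad
c_{\mathrm{benign}}(A{+}B) \le c(A{+}B) + o(\mathrm{effort}(A) + \mathrm{effort}(B))
```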
If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI
My hope is to get technique A working, then get technique B working, and then get A+B working, and so on, prioritizing based on empirical guesses about what combinations will end up being deployed in practice (and hoping to develop general understanding that can be applied across many programs and combinations). I expect that in many cases, if you can handle A and B you can handle A+B, though some interactions will certainly introduce new problems. This program doesn’t have much chance of success without new abstractions.
indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign
This is definitely benign on my accounting. There is a further question of how well you do in conflict. A window is benign but won’t protect you from inputs that will drive you crazy. The hope is that if you have an AI that is benign + powerful then you may be OK.
directly through its own actions
If the agent is trying to implement deliberation in accordance with the user’s preferences about deliberation, then I want to call that benign. There is a further question of whether we mess up deliberation, which could happen with or without AI. We would like to set things up in such a way that we aren’t forced to deliberate earlier than we would otherwise want to. (And this is included in the user’s preferences about deliberation, i.e. a benign AI will be trying to secure for the user the option of deliberating later, if the user believes that deliberating later is better than deliberating in concert with the AI now.)
Malign just means “actively optimizing for something bad.” The hope is to avoid that, but this doesn’t rule out other kinds of problems (e.g. causing deliberation to go badly due to insufficient competence, blowing up the world due to insufficient competence, etc.)
Overall, my current best guess is that this disagreement is better to pursue after my research program is further along: once we know things like whether “benign” makes sense as an abstraction, once I have considered some cases where benign agents necessarily seem to be less efficient, and so on.
I am still interested in arguments that might (a) convince me to not work on this program, e.g. because I should be working on alternative social solutions, or (b) convince others to work on this program, e.g. because they currently don’t see how it could succeed but might work on it if they did, or (c) which clarify the key obstructions for this research program.