First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it may be long relative to the time until the risk materializes.
Okay. What’s the argument that the risk is great (I assume this means “very bad” and not “very likely” since by hypothesis it is unlikely), or that we need a lot of time to solve it?
Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is.
I agree with this; I don’t think this is one of our cruxes. (I do think that in most cases, if we have all the information about the situation, it will be fairly clear whether something is intent aligned or not, but certainly there are situations in which it’s ambiguous. I think corrigibility is better-informally-defined, though still there will be ambiguous situations.)
Is IRL intent aligned?
Depends on the details, but the way you describe it, no, it isn’t. (Though I can see the fuzziness here.) I think it is especially clear that it is not corrigible.
It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.
Yup, I agree (with the caveat that it doesn’t have to be a human’s interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.
If the bridge can never collapse as long as the outside forces don’t exceed K, then resonance is covered as well (as long as it is produced by forces below K).
I meant that K was set considering wind forces, cars, etc. and was set too low to account for resonance, because you didn’t think about resonance beforehand.
(I guess resonance doesn’t involve large forces, it involves coordinated forces. The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn’t think of it.)
Nevertheless most engineering projects make heavy use of theory.
I’m not denying that? I’m not arguing against theory in general; I’m arguing against theoretical safety guarantees. I think in practice our confidence in safety often comes from empirical tests.
I’m curious whether proving a weakly feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?
Probably? Honestly, I don’t think you even need to prove the subjective regret bound; if you wrote down assumptions that I agree are realistic and capture safety (such that you could write code that determines whether or not an AI system is safe), that alone would qualify. It would be fine if it sometimes said things are unsafe when they are actually safe, as long as it isn’t too conservative; a weak feasibility result would help show that it isn’t too conservative.
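To gesture at what I mean by “write code that determines whether or not an AI system is safe”, here is the kind of interface I have in mind; everything in it (the names, the `assumptions` object and its method) is hypothetical, purely to illustrate the one-sided error I’d be fine with:

```python
from enum import Enum

class Verdict(Enum):
    SAFE = "safe"                   # the assumptions entail safety
    NOT_VERIFIED = "not verified"   # could not show safety; possibly a false alarm

def check_safety(system_model, assumptions) -> Verdict:
    """Hypothetical conservative checker.

    It may answer NOT_VERIFIED for systems that are in fact safe, but it must
    never answer SAFE for an unsafe one. A weak feasibility result is what rules
    out the useless checker that answers NOT_VERIFIED for everything.
    """
    if assumptions.entail_safety_of(system_model):  # hypothetical interface
        return Verdict.SAFE
    return Verdict.NOT_VERIFIED
```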
I can quite easily imagine how “human thinking for a day is safe” can be a mathematical assumption.
Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then “human thinking for a day” is not something you can express.
Okay. What’s the argument that the risk is great (I assume this means “very bad” and not “very likely” since by hypothesis it is unlikely), or that we need a lot of time to solve it?
The reasons the risk is great are standard arguments, so I am a little confused about why you ask this. The setup effectively allows a superintelligent malicious agent (Beta) access to our universe, which can result in extreme optimization of our universe towards inhuman values and tremendous loss of value-according-to-humans. The reason we need a lot of time to solve it is simply that (i) it doesn’t seem to be an instance of some standard problem type which we have standard tools to solve and (ii) some people have been thinking about these questions for a while now and did not come up with an easy solution.
It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.
Yup, I agree (with the caveat that it doesn’t have to be a human’s interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.
Then, I don’t understand why you believe that work on anything other than intent-alignment is much less urgent?
The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn’t think of it.
“Resonance” is not something you need to explicitly include in your model, it is just a consequence of the equations of motion for an oscillator. This is actually an important lesson about why we need theory: to construct a useful theoretical model you don’t need to know all possible failure modes, you only need a reasonable set of assumptions.
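For instance, model a single vibrational mode of the bridge as a damped driven oscillator (a standard textbook model, not anything specific to a particular bridge):

$$
m\ddot{x} + c\dot{x} + kx = F_0\cos(\omega t), \qquad |x_{\text{steady}}(\omega)| = \frac{F_0}{\sqrt{(k - m\omega^2)^2 + (c\omega)^2}}.
$$

The response peaks near $\omega \approx \sqrt{k/m}$ and grows without bound there as the damping $c \to 0$: a small but well-timed periodic force produces a large displacement, even though “resonance” never appears as an explicit assumption anywhere in the model.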
I think in practice our confidence in safety often comes from empirical tests.
I think that in practice our confidence in safety comes from a combination of theory and empirical tests. And, the higher the stakes and the more unusual the endeavor, the more theory you need. If you’re doing something low stakes or something very similar to things that have been tried many times before, you can rely on trial and error. But if you’re sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. Yes, you will test the modules on Earth in conditions as similar to the real environment as you can (respectively, you will do experiments with narrow AI). But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.
I can quite easily imagine how “human thinking for a day is safe” can be a mathematical assumption.
Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then “human thinking for a day” is not something you can express.
I disagree. For example, suppose that we have a theorem saying that an ANN with a particular architecture and learning algorithm can learn any function inside some space F with given accuracy. And suppose that “human thinking for a day” is represented by a mathematical function that we assume to be inside F and that we assume to be “safe” in some formal sense (for example, it computes an action that doesn’t lose much long-term value). Then, in this model you can prove that imitation learning applied to human thinking for a day is safe. Of course, this example is trivial (modulo the theorem about ANNs), but for more complex settings we can get results that are non-trivial.
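Spelled out, the structure of this toy argument is roughly the following (the symbols $F$, $h$, $\epsilon$, $\delta$, $\Delta$ are placeholders for whatever the actual theorem and assumptions supply):

$$
\begin{aligned}
&\text{(Learnability)} && \forall f \in F:\ \Pr\big[d(\hat{f}, f) \le \epsilon\big] \ge 1 - \delta \text{ for the trained network } \hat{f},\\
&\text{(Bridging)} && h = \text{“human thinking for a day”} \in F,\\
&\text{(Safety)} && \text{every policy within } \epsilon \text{ of } h \text{ loses at most } \Delta \text{ long-term value},\\
&\text{(Conclusion)} && \text{with probability at least } 1 - \delta, \text{ imitating } h \text{ loses at most } \Delta.
\end{aligned}
$$

(The safety assumption has to cover an $\epsilon$-neighborhood of $h$, not just $h$ itself; that robustness requirement is an extra step the informal version glosses over.)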
The reasons the risk are great are standard arguments, so I am a little confused why you ask about this.
Sorry, I meant: what are the reasons that this risk is greater than the risk from a failure of intent alignment? The question was meant relative to the counterfactual of working on intent alignment, since the underlying disagreement is about comparing work on intent alignment to other AI safety work. Similarly for the question about why it might take a long time to solve.
Then, I don’t understand why you believe that work on anything other than intent-alignment is much less urgent?
I’m claiming that intent alignment captures a large proportion of possible failure modes that seem particularly amenable to a solution.
Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:
1. Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.
In this situation, machine A is a much better plan.
(The example is meant to illustrate the phenomenon by which you might want to choose a riskier but easier-to-create option; it’s not meant to properly model intent alignment vs. other stuff on other axes.)
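To make the numbers concrete (reading “by default you see nothing” as “you have to guess”, so plan B succeeds with probability 0.5·1 + 0.5·0.5 = 0.75), here is a quick Monte Carlo sketch:

```python
import random

def plan_a_success(trials=200_000, flips=21, per_flip_error=0.01):
    """Estimate P(machine A's reported majority matches the true majority)."""
    wins = 0
    for _ in range(trials):
        actual = [random.random() < 0.5 for _ in range(flips)]              # True = heads
        seen = [a != (random.random() < per_flip_error) for a in actual]    # 1% misreads
        wins += (sum(actual) > flips // 2) == (sum(seen) > flips // 2)
    return wins / trials

plan_b_success = 0.5 * 1.0 + 0.5 * 0.5  # perfect machine half the time, a guess otherwise

print(plan_a_success())  # comes out around 0.97
print(plan_b_success)    # 0.75
```

So in this toy setup the noisy-but-buildable machine wins by a wide margin, which is the phenomenon the example is pointing at.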
This is actually an important lesson about why we need theory: to construct a useful theoretical model you don’t need to know all possible failure modes, you only need a reasonable set of assumptions.
I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t. resonance, because in fact they are not safe and do fall when resonance occurs. (Maybe today bridge-building technology has advanced such that we are able to do such proofs, I don’t know, but at least in the past that would not have been the case.)
In this case, we either fail to prove anything, or we make unrealistic assumptions that do not hold in reality and get a proof of safety. Similarly, I think in many cases involving properties about a complex real environment, your two options are 1. don’t prove things or 2. prove things with unrealistic assumptions that don’t hold.
But if you’re sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. [...] Without theory you cannot extrapolate.
I am not suggesting that we throw away all logic and make random edits to lines of code and try them out until we find a safe AI. I am simply saying that our things-that-allow-us-to-extrapolate need not be expressed in math with theorems. I don’t build mathematical theories of how to write code, and usually don’t prove my code correct; nonetheless I seem to extrapolate quite well to new coding problems.
It also sounds like you’re making a normative claim for proofs; I’m more interested in the empirical claim. (But I might be misreading you here.)
I disagree. For example, [...]
Certainly you can come up with bridging assumptions between levels of abstraction (in this case the assumption that “human thinking for a day” is within F). I would expect that I would find some such bridging assumption implausible in these settings.
I’m claiming that intent alignment captures a large proportion of possible failure modes that seem particularly amenable to a solution.
Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:
Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.
In this situation, machine A is a much better plan.
I am struggling to understand how it works in practice. For example, consider dialogic RL. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:
1. Basic question-answer protocol
2. Natural language annotation
3. Quantilization of questions
4. Debate over annotations
5. Dealing with no user answer
6. Dealing with inconsistent user answers
7. Dealing with changing user beliefs
8. Dealing with changing user preferences
9. Self-reference in user beliefs
10. Quantilization of computations (to combat non-Cartesian daemons; this is not in the original proposal)
11. Reverse questions
12. Translation of counterfactuals from user frame to AI frame
13. User beliefs about computations
EDIT: 14. Confidence threshold for risky actions
Which of these features are necessary for intent-alignment and which are only necessary for strong alignment? I can’t tell.
I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t. resonance, because in fact they are not safe and do fall when resonance occurs.
I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation). We want bridges that don’t fall, don’t we?
I don’t build mathematical theories of how to write code, and usually don’t prove my code correct
On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.
It also sounds like you’re making a normative claim for proofs; I’m more interested in the empirical claim.
I am making both claims to some degree. I can imagine a universe in which the empirical claim is true, and I consider it plausible (but far from certain) that we live in such a universe. But, even just understanding whether we live in such a universe requires building a mathematical theory.
Which of these features are necessary for intent-alignment and which are only necessary for strong alignment?
As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is “doesn’t manipulate the user” or something like that.) I’m not sure what 9, 11 and 13 are about. For the others, I’d say they’re all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.
I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).
This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven’t yet seen a bridge fall down from resonance, and so you don’t think about it.
On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.
Maybe I’m falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.
Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a “mathematical theory of how to write code”.
But, even just understanding whether we live in such a universe requires building a mathematical theory.
I’m curious what you think doesn’t require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that’s how I interpret the rocket alignment and security mindset posts.)
As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is “doesn’t manipulate the user” or something like that.) I’m not sure what 9, 11 and 13 are about. For the others, I’d say they’re all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.
Hmm. I appreciate the effort, but I don’t understand this answer. Maybe discussing this point further is not productive in this format.
I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).
This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven’t yet seen a bridge fall down from resonance, and so you don’t think about it.
Yes, and in that perspective, the mathematical model can tell me about resonance. It’s actually incredibly easy: resonance appears already in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about the magnitudes and (spatial and temporal) spectrum of the external forces, it automatically guarantees that resonance will not crash the bridge (as long as the assumptions are realistic). Obviously people have not been so cautious over history, but that doesn’t mean we should be equally careless about AGI.
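A toy numerical version of the same point (a single damped oscillator, not a bridge model): drive it with a small force at its natural frequency and then well away from it, and the responses differ by orders of magnitude even though the force never exceeds the same small bound.

```python
import math

def max_amplitude(drive_freq, m=1.0, k=1.0, c=0.02, F0=0.01, t_max=500.0, dt=0.01):
    """Integrate m*x'' + c*x' + k*x = F0*cos(w*t) with semi-implicit Euler."""
    x, v, peak = 0.0, 0.0, 0.0
    for i in range(int(t_max / dt)):
        t = i * dt
        a = (F0 * math.cos(drive_freq * t) - c * v - k * x) / m
        v += a * dt
        x += v * dt
        peak = max(peak, abs(x))
    return peak

w0 = math.sqrt(1.0 / 1.0)        # natural frequency sqrt(k/m)
print(max_amplitude(w0))         # ~0.5, matching the analytic F0/(c*w0) at resonance
print(max_amplitude(3.0 * w0))   # orders of magnitude smaller far from resonance
```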
I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of unrealistic mathematical models, empirical data and informal reasoning. But I don’t understand why we should give up so soon. We can work towards realistic mathematical models and prepare fallbacks, and even if we don’t arrive at a realistic mathematical model, it is likely that the effort will produce valuable insights.
Maybe I’m falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.
Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a “mathematical theory of how to write code”.
First, if I am asked to check whether an element is in an array, or some other easy manipulation of data structures, I obviously don’t literally start proving a theorem with pencil and paper. However, my not-fully-formal reasoning is such that I could prove a theorem if I wanted to. My model is not exactly “intuitive”: I could explicitly explain every step. And, this is exactly how all of mathematics works! Mathematicians don’t write proofs that are machine verifiable (some people do that today, but it’s a novel and tiny fraction of mathematics). They write proofs that are good enough so that all the informal steps can be easily made formal by anyone with reasonable background in the field (but actually doing that would be very labor intensive).
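For the array example, the “could prove a theorem if I wanted to” reasoning is just the usual loop invariant, written out here as comments in a toy sketch:

```python
def contains(xs, target):
    """Return True iff target occurs in the list xs."""
    for i in range(len(xs)):
        # Invariant: target does not occur in xs[:i]
        if xs[i] == target:
            return True
    # At exit the invariant gives: target does not occur in xs[:len(xs)], i.e. anywhere
    return False

assert contains([3, 1, 4], 4)
assert not contains([3, 1, 4], 5)
```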
Second, what I actually meant is examples like: I use an algorithm to solve a system of linear equations, or find a maximum matching in a graph, or find the rotation matrix that minimizes the sum of squared distances between two point sets, because I have a proof that the algorithm works (or, in some cases, a proof that it at least produces the right answer when it converges). Moreover, this applies to problems that explicitly involve the physical world as well, such as Kalman filters or control loops.
Of course, in the latter case we need to make some assumptions about the physical world in order to prove anything. It’s true that in applications the assumptions are often false, and we merely hope that they are good enough approximations. But, when the extra effort is justified, we can do better: we can perform a mathematical analysis of how much the violation of these assumptions affects the result. Then, we can use outside knowledge to verify that the violations are within the permissible margin.
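As a concrete instance of the second kind of example, take the rotation minimizing the sum of squared distances: the standard proof-backed method is the SVD-based Kabsch algorithm, sketched below. The reason to trust it is a theorem about the SVD, not trial and error (the test at the end is just a sanity check, not the source of confidence).

```python
import numpy as np

def best_rotation(P, Q):
    """Rotation R (3x3, det +1) minimizing sum_i ||R @ P[i] - Q[i]||^2 (Kabsch)."""
    H = P.T @ Q                                # cross-covariance of the centered point sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # fix a possible reflection so det(R) = +1
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Sanity check: recover a known rotation from noiseless correspondences.
rng = np.random.default_rng(0)
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true *= np.sign(np.linalg.det(R_true))       # make it a proper rotation
P = rng.normal(size=(20, 3))
P -= P.mean(axis=0)                            # the no-translation version assumes centered points
Q = P @ R_true.T
assert np.allclose(best_rotation(P, Q), R_true)
```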
Third, we could also literally prove machine-verifiable theorems about the code. This is called formal verification, and people do that sometimes when the stakes are high (as they definitely are with AGI), although in this case I have no personal experience. But, this is just a “side benefit” of what I was talking about. We need the mathematical theory to know that our algorithms are safe. Formal verification “merely” tells us that the implementation doesn’t have bugs (which is something we should definitely worry about too, when it becomes relevant).
I’m curious what you think doesn’t require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that’s how I interpret the rocket alignment and security mindset posts.)
I’m not sure about the scope of your question? I made a sandwich this morning without building mathematical theory :) I think that the AI safety community definitely produced some important arguments about AI risk, and these arguments are valid evidence. But, I consider most of the big questions to be far from settled, and I don’t see how they could be settled only with this kind of reasoning.
You made a claim a few comments above:
But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.
I’m struggling to understand what you mean by “theory” here, and the programming example was trying to get at that, but not very successfully. So let’s take the sandwich example:
I made a sandwich this morning without building mathematical theory :)
Presumably the ingredients were in a slightly different configuration than you had ever seen them before, but you were still able to “extrapolate” to figure out how to make a sandwich anyway. Why didn’t you need theory for that extrapolation?
Obviously this is a silly example, but I don’t currently see any qualitative difference between sandwich-making extrapolation and the sort of extrapolation we do when we make qualitative arguments about AI risk. Why trust the former but not the latter? One answer is that the latter is more complex, but you seem to be arguing something else.
I decided that the answer deserves its own post.