Rohin Shah comments on Clarifying “AI Alignment”

Rohin Shah 19 Jan 2020 0:29 UTC
LW: 5 AF: 4
AF
This opens the possibility of agents that with “well intentioned” mistakes that take the form of sophisticated plans that are catastrophic for the user.
Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.
In the above scenario, is Alpha “motivation-aligned”
If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn’t corrigible, and it would be rather hard to argue that it is intent aligned.
But, such a concept would depend in complicated ways on the agent’s internals.
That, or it could depend on the agent’s counterfactual behavior in other situations. I agree it can’t be just the action chosen in the particular state.
Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).
I guess you wouldn’t count universality. Overall I agree. I’m relatively pessimistic about mathematical formalization. (Probably not worth debating this point; feels like people have talked about it at length in Realism about rationality without making much progress.)
it refers to the actual things that agent does, and the ways in which these things might have catastrophic consequences.
I do want to note that all of these require you to make assumptions of the form, “if there are traps, either the user or the agent already knows about them” and so on, in order to avoid no-free-lunch theorems.
- Vanessa Kosoy 19 Jan 2020 14:18 UTC
  LW: 12 AF: 4
  AF Parent
  
  This opens the possibility of agents that with “well intentioned” mistakes that take the form of sophisticated plans that are catastrophic for the user.
  
  Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.
  
  The acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent-alignment is insufficient and you need “corrigibility” as another condition (also it is not so clear to me what this condition means).
  
  If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn’t corrigible, and it would be rather hard to argue that it is intent aligned.
  
  It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.
  
  Now, I do believe that if you set up the prior correctly then it won’t happen, thanks to a mechanism like: Alpha knows that in case of dangerous uncertainty it is safe to fall back on some “neutral” course of action plus query the user (in specific, safe, ways). But this exactly shows that intent-alignment is not enough and you need further assumptions.
  
  Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).
  
  I guess you wouldn’t count universality. Overall I agree.
  
  Besides the fact ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I’m missing something.
  
  I’m relatively pessimistic about mathematical formalization.
  
  I am curious whether you can specify, as concretely as possible, what type of mathematical result would you have to see in order to significantly update away from this opinion.
  
  I do want to note that all of these require you to make assumptions of the form, “if there are traps, either the user or the agent already knows about them” and so on, in order to avoid no-free-lunch theorems.
  
  No, I make no such assumption. A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user’s subjective perspective. It is neither needed nor possible to prove that the AI can never enter a trap. For example, the AI is immune to acausal attacks to the extent that the user beliefs that the AI is not inside Beta’s simulation. On the other hand, if the user beliefs that the simulation hypothesis needs to be taken into account, then the scenario amounts to legitimate acausal bargaining (which has its own complications to do with decision/game theory, but that’s mostly a separate concern).
  - Rohin Shah 19 Jan 2020 21:53 UTC
    LW: 3 AF: 3
    AF Parent
    A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user’s subjective perspective.
    Sorry, that’s right. Fwiw, I do think subjective regret bounds are significantly better than the thing I meant by definition-optimization.
    It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.
    Why doesn’t this also apply to subjective regret bounds?
    My guess at your answer is that Alpha wouldn’t take the irreversible action as long as the user believes that Alpha is not in Beta-simulation-world. I would amend that to say that Alpha has to know that [the user doesn’t believe that Alpha is in Beta-simulation-world]. But if Alpha knows that, then surely Alpha can predict that the user would not confirm the irreversible action?
    It seems like for subjective regret bounds, avoiding this scenario depends on your prior already “knowing” that the user thinks that Alpha is not in Beta-simulation-world (perhaps by excluding Beta-simulations). If that’s true, you could do the same thing with intent alignment / corrigibility.
    Besides the fact ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I’m missing something.
    It isn’t equivalent to intent alignment; but it is meant to be used as part of an argument for safety, though I guess it could be used in definition-optimization too, so never mind.
    I am curious whether you can specify, as concretely as possible, what type of mathematical result would you have to see in order to significantly update away from this opinion.
    That is hard to say. I would want to have the reaction “oh, if I built that system, I expect it to be safe and competitive”. Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.
    I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured “safety” without “passing the buck”. I think subjective regret bounds and CIRL both make some progress on this, but somewhat “pass the buck” by requiring a well-specified hypothesis space for rewards / beliefs / observation models.
    Tbc, I also don’t think intent alignment will lead to a mathematical formalization I’m happy with—it “passes the buck” to the problem of defining what “trying” is, or what “corrigibility” is.
    - Vanessa Kosoy 21 Jan 2020 19:11 UTC
      LW: 5 AF: 3
      AF Parent
      
      It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.
      
      Why doesn’t this also apply to subjective regret bounds?
      
      In order to get a subjective regret bound you need to consider an appropriate prior. The way I expect it to work is, the prior guarantees that some actions are safe in the short-term: for example, doing nothing to the environment and asking only sufficiently quantilized queries from the user (see this for one toy model of how “safe in the short-term” can be formalized). Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.
      
      Now, you can say “with the right prior intent-alignment also works”. To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work. Indeed, we can imagine that the ontology on which the prior is defined includes a “true reward” symbol s.t., by definition, the semantics is whatever the user truly wants. An agent that maximizes expected true reward then can be said to be intent-aligned. If it’s doing something bad from the user’s perspective, then it is just an “innocent” mistake. But, unless we bake some specific assumptions about the true reward into the prior, such an agent can be anything at all.
      
      Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.
      
      This is related to what I call the distinction between “weak” and “strong feasibility”. Weak feasibility means algorithms that are polynomial time in the number of states and actions, or the number of hypotheses. Strong feasibility is supposed to be something like, polynomial time in the description length of the hypothesis.
      
      It is true that currently we only have strong feasibility results for relatively simple hypothesis spaces (such as, support vector machines). But, this seems to me just a symptom of advances in heuristics outpacing the theory. I don’t see any reason of principle that significantly limits the strong feasibility results we can expect. Indeed, we already have some advances in providing a theoretical basis for deep learning.
      
      However, I specifically don’t want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability. Instead, I prefer studying safety on the weak feasibility level until we understood everything important on this level, and only then trying to extend it to strong feasibility. This creates somewhat of a conundrum where apparently the one thing that can convince you (and other people?) is the thing I don’t think should be done soon.
      
      I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured “safety” without “passing the buck”. I think subjective regret bounds and CIRL both make some progress on this, but somewhat “pass the buck” by requiring a well-specified hypothesis space for rewards / beliefs / observation models.
      
      Can you explain what you mean here? I agree that just saying “subjective regret bound” is not enough, we need to understand all the assumptions the prior should satisfy, reflecting considerations such as, what kind of queries can or cannot manipulate the user. Hence the use of quantilization and debate in Dialogic RL, for example.
      - Rohin Shah 21 Jan 2020 21:48 UTC
        LW: 3 AF: 3
        AF Parent
        To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.
        I completely agree with this, but isn’t this also true of subjective regret bounds / definition-optimization? Like, when you write (emphasis mine)
        Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.
        Isn’t the assumption about the prior “doing all the work”?
        Maybe your point is that there are failure modes that aren’t covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places. Just picking one sentence (emphasis mine):
        An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
        
        I don’t see any reason of principle that significantly limits the strong feasibility results we can expect.
        And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world.
        That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world. The abstract argument is an attempt to explain it; but I wouldn’t have much faith in the abstract argument by itself (which is trying to quantify over all possible ways of getting a strong feasibility result).
        However, I specifically don’t want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability.
        Idk, you could have a nondisclosure-by-default policy if you were worried about this. Maybe this can’t work for you though. (As an aside, I hope this is what MIRI is doing, but they probably aren’t.)
        Can you explain what you mean here?
        Basically what you said right after:
        I agree that just saying “subjective regret bound” is not enough, we need to understand all the assumptions the prior should satisfy, reflecting considerations such as, what kind of queries can or cannot manipulate the user.
        Vanessa Kosoy 24 Jan 2020 16:26 UTC
        LW: 5 AF: 3
        AF Parent
        
        ...first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.
        
        I completely agree with this, but isn’t this also true of subjective regret bounds / definition-optimization?
        
        The idea is, we will solve the alignment problem by (i) formulating a suitable learning protocol (ii) formalizing a set of assumptions about reality and (iii) proving that under these assumptions, this learning protocol has a reasonable subjective regret bound. So, the role of the subjective regret bound is making sure that the what we came up with in i+ii is sufficient, and also guiding the search there. The subjective regret bound does not tell us whether particular assumptions are realistic: for this we need to use common sense and knowledge outside of theoretical computer science (such as: physics, cognitive science, experimental ML research, evolutionary biology...)
        
        Maybe your point is that there are failure modes that aren’t covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places.
        
        I disagree with the OP that (emphasis mine):
        
        I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.
        
        I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is not sufficient to address the urgent core of the problem.
        
        And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world.
        
        I don’t think strong feasibility results will have to talk about the environment, or rather, they will have to talk about it on a very high level of abstraction. For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural “smoothness” condition (an example motivated by already known results). This is a strong feasibility result. We can then debate whether an using such a smooth approximation is sufficient for superhuman performance, but establishing this requires different tools, like I said above.
        
        The way I imagine it, AGI theory should ultimately arrive at some class of priors that are on the one hand rich enough to deserve to be called “general” (or, practically speaking, rich enough to produce superhuman agents) and on the other hand narrow enough to allow for efficient algorithms. For example the Solomonoff prior is too rich, whereas a prior that (say) describes everything in terms of an MDP with a small number of states is too narrow. Finding the golden path in between is one of the big open problems.
        
        That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world.
        
        Does it? I am not sure why you have this impression. Certainly there are phenomena in the real world that we don’t yet have enough theory to understand, and certainly a given theory will fail in domains where its assumptions are not justified (where “fail” and “justified” can be a manner of degree). And yet, theory obviously played and plays a central role in science, so I don’t understand whence the fatalism.
        
        However, I specifically don’t want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability.
        
        Idk, you could have a nondisclosure-by-default policy if you were worried about this. Maybe this can’t work for you though.
        
        That seems like it would be an extremely not cost-effective way of making progress. I would invest a lot of time and effort into something that would only be disclosed to the select few, for the sole purpose of convincing them of something (assuming they are even interested to understand it). I imagine that solving AI risk will require collaboration among many people, including sharing ideas and building on other people’s ideas, and that’s not realistic without publishing. Certainly I am not going to write a Friendly AI on my home laptop :)
        
        Rohin Shah 24 Jan 2020 19:03 UTC
        LW: 10 AF: 7
        AF Parent
        I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is not sufficient to address the urgent core of the problem.
        Okay, so there seem to be two disagreements:
        How bad is it that intent alignment is ill-defined
        Is work on intent alignment urgent
        The first one seems primarily about our disagreements on the utility of theory, which I’ll get to later.
        For the second one, I don’t know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don’t see the argument that it is very likely (maybe the first few AGIs don’t think about simulations; maybe it’s impossible to construct such a convincing hypothesis). I especially don’t see the argument that it is more likely than the failure mode in which a goal-directed AGI is optimizing for something different from what humans want.
        (You might respond that intent alignment brings risk down from say 10% to 3%, whereas your agenda brings risk down from 10% to 1%. My response would be that once we have successfully figured out intent alignment to bring risk from 10% to 3%, we can then focus on building a good prior to bring the risk down from 3% to 1%. All numbers here are very made up.)
        For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural “smoothness” condition (an example motivated by already known results). This is a strong feasibility result.
        My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive) or the simple and natural condition won’t hold for the vast majority of cases that neural networks have been applied to today.
        I don’t find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).
        I am not sure why you have this impression.
        Consider this example:
        You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can’t be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.
        The broader point is that when the environment has lots of complicated interaction effects, and you must make assumptions, it is very hard to find assumptions that actually hold.
        And yet, theory obviously played and plays a central role in science, so I don’t understand whence the fatalism.
        The areas of science in which theory is most central (e.g. physics) don’t require assumptions about some complicated stuff; they simply aim to describe observations. It’s really the assumptions that make me pessimistic, which is why it would be a significant update if I saw:
        a mathematical definition of safety that I thought actually captured “safety” without “passing the buck”
        It would similarly update me if you had a piece of code that (perhaps with arbitrary amounts of compute) could take in an AI system and output “safe” or “unsafe”, and I would trust that output. (I’d expect that a mathematical definition could be turned into such a piece of code if it doesn’t “pass the buck”.)
        You might respond that intent alignment requires assumptions too, which I agree with, but the theory-based approach requires you to limit your assumptions to things that can be written down in math (e.g. this function is K-Lipschitz) whereas a non-theory-based approach can use “handwavy” assumptions (e.g. a human thinking for a day is safe), which drastically opens up the space of options and makes it more likely that you can find an assumption that is actually mostly true.
        That seems like it would be an extremely not cost-effective way of making progress.
        Yeah, I broadly agree; I mostly don’t understand MIRI’s position and thought you might share it, but it seems you don’t. I agree that overall it’s a tough problem. My personal position would be to do it publicly anyway; it seems way better to have an approach to AI that we understand than the current approach, even if it shortens timelines. (Consider the unilateralist curse; but also consider that other people do agree with me, if not the people at MIRI / LessWrong.)
        Vanessa Kosoy 25 Jan 2020 19:05 UTC
        LW: 10 AF: 4
        AF Parent
        
        For the second one, I don’t know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don’t see the argument that it is very likely.
        
        First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk. Second, I think this example shows that is far from straightforward to even informally define what intent-alignment is. Hence, I am skeptical about the usefulness of intent-alignment.
        
        For a more “mundane” example, take IRL. Is IRL intent aligned? What if its assumptions about human behavior are inadequate and it ends up inferring an entirely wrong reward function? Is it still intent-aligned since it is trying to do what the user wants, it is just wrong about what the user wants? Where is the line between “being wrong about what the user wants” and optimizing something completely unrelated to what the user wants?
        
        It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.
        
        For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural “smoothness” condition (an example motivated by already known results). This is a strong feasibility result.
        
        My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive) or the simple and natural condition won’t hold for the vast majority of cases that neural networks have been applied to today.
        
        I don’t know why you think so, but at least this is a good crux since it seems entirely falsifiable. In an any case, exponential sample complexity definitely doesn’t count as “strong feasibility”.
        
        I don’t find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).
        
        Smoothness is just an example, it is not necessarily the final answer. But also, in classification problems smoothness usually translates to a margin requirement (the classes have to be separated with sufficient distance). So, in some sense smoothness allows for “if conditions” as long as you’re not too sensitive to the threshold.
        
        You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can’t be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.
        
        I don’t understand this example. If the bridge can never collapse as long as the outside forces don’t exceed K, then resonance is covered as well (as long as it is produced by forces below K). Maybe you meant that the outside forces are also assumed to be stationary.
        
        The broader point is that when the environment has lots of complicated interaction effects, and you must make assumptions, it is very hard to find assumptions that actually hold.
        
        Nevertheless most engineering projects make heavy use of theory. I don’t understand why you think that AGI must be different?
        
        The issue of assumptions in strong feasibility is equivalent to the question of, whether powerful agents require highly informed priors. If you need complex assumptions then effectively you have a highly informed prior, whereas if your prior is uninformed then it corresponds to simple assumptions. I think that Hanson (for example) believes that it is indeed necessary to have a highly informed prior, which is why powerful AI algorithms will be complex (since they have to encode this prior) and progress in AI will be slow (since the prior needs to be manually constructed brick by brick). I find this scenario unlikely (for example because humans successfully solve tasks far outside the ancestral environment, so they can’t be relying on genetically built-in priors that much), but not ruled out.
        
        However, I assumed that your position is not Hansonian: correct me if I’m wrong, but I assumed that you believed deep learning or something similar is likely to lead to AGI relatively soon. Even if not, you were skeptical about strong feasibility results even for deep learning, regardless of hypothetical future AI technology. But, it doesn’t look like deep learning relies on highly informed priors. What we have is, relatively simple algorithms that can, with relatively small (or even no) adaptations solve problems in completely different domains (image processing, audio processing, NLP, playing many very different games, protein folding...) So, how is it possible that all of these domains have some highly complex property that they share, and that is somehow encoded in the deep learning algorithm?
        
        It’s really the assumptions that make me pessimistic, which is why it would be a significant update if I saw a mathematical definition of safety that I thought actually captured “safety” without “passing the buck”
        
        I’m curious whether proving a weakly feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?
        
        ...but the theory-based approach requires you to limit your assumptions to things that can be written down in math (e.g. this function is K-Lipschitz) whereas a non-theory-based approach can use “handwavy” assumptions (e.g. a human thinking for a day is safe), which drastically opens up the space of options and makes it more likely that you can find an assumption that is actually mostly true.
        
        I can quite easily imagine how “human thinking for a day is safe” can be a mathematical assumption. In general, which assumptions are formalizable depends on the ontology of your mathematical model (that is, which real-world concepts correspond to the “atomic” ingredients of your model). The choice of ontology is part of drawing the line between what you want your mathematical theory to prove and what you want to bring in as outside assumptions. Like I said before, this line definitely has to be drawn somewhere, but it doesn’t at all follow that the entire approach is useless.
        
        Rohin Shah 26 Jan 2020 21:26 UTC
        LW: 3 AF: 3
        AF Parent
        First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk.
        Okay. What’s the argument that the risk is great (I assume this means “very bad” and not “very likely” since by hypothesis it is unlikely), or that we need a lot of time to solve it?
        Second, I think this example shows that is far from straightforward to even informally define what intent-alignment is.
        I agree with this; I don’t think this is one of our cruxes. (I do think that in most cases, if we have all the information about the situation, it will be fairly clear whether something is intent aligned or not, but certainly there are situations in which it’s ambiguous. I think corrigibility is better-informally-defined, though still there will be ambiguous situations.)
        Is IRL intent aligned?
        Depends on the details, but the way you describe it, no, it isn’t. (Though I can see the fuzziness here.) I think it is especially clear that it is not corrigible.
        It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.
        Yup, I agree (with the caveat that it doesn’t have to be a human’s interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.
        If the bridge can never collapse as long as the outside forces don’t exceed K, then resonance is covered as well (as long as it is produced by forces below K).
        I meant that K was set considering wind forces, cars, etc. and was set too low to account for resonance, because you didn’t think about resonance beforehand.
        (I guess resonance doesn’t involve large forces, it involves coordinated forces. The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn’t think of it.)
        Nevertheless most engineering projects make heavy use of theory.
        I’m not denying that? I’m not arguing against theory in general; I’m arguing against theoretical safety guarantees. I think in practice our confidence in safety often comes from empirical tests.
        I’m curious whether proving a weakly feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?
        Probably? Honestly, I’m don’t think you even need to prove the subjective regret bound; if you wrote down assumptions that I agree are realistic and capture safety (such that you could write code that determines whether or not an AI system is safe) that alone would qualify. It would be fine if it sometimes said things are unsafe when they are safe, as long as it isn’t too conservative; a weak feasibility result would help show that it isn’t too conservative.
        I can quite easily imagine how “human thinking for a day is safe” can be a mathematical assumption.
        Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then “human thinking for a day” is not something you can express.
        Vanessa Kosoy 1 Feb 2020 16:02 UTC
        LW: 6 AF: 3
        AF Parent
        
        Okay. What’s the argument that the risk is great (I assume this means “very bad” and not “very likely” since by hypothesis it is unlikely), or that we need a lot of time to solve it?
        
        The reasons the risk are great are standard arguments, so I am a little confused why you ask about this. The setup effectively allows a superintelligent malicious agent (Beta) access to our universe, which can result in extreme optimization of our universe towards inhuman values and tremendous loss of value-according-to-humans. The reason we need a lot of time to solve it is simply that (i) it doesn’t seem to be an instance of some standard problem type which we have standard tools to solve and (ii) some people have been thinking on these questions for a while by now and did not come up with an easy solution.
        
        It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.
        
        Yup, I agree (with the caveat that it doesn’t have to be a human’s interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.
        
        Then, I don’t understand why you believe that work on anything other than intent-alignment is much less urgent?
        
        The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn’t think of it.
        
        “Resonance” is not something you need to explicitly include in your model, it is just a consequence of the equations of motion for an oscillator. This is actually an important lesson about why we need theory: to construct a useful theoretical model you don’t need to know all possible failure modes, you only need a reasonable set of assumptions.
        
        I think in practice our confidence in safety often comes from empirical tests.
        
        I think that in practice our confidence in safety comes from a combination of theory and empirical tests. And, the higher the stakes and the more unusual the endeavor, the more theory you need. If you’re doing something low stakes or something very similar to things that have been tried many times before, you can rely on trial and error. But if you’re sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. Yes, you will test the modules on Earth in conditions as similar to the real environment as you can (respectively, you will do experiments with narrow AI). But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.
        
        I can quite easily imagine how “human thinking for a day is safe” can be a mathematical assumption.
        
        Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then “human thinking for a day” is not something you can express.
        
        I disagree. For example, suppose that we have a theorem saying that an ANN with particular architecture and learning algorithm can learn any function inside some space $F$ with given accuracy. And, suppose that “human thinking for a day” is represented by a mathematical function that we assume to be inside $F$ and that we assume to be “safe” in some formal sense (for example, it computes an action that doesn’t lose much long-term value). Then, your model can prove that imitation learning applied to human thinking for a day is safe. Of course, this example is trivial (modulo the theorem about ANNs), but for more complex settings we can get results that are non-trivial.
        
        Rohin Shah 2 Feb 2020 8:55 UTC
        LW: 2 AF: 2
        AF Parent
        The reasons the risk are great are standard arguments, so I am a little confused why you ask about this.
        Sorry, I meant what are the reasons that the risk greater than the risk from a failure of intent alignment? The question was meant to be compared to the counterfactual of work on intent alignment, since the underlying disagreement is about comparing work on intent alignment to other AI safety work. Similarly for the question about why it might take a long time to solve.
        Then, I don’t understand why you believe that work on anything other than intent-alignment is much less urgent?
        I’m claiming that intent alignment captures a large proportion of possible failure modes, that seem particularly amenable to a solution.
        Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:
        1. Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
        2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.
        In this situation, machine A is a much better plan.
        (The example is meant to illustrate the phenomenon by which you might want to choose a riskier but easier-to-create option; it’s not meant to properly model intent alignment vs. other stuff on other axes.)
        This is actually an important lesson about why we need theory: to construct a useful theoretical model you don’t need to know all possible failure modes, you only need a reasonable set of assumptions.
        I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t resonance, because in fact they are not safe and do fall when resonance occurs. (Maybe today bridge-building technology has advanced such that we are able to do such proofs, I don’t know, but at least in the past that would not have been the case.)
        In this case, we either fail to prove anything, or we make unrealistic assumptions that do not hold in reality and get a proof of safety. Similarly, I think in many cases involving properties about a complex real environment, your two options are 1. don’t prove things or 2. prove things with unrealistic assumptions that don’t hold.
        But if you’re sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. [...] Without theory you cannot extrapolate.
        I am not suggesting that we throw away all logic and make random edits to lines of code and try them out until we find a safe AI. I am simply saying that our things-that-allow-us-to-extrapolate need not be expressed in math with theorems. I don’t build mathematical theories of how to write code, and usually don’t prove my code correct; nonetheless I seem to extrapolate quite well to new coding problems.
        It also sounds like you’re making a normative claim for proofs; I’m more interested in the empirical claim. (But I might be misreading you here.)
        I disagree. For example, [...]
        Certainly you can come up with bridging assumptions to bridge between levels of abstraction (in this case the assumption that “human thinking for a day” is within F). I would expect that I would find some bridging assumption implausible in these settings.
        Expand this thread
        Vanessa Kosoy 7 Feb 2020 14:30 UTC
        LW: 4 AF: 2
        AF Parent
        
        I’m claiming that intent alignment captures a large proportion of possible failure modes, that seem particularly amenable to a solution.
        
        Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:
        
        Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
        
        Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.
        
        In this situation, machine A is a much better plan.
        
        I am struggling to understand how does it work in practice. For example, consider dialogic RL. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:
        
        Basic question-answer protocol
        Natural language annotation
        Quantilization of questions
        Debate over annotations
        Dealing with no user answer
        Dealing with inconsistent user answers
        Dealing with changing user beliefs
        Dealing with changing user preferences
        Self-reference in user beliefs
        Quantilization of computations (to combat non-Cartesian daemons, this is not in the original proposal)
        Reverse questions
        Translation of counterfactuals from user frame to AI frame
        User beliefs about computations
        
        EDIT: 14. Confidence threshold for risky actions
        
        Which of these features are necessary for intent-alignment and which are only necessary for strong alignment? I can’t tell.
        
        I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t resonance, because in fact they are not safe and do fall when resonance occurs.
        
        I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation). We want bridges that don’t fall, don’t we?
        
        I don’t build mathematical theories of how to write code, and usually don’t prove my code correct
        
        On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.
        
        It also sounds like you’re making a normative claim for proofs; I’m more interested in the empirical claim.
        
        I am making both claims to some degree. I can imagine a universe in which the empirical claim is true, and I consider it plausible (but far from certain) that we live in such a universe. But, even just understanding whether we live in such a universe requires building a mathematical theory.
        
        Rohin Shah 7 Feb 2020 18:22 UTC
        LW: 2 AF: 2
        AF Parent
        Which of these features are necessary for intent-alignment and which are only necessary for strong alignment?
        As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is “doesn’t manipulate the user” or something like that.) I’m not sure what 9, 11 and 13 are about. For the others, I’d say they’re all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.
        I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).
        This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven’t yet seen a bridge fall down from resonance, and so you don’t think about it.
        On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.
        Maybe I’m falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.
        Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a “mathematical theory of how to write code”.
        But, even just understanding whether we live in such a universe requires building a mathematical theory.
        I’m curious what you think doesn’t require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that’s how I interpret the rocket alignment and security mindset posts.)
        Vanessa Kosoy 8 Feb 2020 13:47 UTC
        LW: 6 AF: 3
        AF Parent
        
        As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is “doesn’t manipulate the user” or something like that.) I’m not sure what 9, 11 and 13 are about. For the others, I’d say they’re all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.
        
        Hmm. I appreciate the effort, but I don’t understand this answer. Maybe discussing this point further is not productive in this format.
        
        I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).
        
        This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven’t yet seen a bridge fall down from resonance, and so you don’t think about it.
        
        Yes, and in that perspective, the mathematical model can tell me about resonance. It’s actually incredibly easy: resonance appears already in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about external forces magnitudes and spacetime spectrum, it automatically guarantees that resonance will not crash the bridge (as long as the assumptions are realistic). Obviously people have not been so cautious over history, but that doesn’t mean we should be careless about AGI as well.
        
        I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of unrealistic mathematical models, empirical data and informal reasoning. But I don’t understand why should we give up so soon? We can work towards realistic mathematical models and prepare fallbacks, and even if we don’t arrive at a realistic mathematical model it is likely that the effort will produce valuable insights.
        
        Maybe I’m falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.
        
        Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a “mathematical theory of how to write code”.
        
        First, if I am asked to check whether an element is in an array, or some other easy manipulation of data structures, I obviously don’t literally start proving a theorem with pencil and paper. However, my not-fully-formal reasoning is such that I could prove a theorem if I wanted to. My model is not exactly “intuitive”: I could explicitly explain every step. And, this is exactly how all of mathematics works! Mathematicians don’t write proofs that are machine verifiable (some people do that today, but it’s a novel and tiny fraction of mathematics). They write proofs that are good enough so that all the informal steps can be easily made formal by anyone with reasonable background in the field (but actually doing that would be very labor intensive).
        
        Second, what I actually meant is examples like, I am using an algorithm to solve a system of linear equations, or find the maximal matching in a graph, or find a rotation matrix that minimizes the sum of square distances between two sets, because I have a proof that this algorithm works (or, in some cases, a proof that it at least produces the right answer when it converges). Moreover, this applies to problems that explicitly involve the physical world as well, such as Kalman filters or control loops.
        
        Of course, in the latter case we need to make some assumptions about the physical world in order to prove anything. It’s true that in applications the assumptions are often false, and we merely hope that they are good enough approximations. But, when the extra effort is justified, we can do better: we can perform a mathematical analysis of how much the violation of these assumptions affects the result. Then, we can use outside knowledge to verify that the violations are within the permissible margin.
        
        Third, we could also literally prove machine-verifiable theorems about the code. This is called formal verification, and people do that sometimes when the stakes are high (as they definitely are with AGI), although in this case I have no personal experience. But, this is just a “side benefit” of what I was talking about. We need the mathematical theory to know that our algorithms are safe. Formal verification “merely” tells us that the implementation doesn’t have bugs (which is something we should definitely worry about too, when it becomes relevant).
        
        I’m curious what you think doesn’t require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that’s how I interpret the rocket alignment and security mindset posts.)
        
        I’m not sure about the scope of your question? I made a sandwich this morning without building mathematical theory :) I think that the AI safety community definitely produced some important arguments about AI risk, and these arguments are valid evidence. But, I consider most of the big questions to be far from settled, and I don’t see how they could be settled only with this kind of reasoning.
        
        Rohin Shah 9 Feb 2020 1:50 UTC
        LW: 2 AF: 2
        AF Parent
        You made a claim a few comments above:
        But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.
        I’m struggling to understand what you mean by “theory” here, and the programming example was trying to get at that, but not very successfully. So let’s take the sandwich example:
        I made a sandwich this morning without building mathematical theory :)
        Presumably the ingredients were in a slightly different configuration than you had ever seen them before, but you were still able to “extrapolate” to figure out how to make a sandwich anyway. Why didn’t you need theory for that extrapolation?
        Obviously this is a silly example, but I don’t currently see any qualitative difference between sandwich-making-extrapolation, and the sort of extrapolation we do when we make qualitative arguments about AI risk. Why trust the former but not the latter? One is answer is that the latter is more complex, but you seem to be arguing something else.
        Vanessa Kosoy 14 Feb 2020 18:58 UTC
        LW: 2 AF: 1
        AF Parent
        I decided that the answer deserves its own post.