I’m not convinced by the argument that AI science systems are necessarily dangerous.
It’s generically* the case that any AI that is trying to achieve some real-world future effect is dangerous. In that linked post Nate Soares used chess as an example, which I objected to in a comment. An AI that is optimizing within a chess game isn’t thereby dangerous, as long as the optimization stays within the chess game. E.g., an AI might reliably choose strong chess moves, but still not show real-world Omohundro drives (e.g. not avoiding being turned off).
I think scientific research is more analogous to chess than to trying to achieve a real-world effect in this regard (even if the scientific research has real-world side effects), in that you can, in principle, optimize for reliably outputting scientific insights without the AI selecting its outputs based on their real-world effects. The outputs are selected based on properties aligned with “scientific value”, but that doesn’t necessarily require the assessment to take into account how the output will be used, or any other effect on the future of the world. (You might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though.)
Note: an AI that can “build a fusion rocket” is generically dangerous. But an AI that can design a fusion rocket, if that design is based on general principles and not tightly tuned on what will produce some exact real-world effect, is likely not dangerous.
*generically dangerous: I use this to mean that an AI with these properties is going to be dangerous unless some unlikely-by-default (and possibly very difficult) safety precautions are taken.
Thanks for the comment :) I agree that the danger comes from AIs trying to achieve real-world future effects (note that this could include an AI wanting to run specific computations, and so taking real-world actions in order to get more compute). The difficulty is in getting an AI to only be optimizing within the safe, siloed, narrow domain (like the AI playing chess).
There are multiple reasons why I think this is extremely hard to achieve for a science-capable AI.
1. Science is usually a real-world task. It involves contact with reality: taking measurements, doing experiments, analyzing, iterating on experiments. If you are asking an AI to do this kind of (experimental) science, then you are asking it to achieve real-world outcomes. For the “fusion rocket” example, I think we don’t currently have good enough simulations to allow us to actually build a fusion rocket, so the process would require interacting with the real world in order to build a good enough simulation (I think the same likely applies to the kind of simulations required for nanotech).
I think this applies for some alignment research (the kind that involves interacting with humans and refining fuzzy concepts, and also the kind that has to work with the practicalities of running large-scale training runs etc). It applies less to math-flavored things, where maybe (maybe!) we can get the AI to only know math and be trained on optimizing math objectives.
2. Even if an AI is only trained in a limited domain (e.g. math), it can still have objectives that extend outside of this domain (and these objectives can extrapolate in unpredictable ways). As an example, if we humans discovered we were in a simulation, we could easily have goals that extend outside of the simulation (the obvious one being to make sure the simulators didn’t turn us off). Chess AIs don’t develop goals about the real world because they are too dumb.
3. Optimizing for a metric like “scientific value” inherently routes through the real world, because this metric is (I assume) coming from humans’ assessment of how good the research was. It isn’t a precisely defined mathematical object that you can feed a document into and get an objective measure out of. Instead, you give some humans a document, and then they think about it and how useful it is: How does this work with the rest of the project? How does this help the humans achieve their (real-world!) goals? Is it well written, such that the humans find it convincing? In order to do good research, the AI must be considering these questions. The question “Is this good research?” isn’t something objective, so I expect that if the AI is able to judge this, it will be thinking about the humans and the real world.

Because the human is part of the real world and is judging research based on how useful they think it will be in the real world, this makes the AI’s training signal about the real world. (Note that this doesn’t mean the AI will end up optimizing for this reward signal directly, but that doing well according to the reward signal does require conceptualizing the real world.) This especially applies to alignment research, where (apart from a few well-scoped problems) humans will be judging the research based on their subjective impressions, rather than some objective measure.
4. If the AI is trained with methods similar to today’s (a massive pretrain on a ton of data, likely a substantial fraction of the internet, then finetuning), then it will likely know a bunch of things about the real world, and it seems extremely plausible that it forms goals based on these. This can apply even if we attempt to strip out a bunch of real-world data from the training, e.g. only train on math textbooks, because a person had to write those math textbooks and so they still contain substantial information about the world (e.g. math books can use examples about the world, or make analogies to real-world things). I do agree that training only on math textbooks (likely stripped of obvious real-world references) likely makes an AI more domain-limited, but it also isn’t clear how much useful work you can get out of it.
Fair enough: a fully automated do-everything science-doer would, in order to do everything science-related, have to do real-world tasks and would thus be dangerous. That being said, I think there’s plenty of room for “doing science” (up to some reasonable level of capability) without going all the way to automating the real-world aspects: you can still have an assistant that thinks up theory for you; you just can’t have something that does the experiments as well.
Part of your comment (e.g. point 3) relates to how the AI would in practice be rewarded for achieving real-world effects, which I agree is a reason for concern. Thus, as I said, “you might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though”.
Your comment goes beyond this however, and seems to assume in some places that merely knowing or conceptualizing about the real world will lead to “forming goals” about the real world.
I actually agree that this may be the case with an AI that self-improves: if an AI with a slight tendency toward a real-world goal self-modifies, that tendency will tend to direct it to enhance its alignment to that real-world goal, whereas its tendencies not directed toward real-world goals will, in general, happily overwrite themselves.
If the AI does not self-improve however, then I do not see that as being the case.
If the AI is not being rewarded for the real-world effects, but instead being rewarded for scientific outputs that are “good” according to some criterion that does not depend on their real-world effects, then it will learn to generate outputs that are good according to that criterion. I don’t think that would, in general, lead it to select actions that would steer the world to some particular world-state. To be sure, these outputs would have effects on the real world (a design for a fusion reactor would tend to lead to a fusion reactor being constructed, for example), but if the particular outputs are not rewarded based on the real-world outcome, then they will also not tend to be selected based on the real-world outcome.
Some less relevant nitpicks of points in your comment:
> Even if an AI is only trained in a limited domain (e.g. math), it can still have objectives that extend outside of this domain
If you train an AI on some very particular math then it could have goals relating to the future of the real world. I think, however, that the math you would need to train it on to get this effect would have to be very narrow, and likely have to either be derived from real-world data, or involve the AI studying itself (which is a component of the real world after all). I don’t think this happens for generically training an AI on math.
> As an example, if we humans discovered we were in a simulation, we could easily have goals that extend outside of the simulation (the obvious one being to make sure the simulators didn’t turn us off).
True, but see above and below.
> Chess AIs don’t develop goals about the real world because they are too dumb.
If you have something trained by gradient descent solely on doing well at chess, it’s not going to consider anything outside the chess game, no matter how many parameters and how much compute it has. Any consideration of outside-of-chess factors takes resources away from chess and is selected against, so it never reaches the point of being able to subvert the training regime.
Even if you argue that, once it’s smart enough, additional computing power is neutral, gradient descent doesn’t actually reward out-of-context thinking for chess, so such thinking couldn’t develop except by sheer chance or as a side effect of thinking about chess itself. But chess is a mathematically “closed” domain, so there doesn’t seem to be any reason out-of-context thinking would be developed.
The same applies to math in general where the math doesn’t deal with the real world or the AI itself. This is a more narrow and more straightforward case than scientific research in general.
I think you and Peter might be talking past each other a little, so I want to make sure I properly understand what you are saying. I’ve read your comments here and on Nate’s post, and I want to start a new thread to clarify things.
I’m not sure exactly what analogy you are making between chess AI and science AI. Which properties of a chess AI do you think are analogous to a scientific-research-AI?
- The constraints are very easy to specify (because legal moves can be easily locally evaluated). In other words, the set of paths considered by the AI is easy to define, and optimization can be constrained to only search this space (see the sketch below).
- The task of playing chess doesn’t at all require or benefit from modelling any other part of the world except for the simple board state.
I think these are the main two reasons why current chess AIs are safe.
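To make the first property concrete, here is a minimal toy sketch, assuming the third-party python-chess library (the `evaluate` and `best_move` helpers are hypothetical illustrations, not a real engine). The candidate set is exactly `board.legal_moves`, and the evaluation reads nothing but the board state, so the optimization never ranges over anything outside the game:

```python
import chess  # python-chess: board representation and legal-move generation

def evaluate(board: chess.Board) -> int:
    """Toy material count from White's perspective; its only input is the board state."""
    values = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9}
    score = 0
    for piece_type, value in values.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

def best_move(board: chess.Board) -> chess.Move:
    """Toy one-ply greedy search for White: every candidate comes from
    board.legal_moves, so the search space is defined entirely within the game."""
    candidates = list(board.legal_moves)

    def score(move: chess.Move) -> int:
        board.push(move)      # try the move...
        s = evaluate(board)   # ...evaluate the resulting position...
        board.pop()           # ...and restore the board
        return s

    return max(candidates, key=score)

board = chess.Board()
print(best_move(board))  # prints a legal opening move in UCI notation
```

However deep you make such a search, nothing in it gives the system a reason, or the information, to model anything beyond the board.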
Separately, I’m not sure exactly what you mean when you’re saying “scientific value”. To me, the value of knowledge seems to depend on the possible uses of that knowledge. So if an AI is evaluating “scientific value”, it must be considering the uses of the knowledge? But you seem to be referring to some more specific and restricted version of this evaluation, which doesn’t make reference at all to the possible uses of the knowledge? In that case, can you say more about how this might work?

Or maybe you’re saying that evaluating hypothetical uses of knowledge can be safe? I.e. there’s a kind of goal that wants to create “hypothetically useful” fusion-rocket-designs, but doesn’t want this knowledge to have any particular effect on the real future.
You might be reading us as saying that “AI science systems are necessarily dangerous” in the sense that it’s logically impossible to have an AI science system that isn’t also dangerous? We aren’t saying this. We agree that in principle such a system could be built.
While some disagreement might be about relatively mundane issues, I think there’s some more fundamental disagreement about agency as well.
In my view, in order to be dangerous in a particularly direct way (instead of just misuse risk etc.), an AI’s decision to give output X depends on the fact that output X has some specific effects in the future.
Whereas, if you train it on a problem where solutions don’t need to depend on the effects of the outputs on the future, I think it’s much more likely to learn to find the solution without routing that through the future, because that’s simpler.
So if you train an AI to give solutions to scientific problems, I don’t think, in general, that that needs to depend on the future, so I think it’s likely to learn the direct relationships between the data and the solutions. I.e. it’s not merely a logical possibility to make it not especially dangerous; that’s the default outcome if you give it problems that don’t need to depend on specific effects of the output.
Now, if you were instead to give it a problem that had to depend on the effects of the output on the future, then it would be dangerous...but note that e.g. chess, even though it maps onto a game played in the real world in the future, can also be understood in abstract terms so you don’t actually need to deal with anything outside the chess game itself.
In general, I just think that predicting the future of the world and choosing specific outputs based on their effects on the real world is a complicated way to solve problems, and I expect things to take shortcuts when possible.
Once something does care about the future, then it will have various instrumental goals about the future, but the initial step about actually caring about the future is very much not trivial in my view!
> In my view, in order to be dangerous in a particularly direct way (instead of just misuse risk etc.), an AI’s decision to give output X depends on the fact that output X has some specific effects in the future.
Agreed.
> Whereas, if you train it on a problem where solutions don’t need to depend on the effects of the outputs on the future, I think it’s much more likely to learn to find the solution without routing that through the future, because that’s simpler.
The “problem where solutions don’t need to depend on effects” is where we disagree. I agree such problems exist (e.g. formal proof search), but those aren’t the kind of useful tasks we’re talking about in the post. For actual concrete scientific problems, like outputting designs for a fusion rocket, the “simplest” approach is to consider the consequences of those outputs on the world. Otherwise, how would it internally define “good fusion rocket design that works when built”? How would it know not to use a design that fails because of weaknesses in the metal that will be manufactured into a particular shape for your rocket? A solution to building a rocket is defined by its effects on the future (not all of its effects, just some of them, e.g. that it doesn’t explode, among many others).
I think there’s a (kind of) loophole here, where we use an “abstract hypothetical” model of a hypothetical future, and optimize for the consequences of our actions in that hypothetical. Is this what you mean by “understood in abstract terms”? So the AI has defined “good fusion rocket design” as “fusion rocket that is built by not-real hypothetical humans based on my design, functions in a not-real hypothetical universe, and has properties and consequences XYZ” (but the hypothetical universe isn’t the actual future; it’s just similar enough to define this one task, but dissimilar enough that misaligned goals in this hypothetical world don’t lead to coherent misaligned real-world actions). Is this what you mean? Rereading your comment, I think this matches what you’re saying, especially the chess game part.
The part I don’t understand is why you’re saying that this is “simpler”? It seems equally complex in Kolmogorov complexity and computational complexity.
> I think there’s a (kind of) loophole here, where we use an “abstract hypothetical” model of a hypothetical future, and optimize for the consequences of our actions in that hypothetical. Is this what you mean by “understood in abstract terms”?
More or less, yes (in the case of engineering problems specifically, which I think is more real-world-oriented than most science AI).
> The part I don’t understand is why you’re saying that this is “simpler”? It seems equally complex in Kolmogorov complexity and computational complexity.
What I’m saying is “simpler” is that, given a problem that doesn’t need to depend on the actual effects of the outputs on the future of the real world (operating in a simulation is an example, though one that could become riskily close to the real world depending on the information taken into account by the simulation; it might not be a good idea to include highly detailed political risks of other humans thwarting construction in a fusion-reactor construction simulation, for example), it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take into account the effects of the output on the future of the real world anyway.
I feel like you’re proposing two different types of AI and I want to disambiguate them. The first one, exemplified in your response to Peter (and maybe referenced in your first sentence above), is a kind of research assistant that proposes theories (after having looked at data that a scientist is gathering?), but doesn’t propose experiments and doesn’t think about the usefulness of its suggestions/theories. Like a Solomonoff inductor that just computes the simplest explanation for some data? And maybe some automated approach to interpreting theories?
The second one, exemplified by the chess analogy and last paragraph above, is a bit like a consequentialist agent that is a little detached from reality (can’t learn anything, has a world model that we designed such that it can’t consider new obstacles).
Do you agree with this characterization?
> What I’m saying is “simpler” is that, given a problem that doesn’t need to depend on the actual effects of the outputs on the future of the real world […], it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take into account the effects of the output on the future of the real world anyway.
I accept chess and formal theorem-proving as examples of problems where we can define the solution without using facts about the real-world future (because we can easily write down a formal definition of what the solution looks like).
For a more useful problem (e.g. curing a type of cancer), we (the designers) only know how to define a solution in terms of real-world future states (patient is alive, healthy, non-traumatized, etc.). I’m not saying there doesn’t exist a definition of success that doesn’t involve referencing real-world future states. But the AI designers don’t know it (and I expect it would be relatively complicated).
My understanding of your simplicity argument is that it is computationally cheaper for a trained AI to discover during training a non-consequence definition of the task, despite a consequentialist definition being the criterion used to train it? If so, I disagree that computational cost is very relevant here; generalization (to novel obstacles) is the dominant factor determining how useful this AI is.
The difference is in the size of the economic output. Even modern GPT-4 can “do science” in the sense of “produce scientific-looking output”, but “do science” in the sense of “find novel (surprising to domain experts), economically valuable discoveries” is a totally different thing.
“Design a fusion rocket” can be done at multiple levels of quality. If you mean by this “output instructions using which a team of moderately competent engineers can launch a fusion rocket on the first try”, I think the corresponding cognitive engine has all the elements necessary to make it generically dangerous.
I see Simon’s point as my crux as well, and am curious to see a response.
It might be worth clarifying two possible reasons for disagreement here; is either of the below assumed by the authors of this post?
(1) Economic incentives just mean that the AI built will also handle the economic transactions, procurement processes, and other external-world tasks related to the science/math problems it’s tasked with. I find this quite plausible, but I suspect the authors do not intend to assume this?
(2) Even if the AI training is domain-specific/factored (i.e. it only handles actions within a specified domain), I’d expect some optimization pressure to be unrelated to the task/domain and to instead come from external-world costs, e.g. compute or synthesis costs. I’d expect such leakage to involve OOMs less optimization power than the task(s) at hand, and not to matter before godlike AI. Insofar as that leakage is crucial to Jeremy and Peter’s argument, I think this should be explicitly stated.
We aren’t implicitly assuming (1) in this post. (Although I agree there will be economic pressure to expand the use of powerful AI, and this adds to the overall risk).
I don’t understand what you mean by (2). I don’t think I’m assuming it, but can’t be sure.
One hypothesis: that AI training might (implicitly? through human algorithm iteration?) involve a pressure toward compute-efficient algorithms? Maybe you think that this is a reason we expect consequentialism? I’m not sure how that would relate to the training being domain-specific, though.