I agree that the danger may come from AIs trying to achieve real-world future effects (note that this could include an AI wanting to run specific computations, and so taking real-world actions in order to get more compute). The difficulty is in getting an AI to optimize only within a safe, siloed, narrow domain (like an AI playing chess).
There are multiple reasons why I think this is extremely hard to achieve for a science-capable AI.
1. Science is usually a real-world task. It involves contact with reality: taking measurements, doing experiments, analyzing the results, and iterating on the experiments. If you are asking an AI to do this kind of (experimental) science, then you are asking it to achieve real-world outcomes. For the “fusion rocket” example, I think we don’t currently have good enough simulations to actually design a working fusion rocket, and so the process would require interacting with the real world in order to build a good enough simulation (I think the same likely applies to the kind of simulations required for nanotech).
I think this applies to some alignment research (the kind that involves interacting with humans and refining fuzzy concepts, and also the kind that has to work with the practicalities of running large-scale training runs, etc.). It applies less to math-flavored things, where maybe (maybe!) we can get the AI to only know math and be trained only on optimizing math objectives.
2. Even if an AI is only trained in a limited domain (e.g. math), it can still have objectives that extend outside of this domain (and such objectives can extrapolate in unpredictable ways). As an example, if we humans discovered we were in a simulation, we could easily have goals that extend outside of the simulation (the obvious one being to make sure the simulators didn’t turn us off). Chess AIs don’t develop goals about the real world because they are too dumb.
3. Optimizing for a metric like “scientific value” inherently routes through the real world, because this metric is (I assume) coming from humans’ assessment of how good the research was. It isn’t a precisely defined mathematical object that you can feed a document into and get an objective measure out of. Instead, you give some humans a document, and they think about it and how useful it is: How does this fit with the rest of the project? How does it help the humans achieve their (real-world!) goals? Is it well written, such that the humans find it convincing? In order to do good research, the AI must be considering these questions. The question “Is this good research?” isn’t something objective, and so I expect that if the AI is able to judge this, it will be thinking about the humans and the real world. (See the toy sketch at the end of this comment for the contrast I have in mind.)

Because the human is part of the real world and is judging research based on how useful they think it will be in the real world, this makes the AI’s training signal about the real world. (Note that this doesn’t mean the AI will end up optimizing for this reward signal directly, but that doing well according to the reward signal does require conceptualizing the real world.) This especially applies to alignment research, where (apart from a few well-scoped problems) humans will be judging the research based on their subjective impressions rather than some objective measure.
4. If the AI is trained with methods similar to today’s (a massive pretrain on a ton of data, likely a substantial fraction of the internet, followed by finetuning), then it will likely know a bunch of things about the real world, and it seems extremely plausible that it forms goals based on these. This can apply even if we attempt to strip out a bunch of real-world data from the training, e.g. by only training on math textbooks. This is because a person had to write those math textbooks, so they still contain substantial information about the world (e.g. math books can use examples about the world, or make analogies to real-world things). I do agree that training only on math textbooks (likely stripped of obvious real-world references) probably makes an AI more domain-limited, but it also isn’t clear how much useful work you can get out of it.
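To make point 3 concrete, here is a purely illustrative sketch (the names are hypothetical, not anyone’s actual training setup) of the difference between a reward that is a mechanical function of the output and one that routes through human judgment:

```python
from dataclasses import dataclass
from typing import Callable, List

def reward_from_checker(output: str, checker: Callable[[str], bool]) -> float:
    """'Objective' reward: a mechanical check (e.g. a proof verifier) that
    never consults anything outside the formal problem."""
    return 1.0 if checker(output) else 0.0

@dataclass
class Reviewer:
    """Stand-in for a human whose rating reflects their beliefs about
    real-world usefulness, fit with the project, persuasiveness, etc."""
    rate: Callable[[str], float]

def reward_scientific_value(output: str, reviewers: List[Reviewer]) -> float:
    """'Scientific value' reward: the signal is whatever the humans think,
    so doing well requires modeling the humans and their world."""
    return sum(r.rate(output) for r in reviewers) / len(reviewers)
```

The point is only about where the signal comes from: the second function cannot be evaluated without (models of) humans in the loop.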
Fair enough: a fully automated do-everything science-doer would, in order to do everything science-related, have to do real-world tasks and would thus be dangerous. That being said, I think there’s plenty of room for “doing science” (up to some reasonable level of capability) without going all the way to automating the real-world aspects: you can still have an assistant that thinks up theory for you; you just can’t have something that does the experiments as well.
Part of your comment (e.g. point 3) relates to how the AI would in practice be rewarded for achieving real-world effects, which I agree is a reason for concern. Thus, as I said, “you might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though”.
Your comment goes beyond this, however, and seems to assume in some places that merely knowing about or conceptualizing the real world will lead to “forming goals” about the real world.
I actually agree that this may be the case for an AI that self-improves: if an AI with a slight tendency toward a real-world goal self-modifies, that tendency will tend to direct it to strengthen its alignment to that real-world goal, whereas its tendencies not directed toward real-world goals will in general happily let themselves be overwritten.
If the AI does not self-improve, however, then I do not see that as being the case.
If the AI is not being rewarded for real-world effects, but is instead being rewarded for scientific outputs that are “good” according to some criteria that do not depend on their real-world effects, then it will learn to generate outputs that are good according to those criteria. I don’t think that would, in general, lead it to select actions in order to steer the world to some particular world-state. To be sure, these outputs would have effects on the real world (a design for a fusion reactor would tend to lead to a fusion reactor being constructed, for example), but if the particular outputs are not rewarded based on the real-world outcome, then they will also not tend to be selected based on the real-world outcome.
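As a purely illustrative sketch of the asymmetry I mean (the names here are hypothetical, not a claim about any actual setup): the grader below sees only the output itself, so outputs that lead to very different downstream consequences receive identical training signals.

```python
from typing import Callable

# The grader is a function of the output text alone.
Grader = Callable[[str], float]

def training_signal(output: str, grade: Grader) -> float:
    # Nothing about what later happens in the world enters this computation,
    # so the learning signal cannot distinguish between downstream outcomes.
    return grade(output)

def outcome_signal(world_state_after_the_output_is_used: dict) -> float:
    # An outcome-based signal would have to take the post-hoc state of the
    # world as an input; deliberately unimplemented, the signature is the point.
    raise NotImplementedError
```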
Some less relevant nitpicks of points in your comment:
“Even if an AI is only trained in a limited domain (e.g. math), it can still have objectives that extend outside of this domain”
If you train an AI on some very particular math then it could have goals relating to the future of the real world. I think, however, that the math you would need to train it on to get this effect would have to be very narrow, and would likely have to either be derived from real-world data or involve the AI studying itself (which is a component of the real world, after all). I don’t think this happens when you generically train an AI on math.
“As an example, if we humans discovered we were in a simulation, we could easily have goals that extend outside of the simulation (the obvious one being to make sure the simulators didn’t turn us off).”
True, but see above and below.
“Chess AIs don’t develop goals about the real world because they are too dumb.”
If you have something trained by gradient descent solely on doing well at chess, it’s not going to consider anything outside the chess game, no matter how many parameters and how much compute it has. Any consideration of outside-of-chess factors lowers the resources available for chess, and is selected against until it reaches the point of subverting the training regime (which it never reaches, since it is selected against before then).
Even if you argue that once it is smart enough the additional computing power is neutral, gradient descent doesn’t actually reward out-of-context thinking for chess, so such thinking couldn’t develop except by sheer chance or as a side effect of thinking about chess itself. But chess is a mathematically “closed” domain, so there doesn’t seem to be any reason out-of-context thinking would develop.
The same applies to math in general, where the math doesn’t deal with the real world or the AI itself. This is a narrower and more straightforward case than scientific research in general.
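A toy illustration of what I mean (purely a sketch: a one-step game stands in for chess, and the numbers are arbitrary). The only quantity that ever touches the parameters is the in-game reward, so the update carries no information about anything outside the game:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)  # toy "policy" parameters: preferences over 3 moves

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def in_game_reward(move: int) -> float:
    # Stand-in for "did you win the game?": a function of the game alone.
    return 1.0 if move == 2 else 0.0

for _ in range(2000):
    probs = softmax(theta)
    move = rng.choice(3, p=probs)
    r = in_game_reward(move)
    # REINFORCE update: reward * grad(log pi(move)). The only signal shaping
    # theta is the in-game reward; nothing outside the game enters here.
    theta += 0.1 * r * (np.eye(3)[move] - probs)

print(softmax(theta))  # probability mass concentrates on the rewarded move
```

A real chess model is vastly more complicated, but the structural point is the same: the update rule only ever “sees” the game.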
Thanks for the comment :)