Fair enough: a fully automated, do-everything science-doer would, in order to do everything science-related, have to do real-world tasks and would thus be dangerous. That said, I think there’s plenty of room for “doing science” (up to some reasonable level of capability) without going all the way to automating the real-world aspects: you can still have an assistant that thinks up theory for you; you just can’t have something that does the experiments as well.
Part of your comment (e.g. point 3) relates to how the AI would in practice be rewarded for achieving real-world effects, which I agree is a reason for concern. Thus, as I said, “you might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though”.
Your comment goes beyond this, however, and in places seems to assume that merely knowing about or conceptualizing the real world will lead to “forming goals” about the real world.
I actually agree that this may be the case for an AI that self-improves: if an AI with a slight tendency toward a real-world goal self-modifies, that tendency will tend to direct it to strengthen its alignment with that real-world goal, whereas its tendencies not directed toward real-world goals will in general happily let themselves be overwritten.
If the AI does not self-improve, however, then I do not see that as being the case.
If the AI is not being rewarded for real-world effects, but instead for scientific outputs that are “good” according to some criteria that do not depend on their real-world effects, then it will learn to generate outputs that are good according to those criteria. I don’t think that would, in general, lead it to select actions that steer the world toward some particular world-state. To be sure, these outputs would have effects on the real world: a design for a fusion reactor would tend to lead to a fusion reactor being constructed, for example. But if the particular outputs are not rewarded based on the real-world outcome, then they will also not tend to be selected based on the real-world outcome.
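To make the setup I have in mind concrete, here is a minimal toy sketch (entirely my own illustration, with made-up placeholder criteria): the training signal is computed from the generated output alone, and no measurement of downstream real-world consequences ever enters it.

```python
# Toy sketch of an output-only reward (placeholder criteria, purely illustrative).
def score_output(output: str) -> float:
    # Stand-in for "goodness" criteria judged from the text itself,
    # e.g. internal consistency or agreement with known results.
    return float(len(set(output.split())))  # placeholder, not a real criterion

def training_signal(model_output: str) -> float:
    # Note what is absent: no simulator of the world, no check on whether
    # anyone later builds anything from the output.
    return score_output(model_output)

print(training_signal("a design for a fusion reactor"))
```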
Some less relevant nitpicks of points in your comment:
Even if an AI is only trained in a limited domain (e.g. math), it can still have objectives that extend outside of this domain
If you train an AI on some very particular math, then it could have goals relating to the future of the real world. I think, however, that the math you would need to train it on to get this effect would have to be very narrow, and would likely either have to be derived from real-world data or involve the AI studying itself (which is a component of the real world, after all). I don’t think this happens when generically training an AI on math.
As an example, if we humans discovered we were in a simulation, we could easily have goals that extend outside of the simulation (the obvious one being to make sure the simulators didn’t turn us off).
True, but see above and below.
Chess AIs don’t develop goals about the real world because they are too dumb.
If you have something trained by gradient descent solely on doing well at chess, it’s not going to consider anything outside the chess game, no matter how many parameters and how much compute it has. Any consideration of outside-of-chess factors lowers the resources available for chess, and is selected against, up until the point where it could subvert the training regime (which it never reaches, since it is selected against before then).
Even if you argue that, if it’s smart enough, additional computing power is neutral, gradient descent doesn’t actually reward out-of-context thinking for chess, so such thinking couldn’t develop except by sheer chance or as a side-effect of thinking about chess itself. But chess is a mathematically “closed” domain, so there doesn’t seem to be any reason out-of-context thinking would develop.
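As a toy illustration of the “no reward for out-of-context thinking” point (my own sketch, not anything from the comment I’m replying to): if the loss depends only on the “chess” part of a network’s computation, gradient descent supplies exactly zero training signal to any parameters that don’t feed into that computation.

```python
import jax
import jax.numpy as jnp

def loss(params, position):
    # Only the "chess" weights influence the (stand-in) move-quality loss;
    # the "unrelated" weights play no role in it at all.
    chess_w, unrelated_w = params
    return jnp.sum((position @ chess_w) ** 2)

params = (jnp.ones((4, 2)), jnp.ones((4, 2)))  # (chess weights, unrelated weights)
position = jnp.ones((1, 4))                    # stand-in for an encoded board

grads = jax.grad(loss)(params, position)
print(grads[1])  # all zeros: the unrelated weights receive no gradient signal
```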
The same applies to math in general, as long as the math doesn’t deal with the real world or with the AI itself. This is a narrower and more straightforward case than scientific research in general.