One of the main reasons I expect this to not work is that optimization algorithms which are the best at optimizing some objective given a fixed compute budget seem like they basically can’t be generally retargetable. E.g. if you consider something like Stockfish, it’s a combination of search (which is retargetable) sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget Stockfish to “maximize the max number of pawns you ever have”, you would not be able to use its heuristics [specialized for telling whether a move is likely to win the game] to speed up your search for moves. A more extreme example: the entire endgame tablebase is useless to you, and you probably have to recompute the whole thing.
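To make this concrete, here is a minimal, purely illustrative Python sketch (not Stockfish’s actual architecture; the toy game and every name in it are invented): a generic alpha-beta search in which the objective (`evaluate`) and the speed-up machinery (`order_moves` plus pruning) are separate arguments. The recursion is retargetable; the pruning only pays off when the move ordering reflects the objective you are actually searching for.

```python
# Minimal sketch: the search routine is objective-agnostic -- you can swap in
# any evaluation function -- but the *speed* comes from pruning guided by an
# objective-specific heuristic. The game is a toy pile game, purely illustrative.

from typing import Callable, List, Tuple

State = Tuple[int, ...]  # sizes of piles in a toy Nim-like game


def legal_moves(state: State) -> List[State]:
    """All states reachable by removing 1 or 2 stones from one pile."""
    moves = []
    for i, pile in enumerate(state):
        for take in (1, 2):
            if pile >= take:
                moves.append(state[:i] + (pile - take,) + state[i + 1:])
    return moves


def search(state: State,
           depth: int,
           evaluate: Callable[[State], float],
           order_moves: Callable[[List[State]], List[State]] = lambda ms: ms,
           alpha: float = float("-inf"),
           beta: float = float("inf")) -> float:
    """Generic alpha-beta (negamax) search. `evaluate` and `order_moves` are the
    retargetable / heuristic parts; the recursion itself is objective-agnostic."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)
    best = float("-inf")
    for child in order_moves(moves):
        best = max(best, -search(child, depth - 1, evaluate, order_moves,
                                 -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:  # pruning: only effective if the move ordering
            break          # reflects the *current* objective
    return best


# Objective A: a stand-in "win the game" evaluation. Objective B: "keep as many
# stones as possible". A move-ordering heuristic tuned for A prunes well for A
# but is roughly useless for B -- the search still works, just much more slowly.
eval_win = lambda s: -sum(s)
eval_hoard = lambda s: sum(s)

print(search((3, 4, 5), depth=6, evaluate=eval_win))
print(search((3, 4, 5), depth=6, evaluate=eval_hoard))
```

Swapping `eval_win` for `eval_hoard` still returns an answer, but any ordering heuristic tuned for winning stops buying you pruning.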
Something like [the strategy stealing assumption](https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334) is needed even to establish the existence of a set of heuristics that speeds up the search for moves that “maximize the max number of pawns you ever have” as much as the [telling whether a move will win the game] heuristics speed up the search for winning moves. Actually finding that set of heuristics is probably going to require an entirely parallel learning process.
This also implies that even if your AI has the concept of “human values” in its ontology, you still have to do a bunch of work to get an AI that can actually estimate the long-run consequences of any action on “human values”, or else it won’t be competitive with AIs that have more specialized optimization algorithms.
I agree. My comment here on Rohin and John’s thread is a poor attempt at saying something similar, but also observing that having the machinery to do the ‘find the good heuristics’ thing is itself a (somewhat necessary?) property of ‘recursive-ish search’ (at least of the flavour applicable to high-dimensional ‘difficult’ problem-spaces). In humans and animals I think this thing is something like ‘motivated exploration’ aka ‘science’ aka ‘experimentation’, plus magic abstraction-formation and -recomposition.
I think it’s worth trying to understand better how these pieces fit together, and to what extent these burdens can (or will) be overcome by compute and training scale.
This seems like a good argument that retargeting the search in a trained model won’t turn out to be a successful strategy. But if we get to the point where we can detect such a search process in a model, and what its target is, even if its efficiency is enhanced by specialized heuristics, doesn’t that buy us a lot even without the retargeting mechanism?
We could use that info about the search process to start over and re-train the model, modifying parameters to try to guide it toward learning the optimization target we want it to learn. Re-training is far from cheap on today’s large models, but you might not have to go through the entire training process before the optimizer emerges and settles on a stable optimization target. This could let us iterate on the search target and verify that we have the one we want before deploying the model in an unsafe environment.
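As a sketch of what that iteration loop might look like, here is a deliberately toy, hypothetical Python version; every function in it is a placeholder, and `detect_search_target` in particular stands in for interpretability tooling that does not currently exist.

```python
# Hypothetical sketch of the iterate / probe / retrain loop described above.
# Nothing here is a real training or interpretability API; the "model" is a
# dict and "training" is random drift, purely to show the control flow.

import random


def detect_search_target(model, config):
    """Placeholder probe: pretend a target becomes detectable once enough
    (mock) training has happened, and that it depends on the training config."""
    if model["params"] > 1.0:
        return config["nudge"]
    return None


def train_until_optimizer_emerges(config, probe_every=100, max_steps=10_000):
    """Stop as soon as a (mock) optimizer with a detectable target appears,
    rather than running the full training budget."""
    model = {"params": random.random(), "target": None}
    for step in range(max_steps):
        model["params"] += 0.01 * random.random()        # stand-in for an SGD step
        if step % probe_every == 0:
            target = detect_search_target(model, config)  # hypothetical probe
            if target is not None:
                model["target"] = target
                return model, target
    return model, None


def iterate_on_target(desired_target):
    """Outer loop: re-train with adjusted parameters until the detected
    optimization target is the one we want, before any deployment."""
    config = {"nudge": "proxy_reward"}
    while True:
        model, target = train_until_optimizer_emerges(config)
        if target == desired_target:
            return model
        config["nudge"] = desired_target                  # crude 'adjust and retry'


model = iterate_on_target("intended_objective")
print(model["target"])
```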
Are you saying that the AIs we train will be optimization algorithms that are literally the best at optimizing some objective given a fixed compute budget? Can you elaborate on why that is?
Not literally the best, but retargetable algorithms sit at the “fully general” end of the spectrum from “fully specialized” to “fully general”, and I expect most tasks we train AIs to do to have heuristics that enable solving the tasks much faster than “fully general” algorithms, so there’s decently strong pressure to be towards the “specialized” side.
I also think that heuristics are going to provide closer to multiplicative speed-ups than additive ones, so it’s going to be closer to “general algorithms just can’t compete” than “they’re just a little worse”. E.g. random search is terrible compared to anything exploiting non-trivial structure (random sorting vs quicksort is, I think, a representative example, where you go from exponential to quasilinear by specializing to your domain).
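To make the multiplicative-gap point concrete, here is a small runnable illustration of the sorting comparison above; nothing about it is AI-specific, it just shows the exponential-versus-quasilinear scaling in miniature. “Random sorting” (bogosort) blindly searches over permutations, while quicksort exploits the comparison structure of the problem.

```python
# Unstructured search vs structure-exploiting algorithm on the same problem.
# Bogosort does roughly O(n * n!) expected work; quicksort does O(n log n)
# expected comparisons -- a growing multiplicative gap, not an additive one.

import random


def bogosort_shuffles(xs):
    """Return how many random shuffles it takes to stumble on the sorted order."""
    xs = list(xs)
    count = 0
    while xs != sorted(xs):
        random.shuffle(xs)
        count += 1
    return count


def quicksort_comparisons(xs):
    """Return how many comparisons quicksort uses on the same input."""
    comparisons = 0

    def qs(lst):
        nonlocal comparisons
        if len(lst) <= 1:
            return lst
        pivot, rest = lst[0], lst[1:]
        comparisons += len(rest)
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return qs(left) + [pivot] + qs(right)

    qs(list(xs))
    return comparisons


for n in range(4, 9):
    data = random.sample(range(100), n)
    print(n, "bogosort shuffles:", bogosort_shuffles(data),
          "quicksort comparisons:", quicksort_comparisons(data))
```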
> I expect most tasks we train AIs to do to have heuristics that enable solving the tasks much faster than “fully general” algorithms, so there’s decently strong pressure to be towards the “specialized” side
I basically buy that claim. The catch is that those specialized AIs won’t be AGIs, for obvious reasons, and at the end of the day it’s the AGIs which will have most of the X-risk impact.
OK, cool. How do you think generalization works? I thought the idea was that instead of finding a specific technique that only works on the data you were trained on, sufficiently big NNs trained on sufficiently diverse data end up finding more general techniques that work on that data + other data that is somewhat different.
Generalization ability is a key metric for AGI, which I expect to go up before the end. Like John said, the kinds of AI we care about are the kinds that are pretty good at generalizing, meaning that they ARE close to the “fully general” end of the spectrum, or at least close enough that whatever they are doing can be retargeted to lots of other environments and tasks besides the exact ones they were trained on. Otherwise, they wouldn’t be AGI.

Would you agree with that? I assume not...
Humans, despite being fully general, have vastly varying ability at different tasks, e.g. they’re much better at climbing mountains than at playing Go, it seems. Humans also routinely construct entire technology bases to enable them to do tasks that they cannot do themselves. This is, in some sense, a core human economic activity: the construction of artifacts that can do tasks better/faster/more efficiently than humans can themselves. It seems like by default you should expect a similar dynamic with “fully general” AIs. That is, AIs trained to do semiconductor manufacturing will create their own technology bases, specialized predictive artifacts, etc., and not just “think really hard” and “optimize within their own head.” This also suggests a recursive form of the alignment problem, where an AI that wants to optimize human values is in a similar situation to us: it’s easy to construct powerful artifacts with SGD that optimize measurable rewards, but it doesn’t know how to do that for human values/things that can’t be measured.
Even if you’re selecting reasonably hard for “ability to generalize”, by default the range of tasks you’re selecting over aren’t all going to be “equally difficult”, and you’re going to get an AI that is much better at some tasks than others, has heuristics that enable it to accurately predict key intermediates across many tasks, heuristics that enable it to rapidly determine which portions of the action space are even feasible, etc. Asking that your AI can also generalize to “optimize human values” as well as the best available combination of skills it has otherwise seems like a huge ask. Humans, despite being fully general, find it much harder to optimize for some things than others, e.g. constructing large cubes of iron versus status-seeking, despite being able in theory to optimize for constructing large cubes of iron.
Nobody is asking that the AI can also generalize to “optimize human values as well as the best available combination of skills it has otherwise”... at least, I wasn’t asking that. (At no point did I assume that fully general means ‘equally good’ at all tasks. I am not even sure such comparisons can be made.) But now, rereading your comments, it seems you were asking that all along, since you brought up competitiveness worries. So now maybe I understand you better: you are assuming a hypercompetitive takeoff in which, if there are AIs running around optimized to play the training game or something, and we then use interpretability tools to intervene on some of them and make them optimize for long-run human values instead, they won’t be as good at it as they were at playing the training game, even though they will be able to do it (compare: humans can optimize for constructing large cubes of iron, but they aren’t as good at that as they are at optimizing for status), and so they’ll lose competitions to the remaining AIs that haven’t been modified?
(My response to this would be ah, this makes sense, but I don’t expect there to be this much competition so I’m not bothered by this problem. I think if we have the interpretability tools we’ll probably be able to retarget the search of all relevant AIs, and then they’ll optimize for human values inefficiently but well enough to save the day.)
I think competitiveness matters a lot even if there’s only a moderate amount of competitive pressure. The gaps in efficiency I’m imagining are less “10x worse” and more “I only had support vector machines and you had SGD”.