I am definitely semi-agnostic about whether SGD will ultimately be the base optimizer of choice, and about whether an inner algorithm will outperform SGD and cause a fast takeoff.
But suppose you are right that a fast takeoff happens; my response is that this would still leave the proposed alignment schemes intact, for the following reasons:
Even if fast takeoff happens, the sharp left turn in the form of misgeneralization is still less likely to happen than it was for evolution, because unlike evolution, we don't repeatedly spin up fresh versions of an AI; we retain the same AI throughout the training run.
It mostly doesn't affect how easy values are to learn, and the trick of using our control of SGD as the innate reward system still works: weak genetic priors that are easy to trick, plus the innate reward system's local update rule, suffice to make people reliably develop values like empathy for the ingroup (the toy sketch a few paragraphs below illustrates the "controlled reward system" part).
Unlike evolution, SGD still has really strong corrective properties against inner-misaligned agents.
I do agree that fast takeoff complicates the analysis, but I don't think it breaks the alignment methods described in the post. If aligning a model required very strong priors (whereas with SGD we can align models to reward functions far more complicated than genetic priors can encode), or if we couldn't control the innate reward system, this would be a much bigger issue.
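To make the "SGD as a controllable innate reward system" point concrete, here is a minimal toy sketch (my own illustration, not from the post; `innate_reward` is a hypothetical stand-in for the reward signal we control): a purely local, reward-driven update rule is enough to push a weakly-initialized policy toward the rewarded behavior.

```python
# Toy sketch (my own illustration): a designer-controlled reward function
# plays the role of the "innate reward system", and SGD's local update
# rule shapes a tiny policy's "values" toward whatever gets reinforced.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
logits = np.zeros(n_actions)          # the "weak prior": uniform preferences

def innate_reward(action: int) -> float:
    # Assumption for illustration: we, the designers, fully control this
    # signal, analogous to controlling the reward model during training.
    return 1.0 if action == 2 else 0.0

lr = 0.5
for step in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    action = rng.choice(n_actions, p=probs)
    r = innate_reward(action)
    # REINFORCE-style local update: raise the log-probability of rewarded
    # actions. No global foresight, just a local rule.
    grad = -probs
    grad[action] += 1.0
    logits += lr * r * grad

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(np.round(probs, 3))  # probability mass concentrates on the rewarded action
```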
I think there are plausible stories in which a sharp left turn could happen (though, as you've pointed out, it is extremely unlikely under the current deep learning paradigm).
For example, suppose it turns out that a class of algorithms I'll simply call heuristic AIXI is much more powerful than the current deep learning paradigm.
The idea behind this class of algorithms is that you basically do evolution, but instead of relying on blind hill-climbing, you periodically ask "what is the best learning algorithm I currently have?" and then apply that algorithm to your entire search process. Because you are constantly upgrading the learning algorithm itself, you could get the same sort of 1Mx overhang that caused the sharp left turn in human evolution.
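To make the self-referential structure explicit, here is a hedged toy sketch of what I mean (entirely my own construction; `blind_mutation` and `greedy_local_search` are hypothetical stand-ins for "evolution's move" and a "discovered better optimizer"): the outer loop periodically replaces its own search procedure with the best optimizer it has found so far.

```python
# Toy sketch of the self-referential loop described above (my own
# illustrative construction, not an established algorithm).
import random

def fitness(x):
    # Toy objective standing in for "capability of the learned system".
    return -sum(v * v for v in x)

def blind_mutation(x, rng):
    # Evolution's move: random perturbation, no structural insight.
    return [v + rng.gauss(0, 0.5) for v in x]

def greedy_local_search(x, rng):
    # A "smarter" optimizer the process might discover: pick the best of
    # several proposals rather than a single blind one.
    proposals = [[v + rng.gauss(0, 0.5) for v in x] for _ in range(8)]
    return max(proposals, key=fitness)

rng = random.Random(0)
x = [5.0, -3.0]
optimizer = blind_mutation                 # start with evolution-style search
candidate_optimizers = [blind_mutation, greedy_local_search]

for step in range(100):
    new_x = optimizer(x, rng)
    if fitness(new_x) > fitness(x):
        x = new_x
    if step % 25 == 24:
        # The self-referential step: ask "what is the best learning
        # algorithm I currently have?" and adopt it as the outer loop.
        optimizer = max(
            candidate_optimizers,
            key=lambda opt: fitness(opt(x, rng)),
        )

print(optimizer.__name__, [round(v, 3) for v in x])
```

The candidate optimizers are fixed here for simplicity; the worrying case is a process that can also invent new ones.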
The obvious counter is that if we think heuristic AIXI is not safe, then we should simply not use it. But the obvious counter to that is: when have humans ever not done something because someone else told them it wasn't safe?
I definitely agree that evolutionary strategies being effective would weaken my entire case. I do think that evolutionary methods like GAs are too hobbled by their inability to exploit white-box gradient information the way SGD can, but we shall see.
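As a minimal illustration of that difference (a toy example under my own assumptions, not something from the post): on the same objective, a GA-style step can only compare fitness values, while an SGD step can read off the gradient directly.

```python
# Toy contrast between the two update rules: the GA step treats the
# objective as a black box, the SGD step exploits white-box gradients.
import numpy as np

rng = np.random.default_rng(1)

def loss(w):
    return float(np.sum(w ** 2))

def grad(w):
    # White-box information SGD can use but a GA cannot.
    return 2 * w

w_ga = np.array([3.0, -4.0])
w_sgd = w_ga.copy()

for _ in range(50):
    # GA-style step: mutate, keep the child only if its fitness improved.
    child = w_ga + rng.normal(0, 0.3, size=w_ga.shape)
    if loss(child) < loss(w_ga):
        w_ga = child
    # SGD step: move directly along the gradient.
    w_sgd = w_sgd - 0.1 * grad(w_sgd)

print("GA loss: ", round(loss(w_ga), 4))
print("SGD loss:", round(loss(w_sgd), 4))
```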
I genuinely don't know if heuristic AIXI is a real thing or not, but if it is, it combines the ability to search the whole space of possible algorithms (which evolution has but SGD doesn't) with the ability to exploit higher-order statistics (as SGD does but evolution doesn't).
My best guess is that, just as there was a deep learning regime that only got unlocked once we had tons of compute from GPUs, there is also a heuristic AIXI regime that unlocks at some level of compute.