You’re not wrong that this is problematic for the natural abstractions hypothesis, and this definitely suggests that my optimism on natural abstractions needs to be lowered.
However, this doesn’t yet change my position capabilities work being net positive, because of 2 reasons:
Deceptive alignment was arguably the main risk in that it posed a problem for iteration schemes, and if we remove that problem, a lot of other problems become iterable. In my model, pretty much all problems in AI safety that can be iterated away will be iterated away by default, so we have to focus on the problems that are not amenable to iteration, and right now I see the natural abstractions problem as quite iterable.
We have reasons to suspect that the failure is a capabilities failure, in that convolutional neural nets implement something like a game of telephone, whereas as far as we know we don’t have good reason to suspect other algorithms would have the failure mode. And since you already suggest how capabilities work solves the natural abstractions problem in the case of Go, then this implies that natural abstractions are an iterable problem.
You’re not wrong that this is problematic for the natural abstractions hypothesis, and this definitely suggests that my optimism on natural abstractions needs to be lowered.
However, this doesn’t yet change my position capabilities work being net positive, because of 2 reasons:
Deceptive alignment was arguably the main risk in that it posed a problem for iteration schemes, and if we remove that problem, a lot of other problems become iterable. In my model, pretty much all problems in AI safety that can be iterated away will be iterated away by default, so we have to focus on the problems that are not amenable to iteration, and right now I see the natural abstractions problem as quite iterable.
We have reasons to suspect that the failure is a capabilities failure, in that convolutional neural nets implement something like a game of telephone, whereas as far as we know we don’t have good reason to suspect other algorithms would have the failure mode. And since you already suggest how capabilities work solves the natural abstractions problem in the case of Go, then this implies that natural abstractions are an iterable problem.