KataGo misjudging the safety of certain groups seems like a pretty significant blow to the Natural Abstractions Hypothesis to me.
It seems that it takes something more to arrive at abstractions that align with intuitive human abstractions: even in this case, where the human version is unambiguously correct relative to what the AI was actually trained to do, and even though it reached superhuman ability at Go, it still did not arrive at the same concept.
I don’t doubt that a more powerful AI would be able to arrive at the correct concept in this case, given how unambiguously correct this abstraction is at this level. But we already knew that was the easy case, where we can put the relevant criteria directly in the loss function. In the case that actually matters, the AI needs to learn how to be Good even though we don’t know how to put that directly in the loss function. Its failure at the easy version of this problem seems like significant evidence that it won’t converge to the same sort of intuitive concepts we use in the harder cases, where even humans don’t converge all that well.
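To make the “easy case” concrete: the criterion KataGo is actually trained against, winning under a Tromp-Taylor-style ruleset, is a short, exactly computable function, so it can be put straight into the training signal. Here is a minimal sketch of that scoring rule (simplified: komi omitted and dead stones assumed already captured; the function is illustrative, not KataGo’s code). There is no analogous function we know how to write down for “be Good”.

```python
from collections import deque

def area_score(board):
    """Tromp-Taylor-style area count for a finished Go position.

    board: list of rows, each entry 'B', 'W', or '.'.
    Assumes dead stones have already been captured or removed.
    Returns (black_points, white_points).
    """
    rows, cols = len(board), len(board[0])
    score = {'B': 0, 'W': 0}
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if board[r][c] in score:
                score[board[r][c]] += 1      # stones count for their color
            elif not seen[r][c]:
                # Flood-fill this empty region, tracking which colors border it.
                region_size, borders = 0, set()
                queue = deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    region_size += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols:
                            if board[ny][nx] == '.' and not seen[ny][nx]:
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                            elif board[ny][nx] in score:
                                borders.add(board[ny][nx])
                # Empty territory counts only if it touches exactly one color.
                if len(borders) == 1:
                    score[borders.pop()] += region_size

    return score['B'], score['W']
```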
This is an interesting point, but it doesn’t undermine the case that deceptive alignment is unlikely. Suppose a model doesn’t have the correct abstraction for the base goal, but its internal goal is the closest abstraction it has to the base goal. Because the model doesn’t understand the correct abstraction, it can’t instrumentally optimize for the correct abstraction rather than its flawed one, so it can’t be deceptively aligned. When it messes up because its goal is flawed, the training update should push its abstraction closer to the correct one, the model’s goal will keep pointing at that improved abstraction, and its alignment will improve. This should continue until its abstraction of the base goal is correct. For more details, see my comment here.
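As a toy numerical sketch of that dynamic (nothing like how real models represent goals; the setup and names below are made up purely for illustration): treat the base goal as a fixed scoring rule and the model’s internal goal as its current best approximation of it. Every time the approximation disagrees with the base goal on a training example, the update shrinks the disagreement, so an agent that keeps optimizing its current proxy becomes better aligned as training goes on.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Base goal": a fixed scoring rule that the training signal has access to.
true_w = np.array([2.0, -1.0, 0.5])

# The model's internal goal: its current, initially flawed, abstraction of it.
proxy_w = np.zeros(3)

lr = 0.05
for step in range(2000):
    x = rng.normal(size=3)                # a training situation
    error = x @ proxy_w - x @ true_w      # how badly the proxy "messes up" here
    proxy_w -= lr * error * x             # the update pushes the abstraction
                                          # toward the base goal
print(np.round(proxy_w, 3))               # converges to roughly [2., -1., 0.5]
```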
You’re not wrong that this is problematic for the natural abstractions hypothesis, and this definitely suggests that my optimism on natural abstractions needs to be lowered.
However, this doesn’t yet change my position that capabilities work is net positive, for two reasons:
1. Deceptive alignment was arguably the main risk, in that it posed a problem for iteration schemes, and if we remove that problem, a lot of other problems become iterable. In my model, pretty much all problems in AI safety that can be iterated away will be iterated away by default, so we have to focus on the problems that are not amenable to iteration, and right now I see the natural abstractions problem as quite iterable.
2. We have reason to suspect that this failure is a capabilities failure: convolutional neural nets implement something like a game of telephone (a rough sketch of what I mean follows below), whereas, as far as we know, there is no particular reason to expect other architectures to share this failure mode. And since you already grant that a more powerful AI would arrive at the correct concept in the Go case, this implies that natural abstractions are an iterable problem.
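To gesture at the “game of telephone” point (a rough sketch, not an analysis of KataGo’s actual network; the helper name is made up for illustration): a stack of stride-1 3×3 convolutions relays information at most one board point per layer, so a judgement that depends on stones far apart, like the status of a group sprawling across the board, has to be passed through many intermediate layers, with a chance to lose the signal at every step.

```python
import math

def layers_to_connect(distance, kernel=3):
    """Minimum number of stride-1 conv layers before a feature at one point
    can depend at all on the input at a point `distance` intersections away
    (Chebyshev distance). Each k x k layer extends the dependence radius by
    (k - 1) // 2, i.e. one board point for 3x3 kernels."""
    return math.ceil(distance / ((kernel - 1) // 2))

# The status of a sprawling group can hinge on stones most of a 19x19 board
# apart, so the judgement has to be relayed, and re-encoded, many times.
for d in (2, 6, 12, 18):
    print(f"stones {d:>2} points apart: needs >= {layers_to_connect(d)} layers to interact")
```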