I don’t think I completely grok the distinction you’re trying to point at with “Shape of problem” vs “How capabilities decompose”.
I guess “Shape of problem” is about systematic incentives that will be present, like inductive biases in our training procedures, while “How capabilities decompose” is about how easy/natural it is for a mind to solve the task without solving other tasks. The latter is about “minds in general” and the former about “minds trained by us”?
But then I don’t understand some of your classifications. For example, how is “it stumbles into human-friendliness before x-risk capability” a claim about the shape of the problem (rather than one that also depends on how hard the tasks of making humans extinct, understanding/imitating humans, etc. are), while things like “IDA does/doesn’t converge to deception (because of obfuscated arguments etc.)” (which would be part of Scalable Oversight) are classified not as shape of the problem but as capabilities decomposition?
I feel like this is a pretty blurry line for classifying evidence (and thus maybe not the most useful one, but I’m not sure).