My sense is that a high level of capability implies (2) but not (1).
Sure, kinda. But (2) is an unstable state. There’s at least some pressure toward (1) both during training and during online activity. This makes (1) very likely eventually, although it’s less clear exactly when.
A human who gets distracted and pursues ice cream whenever they see ice cream is less competent at other things, and will notice this and attempt to correct it within themselves if possible. A person who doesn’t pick up free money on Tuesdays because Tuesday is I-don’t-care-about-money-day will be annoyed about this on Wednesday, and will attempt to correct it in the future.
Competent research requires at least some long-term goals. These provide an incentive for any context-dependent goals to be combined or removed (although the strength of this incentive differs from one case of inconsistency to another, and how difficult an inconsistency is to remove is unclear to me; it seems to depend a lot on the specifics).
And that (1) is way more obviously dangerous
This seems true to me overall, but only because (1) is more capable of competently pursuing long-term plans. Since we’re conditioning on that capability anyway, I would expect everything on the spectrum between (1) and (2) to be potentially dangerous.