Nice, I was going to write more or less exactly this post. I agree with everything in it, and this is the primary reason I’m interested in mechinterp.
Basically “all” the concepts relevant to safely building an ASI are fuzzy in the way you described: what the AI “values”, corrigibility, deception, instrumental convergence, the degree to which the AI is doing world-modeling, and so on.
If we had a complete science of mechanistic interpretability, I think a lot of the problems would become very easy: “Locate the human flourishing concept in the AI’s world model and jack that into the desire circuit. Afterwards, find the deception feature and the power-seeking feature and turn them to zero, just to be sure.” (This is an exaggeration.)
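To make the exaggeration concrete, here is a purely hypothetical toy sketch in PyTorch of what “turn the deception feature to zero and wire in human flourishing” might look like as an activation edit. Everything here is assumed: the toy block, the deception_dir and flourishing_dir vectors, and the steering coefficient all stand in for things a mature interpretability science would have to actually deliver.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer block acting on a residual stream (purely illustrative).
d_model = 16
block = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

# Hypothetical concept directions we pretend interpretability has already handed us.
# Finding (and trusting) these is the actual open problem.
deception_dir = torch.randn(d_model)
deception_dir = deception_dir / deception_dir.norm()
flourishing_dir = torch.randn(d_model)
flourishing_dir = flourishing_dir / flourishing_dir.norm()

def edit_activations(module, inputs, output):
    """Zero out the 'deception' feature and add some 'human flourishing'."""
    # Project out the deception direction (turn that feature to zero).
    coeff = output @ deception_dir                       # (batch,)
    output = output - coeff.unsqueeze(-1) * deception_dir
    # Crudely "jack the flourishing concept into the desire circuit".
    return output + 2.0 * flourishing_dir

handle = block.register_forward_hook(edit_activations)

x = torch.randn(4, d_model)   # a batch of fake residual-stream states
steered = block(x)            # forward pass with the activation edits applied
handle.remove()
print(steered.shape)          # torch.Size([4, d_model])
```

The hook just projects one direction out of the activations and adds another in; the hard part, which this sketch assumes away, is knowing that those directions faithfully track “deception” and “human flourishing” at all.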
The only thing I disagree with is the Outer Misalignment paragraph. Outer Misalignment seems like one of the issues that wouldn’t be solved, largely due to Goodhart’s-curse-type stuff. This article by Scott explains my hypothetical remaining worries well: https://slatestarcodex.com/2018/09/25/the-tails-coming-apart-as-metaphor-for-life/
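The tails-coming-apart point can be made concrete with a tiny simulation: even when a proxy correlates strongly with the thing you actually care about over the bulk of the distribution, hard selection on the proxy tends to pick points that are mediocre by the true measure. A minimal sketch (the 0.9 correlation and the Gaussian setup are arbitrary illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
rho = 0.9  # proxy/true-value correlation, chosen arbitrarily for illustration

# "True value" and a proxy that correlates with it at ~rho across the bulk.
true_value = rng.standard_normal(n)
proxy = rho * true_value + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Goodhart-style selection: pick the single point that looks best by the proxy.
best_by_proxy = np.argmax(proxy)
print("proxy score of the selected point:  ", proxy[best_by_proxy])
print("true value of the selected point:   ", true_value[best_by_proxy])
print("best true value actually available: ", true_value.max())
```

As the selection pressure gets stronger, the gap between “best by the proxy” and “best by the true value” tends to grow, which is the worry about an ASI optimizing a slightly-off concept of human flourishing.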
Even if we understood the circuitry underlying the “values” of the AI quite well, that doesn’t automatically let us extrapolate the AI’s values far out of distribution.
Even if we find that, “Yes boss, the human flourishing thing is correctly plugged into the desire thing, it’s a good LLM, sir”, subtle differences in the human flourishing concept could really, really fuck us over as the AGI recursively self-improves into an ASI and optimizes the galaxy.
But if we can use this to make the AI somewhat corrigible, which might be possible (I’m not 100% sure), maybe we could sidestep some of these issues.
Any thoughts about this?
There is a reason that paragraph says

“I claim one reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads”

rather than

“I claim the reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads.”
My claim here is that good mech interp helps you be less confused about outer alignment[1], not that what I’ve sketched here suffices to solve outer alignment.
[1] Outer alignment in the wider sense of ‘the problem of figuring out what target to point the AI at’.
Well, my model is that the primary reason we’re unable to deal with deceptive alignment or goal misgeneralization is that we’re confused, but that the reason we don’t have a solution to Outer Alignment is that it’s just cursed and a hard problem.