Excellent post! This seems like a highly promising and under-explored line of attack. I’ve had some vaguely similar thoughts over the years, but you’ve done a far better job articulating and developing a coherent programme. Bravo!
I think my biggest intuitive disagreement might be over whether it’s likely to be possible to create some sort of efficient ‘abstraction thermometer’ or ‘agency thermometer’. Searching for possible ways of finding agents or abstractions in a system seems like a prototypical NP-hard search problem. In practice it’s often possible to solve such problems efficiently, but the setting with agents seems especially problematic: keeping yourself obfuscated can be instrumentally useful, so I suspect the instances we’re confronted with in the real world may be adversarially selected to be inscrutable to fast search methods in general.
I’m also interested in what goes on the other side of the equation. How are you defining what to search for in the first place? If you point your abstraction detector at an AI and it outputs “this AI has a concept of trees,” how do you gain confidence that the “trees” according to the AI (and according to your abstraction detector) are more or less what you mean by trees?
Some ad-hoc methods spring to mind, but I’m not sure what John would say.
This is my largest concern too: that we might find principled-but-inefficient tools that give guarantees, but be unable to find any efficient approximation that doesn’t lose those guarantees.
However, I do think there are reasons to be cautiously optimistic, conditional on gaining a solid theoretical understanding [just my impressions: confusion entirely possible]:
We get to pick the structure we’re searching over—the only real constraint being that it has to perform competitively. It wouldn’t matter that the ‘thermometers’ were inefficient in 99% of cases, just so long as we were able to find at least one kind of structure combining thermometer-efficiency and performance. If the required [thermometer-friendly] property can be formally specified, it may be possible to incorporate it as a training constraint.
So long as we can use the tools to prevent adversarial situations from arising in the first place, we don’t need to meet the bar of working in the face of super-human adversarial selection (I think it’s a good idea to view getting into that situation as a presumed loss condition).
In principle, greater theoretical understanding may give us more than just ‘thermometers’ - e.g. we might hope to find operators that preserve particular agency-related safety properties. If updates could be applied in terms of such operators, that may reduce the required frequency of slower tests. [the specifics may not look like this, but a solid theoretical understanding would usually be expected to help in avoiding problems in various ways, not only in testing for them]