I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.
These properties also seem sufficient for a treacherous turn (in an unaligned AI).
I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.
The only point on which there is plausible disagreement is “utility-maximizing agents.” On a narrow reading of “utility-maximizing agents” it is not clear why it would be important to getting more powerful performance.
On a broad reading of “utility-maximizing agents” I agree that powerful systems are utility-maximizing. But if we take a broad reading of this property, I don’t agree with the claim that we will be unable to reliably tell that such agents aren’t dangerous without theoretical progress.
In particular, there is an argument of the form “the prospect of a treacherous turn makes any informal analysis unreliable.” I agree that the prospect of a treacherous turn makes some kinds of informal analysis unreliable. But I think it is completely wrong that it makes all informal analysis unreliable, I think that appropriate informal analysis can be sufficient to rule out the prospect of a treacherous turn. (Most likely an analysis that keeps track of what is being optimized, and rules out the prospect that an indicator was competently optimized to manipulate our understanding of the current situation.)
I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.
These properties also seem sufficient for a treacherous turn (in an unaligned AI).
The only point on which there is plausible disagreement is “utility-maximizing agents.” On a narrow reading of “utility-maximizing agents” it is not clear why it would be important to getting more powerful performance.
On a broad reading of “utility-maximizing agents” I agree that powerful systems are utility-maximizing. But if we take a broad reading of this property, I don’t agree with the claim that we will be unable to reliably tell that such agents aren’t dangerous without theoretical progress.
In particular, there is an argument of the form “the prospect of a treacherous turn makes any informal analysis unreliable.” I agree that the prospect of a treacherous turn makes some kinds of informal analysis unreliable. But I think it is completely wrong that it makes all informal analysis unreliable, I think that appropriate informal analysis can be sufficient to rule out the prospect of a treacherous turn. (Most likely an analysis that keeps track of what is being optimized, and rules out the prospect that an indicator was competently optimized to manipulate our understanding of the current situation.)