I guess the default answer would be that this is a problem with (the physical possibility of certain) capabilities, and we are usually only concerned with our Alignment proposal working in the limit of high capabilities. Not (only) because we think these capabilities will actually be achieved, but because any less capable system will a priori be less dangerous: it is far more likely that its capabilities fail in some uninteresting way (unrelated to Alignment), or that the failure affects many other aspects of its performance (rendering it unable to achieve dangerous instrumental goals), than that they fail in just the right way so that most of its potential achievements remain untouched while the goal is relevantly altered. In your example, if our model truly can’t converge with moderate accuracy to the right world model, we’d expect it to lack a clear understanding of the world around it, and so, for instance, to be easily turned off.
That said, it might be worth considering more seriously whether the literal physical impossibility of efficiently predicting the past could make PreDCA slightly more dangerous for super-capable systems.
Thanks for the long answer. I agree that my question is probably somewhat tangential.