There seems to be a generalization of this. I expect there are many properties of an AI system and its environment such that alignment doesn't break when the value of the property changes.
The hard part is figuring out which properties are most useful to pay attention to. Here are a few (a toy sketch of the general idea follows the list):
- Capability (as discussed in the OP)
  - This could be something very specific, like the speed of compute.
- Pseudo-Cartesianness (a system might be effectively Cartesian at a certain level of capability, before it figures out how to circumvent some constraints we put on it)
- Alignment (ideally we would like to detect if the agent becomes misaligned and shut it down)
- Complexity of the system (maybe your alignment scheme rests on being able to understand the system's world model, in which case it might stop working as we move from modeling toy worlds to modeling the real world)
- etc.
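To make the general idea slightly more concrete, here is a minimal toy sketch in Python. Everything in it (`build_agent`, `is_aligned`, the threshold environment) is a hypothetical stand-in rather than a proposal for an actual alignment test; the point is just the shape of the question: pick one property, sweep its value, and record where a chosen alignment check starts failing.

```python
# Toy sketch: sweep one property of an (agent, environment) pair and record
# where a chosen alignment check starts to fail. All names here are
# hypothetical scaffolding, not a real alignment test.

def build_agent(capability: float):
    """Stand-in for constructing a system at a given capability level."""
    return {"capability": capability}

def is_aligned(agent, environment) -> bool:
    """Stand-in for whatever alignment criterion or detector you trust.
    Here we pretend alignment holds only below some capability threshold."""
    return agent["capability"] < environment["threshold"]

def robustness_profile(property_values, environment):
    """For each property value, record whether the alignment check passed."""
    return {v: is_aligned(build_agent(v), environment) for v in property_values}

if __name__ == "__main__":
    env = {"threshold": 3.0}
    profile = robustness_profile([0.5, 1.0, 2.0, 4.0, 8.0], env)
    for value, ok in profile.items():
        print(f"capability={value}: {'aligned' if ok else 'alignment check failed'}")
```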
I think it can also be useful to consider the case where alignment breaks as you decrease capabilities. For example, you might try to construct a minimal set of assumptions under which you would know how to solve alignment. One such assumption might be having an arbitrary amount of compute and memory available, able to execute any halting program arbitrarily fast. Removing this assumption might break alignment. It's pretty easy to see how alignment could break in this particular case, but it seems useful to have the concept of the generalized version.
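As a hedged illustration of that last example (again using made-up stand-ins like `true_utility` and `bounded_planner`), here is a toy sketch of how an argument that relies on exhaustively evaluating every option under unlimited compute stops giving any guarantee once the evaluation budget is bounded.

```python
import random

# Toy sketch of how a guarantee that relies on unlimited compute can break
# once compute is bounded. "true_utility" stands in for whatever the
# alignment argument says the agent should be optimizing.

ACTIONS = list(range(1000))

def true_utility(action: int) -> float:
    return -abs(action - 500)  # the intended optimum is action 500

def idealized_planner() -> int:
    """Under the unlimited-compute assumption the agent can evaluate every
    action exactly, so "it picks the argmax of true_utility" is a real guarantee."""
    return max(ACTIONS, key=true_utility)

def bounded_planner(budget: int) -> int:
    """With a compute budget, only a random subset of actions gets evaluated.
    The argmax guarantee, and any alignment argument resting on it, no longer holds."""
    return max(random.sample(ACTIONS, budget), key=true_utility)

if __name__ == "__main__":
    random.seed(0)
    print("idealized choice:", idealized_planner())         # 500
    print("bounded choice:  ", bounded_planner(budget=10))  # usually not 500
```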