Thanks for trying to make the issue more concrete and provide a way to discuss it!
One thing I want to point out is that you don’t really need to put the non-constrained variables in the worst possible state; you just have the freedom to set them to whatever helps you and is not too hard to reach.
Using sets, you have a set of worlds you want, and a proxy that is a superset of it (because you’re not able to aim exactly at what you want). The problem is that the AI is optimizing to get into the superset with high guarantees and stay there, so it’s probably aiming for the part of the superset that is easiest to reach and stay in (subject to the accessibility constraints that you mention). This is what should lead to instrumental convergence, and it is the real issue with proxies IMO.
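To make the set picture concrete, here is a toy sketch. Everything in it is made up for illustration: the worlds are pairs `(x, y)`, the proxy only constrains `x`, and the hypothetical `cost` function stands in for "how hard is it to reach and stay in this world".

```python
from itertools import product

# Toy worlds: (x, y). The proxy constrains x; y is left free.
worlds = list(product(range(3), range(3)))

# Hypothetical sets: the worlds we actually want, and the proxy we can specify.
wanted = {(x, y) for (x, y) in worlds if x == 2 and y == 2}
proxy = {(x, y) for (x, y) in worlds if x == 2}

# The proxy is a superset of what we want.
assert wanted <= proxy


def cost(world):
    """Assumed difficulty of reaching and staying in a world.

    Here, pushing the unconstrained variable y up is what costs effort.
    """
    _, y = world
    return y


# An optimizer that only cares about the proxy goes for its cheapest element.
target = min(sorted(proxy), key=cost)
print(target, target in wanted)
```

Note that the optimizer does not push `y` to its worst value on purpose; it just sets `y` to whatever is cheapest, and under this assumed cost that lands outside the wanted set.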
It doesn’t seem obvious to me how this race will go by default; in fact, the likely trajectories seem to depend on lots of empirical facts about the world that I don’t have strong views on.
Let me propose another framing: there are fewer possible worlds in which the curves are “nice”. The good case is more specific and more constrained, and thus there are more ways things can go wrong. This doesn’t mean things will definitely go wrong, or that there’s no argument that could convince us that the situation will be good by default. It just means that the burden of proof is on showing that the good but less numerous worlds are somehow privileged by Reality.