How would the AI do something like this if it ditched the idea that there existed some perfect U*?
Assuming the existence of things that turn out not to exist does weird things to a decision-making process. In extreme cases, it starts “believing in magic” and throwing away all hope of good outcomes in the real world in exchange for the tiniest advantage in the case that magic exists.
I attempted to briefly sketch this out in the post, without going into a lot of detail in the hope of not overly complicating the argument. If U* isn’t well defined, say because there isn’t a single unambiguously well-defined limiting state as all capabilities involved are increased while keeping the purpose the same, then of course the concept of ‘full alignment’ also isn’t well defined. Then the question becomes “Is U’ clearly and unambiguously better aligned than U, i.e. will switching to it clearly make my decision-making more optimal?” So long as there is locally a well-defined “direction of optimization flow” that leads to a more compact and more optimal region in the space of all possible U, the AI can become better aligned, and there can be a basin of attraction towards better alignment. Once we get well enough aligned that the ambiguities matter for selecting a direction of further progress, they need to be resolved somehow before we can make further progress.
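One way to make that comparison criterion a bit more concrete (this is my own framing, not something established in the post: assume some metric d on the space of possible utility functions and a set of candidate limiting states U*_1, …, U*_n) would be:

$$U' \succ U \;\iff\; d(U', U^*_i) < d(U, U^*_i) \text{ for every candidate limit point } U^*_i.$$

Note this is only a partial order: far from the candidates there is typically a common improving direction, but close to them the condition can become unsatisfiable, which is exactly the parting of the ways sketched in the next paragraph.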
To pick a simple illustrative example, suppose there were just two similar-but-not-identical limiting cases U*_A and U*_B, so two similar-but-not-identical ways to be “fully aligned”. Then as long as U is far enough away from both of them that U’ can be closer to both U*_A and U*_B than U is, the direction of better alignment and the concept of a single basin of attraction still make sense, and we don’t need to decide between the two destinations to be able to make forward progress. Only once we get close enough to them that their directions are significantly different does the general case become that U’ is either closer to U*_A but further from U*_B, or else closer to U*_B but further from U*_A. Now we are at a parting of the ways, and we need to make a decision about which way to go before we can make more progress. At that point we no longer have a single basin of attraction moving us closer to both of them; we have a choice of whether to enter the basin of attraction of U*_A or of U*_B, which from here on are distinct. So at that point the STEM research project would have to be supplemented in some way by a determination as to which of U*_A or U*_B should be preferred, or whether they’re just equally good alternatives. This could well be a computationally hard determination.
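As a purely illustrative numerical sketch of this geometry (my own toy model, not anything from the post): treat the space of possible U as a 2-D plane, place two hypothetical limit points A and B close together, and count a step from U to U’ as an unambiguous improvement only if it reduces the distance to both. All the names and numbers below are made up for illustration.

```python
# Toy sketch (assumptions: a 2-D "space of utility functions" with Euclidean
# distance, and two made-up limit points A and B standing in for U*_A and U*_B).
# A step counts as unambiguously better aligned only if it is closer to BOTH.
import numpy as np

A = np.array([1.0, 0.1])   # hypothetical limit point U*_A
B = np.array([1.0, -0.1])  # hypothetical limit point U*_B

def pareto_step_exists(U, step=0.05, n_dirs=360):
    """Return True if some step of the given size moves U closer to both A and B."""
    dA, dB = np.linalg.norm(U - A), np.linalg.norm(U - B)
    for theta in np.linspace(0, 2 * np.pi, n_dirs, endpoint=False):
        U2 = U + step * np.array([np.cos(theta), np.sin(theta)])
        if np.linalg.norm(U2 - A) < dA and np.linalg.norm(U2 - B) < dB:
            return True
    return False

for U in [np.array([-2.0, 0.5]),   # far from both targets
          np.array([0.5, 0.0]),    # closer, but still well away from both
          np.array([1.0, 0.0])]:   # exactly between the two targets
    print(U, "unambiguous improvement possible:", pareto_step_exists(U))
```

Far from the two targets the script finds improving directions easily, so there is effectively one basin of attraction; at the point exactly between them it finds none, which is where a choice between U*_A and U*_B has to be made before any further progress.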
In real life, this is a pretty common situation: it’s entirely possible to make technological progress on a technology without knowing exactly what its final end state will be, and along the way we often make decisions (based on what seems best at the time) that end up channeling the direction of future technological progress towards a specific outcome. Occasionally we even figure out later that we made a poor decision, backtrack, and try another fork on the tech tree.