I think of this as a corrigible agent for decision-theory purposes (it doesn’t match the meaning that’s more centrally about alignment): an agent that doesn’t know its own goals, but instead looks for them in the world. Literally, an agent like this is not an expected utility maximizer; it can’t do the utility-maximization cognition inside its head. Only the world as a whole could be considered an expected utility maximizer, and only if the agent eventually gets its manipulators on enough goal content to start doing actual expected utility maximization.
F: <Current World State> → (U: <Future World State>* → number)
I don’t understand such agents. What is their decision rule? How do they use the F that they know to make decisions? Depending on that, these might still be maximizers of something else, and a result suggesting that possibly they aren’t would be interesting.
correct some minor errors in F
The possibility of correcting mistakes in F is interesting; it suggests trying to consider proxy everything, possibly even a proxy algorithm. This fits well with how the goodhart boundary is possibly a robustness threshold, indicating where a model extrapolates its trained behavior correctly, where it certainly shouldn’t yet run the risk of undergoing the phase transition of deceptive alignment (suddenly and systematically changing behavior somewhere off the training distribution).
After all, an algorithm is a coarse-grained description of a model’s behavior, and if that behavior can be incorrect, then the actual behavior is proxy behavior, described by a proxy algorithm. We could then ask how robust the proxy algorithm (as given by a model) is to certain inputs (observations) it might encounter, and indicate the goodhart boundary where the algorithm risks starting to act very incorrectly, as well as point to central examples of the concept of correct/aligned behavior (which the model is intended to capture), that is, situations/inputs/observations where the proxy algorithm is doing fine.
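To make the robustness-threshold picture slightly more concrete, here is a toy sketch (everything in it, from the distance-based criterion to the names, is an assumption of mine for illustration, not something from the discussion above): treat the goodhart boundary as a threshold on how far an observation sits from the central examples the proxy algorithm handles well, and only trust the proxy algorithm inside that boundary.

```python
def within_goodhart_boundary(observation, central_examples, distance, threshold):
    """Toy robustness check (assumed criterion): trust the proxy algorithm only
    on observations close to central examples of the intended behavior."""
    return min(distance(observation, ex) for ex in central_examples) <= threshold


def act(observation, proxy_algorithm, fallback, central_examples, distance, threshold):
    """Defer to a conservative fallback once the observation is outside the
    boundary, where the proxy algorithm risks starting to act very incorrectly."""
    if within_goodhart_boundary(observation, central_examples, distance, threshold):
        return proxy_algorithm(observation)  # extrapolation presumed fine here
    return fallback(observation)             # off-distribution: don't trust the proxy
```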
What is their decision rule? How do they use F that they know to make decisions?
When choosing between decisions, choose the one that maximizes the expected value of the number, given its current F and its current uncertainty about the current world state. Note that I prefer not to say it maximizes the number, since it wouldn’t, for instance, change F in a way that would increase the number returned: that decision doesn’t yield a higher number under its current F.
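A minimal sketch of that decision rule, under the type F: &lt;Current World State&gt; → (U: &lt;Future World State&gt;* → number) quoted above (the sampled belief over the current state, the transition model, and the single-future-state simplification are all my own stand-ins for illustration):

```python
def choose(actions, current_state_samples, transition, F):
    """Pick the action with the highest expected value of the number returned by
    the agent's *current* F, under its current uncertainty about the current
    world state (represented here as a list of sampled states)."""
    def expected_value(action):
        total = 0.0
        for state in current_state_samples:
            U = F(state)                          # goal content read off the (believed) world
            future_state = transition(state, action)
            total += U(future_state)              # simplification: score one future state, not a sequence
        return total / len(current_state_samples)

    # Actions that would rewrite F are still scored by the *current* F,
    # so "change F to make it return bigger numbers" is not favored by this rule.
    return max(actions, key=expected_value)
```

Representing the agent's uncertainty as a flat list of sampled states is only a placeholder for whatever belief distribution it actually maintains.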