I strongly agree with your first paragraph. However, whether I agree or disagree with the first sentence of your second paragraph depends on what you mean by “expected utility maximization”.
This reply comment does not relate to the rest of your comment.
If you maximize U: <World State> → number, for any fixed U this almost certainly leads to doom for the reasons you give.
But suppose you instead define F: <Current World State> → (U: <Future World State>* → number), where F specifies how to determine a utility function U from, e.g., what human values are in the current world state. The AI then chooses the future world state according to the number it expects to be returned by the hypothetical utility function that F outputs when given the unknown actual current world state, using the AI’s uncertain knowledge of that state. I think this might not lead to doom, since the AI will correct U, and may correct some minor errors in F (provided actual human values are such that the AI should correct mistakes in F, and F is sufficiently close to correct that the improperly determined human values retain this property).
* I actually prefer actions/decisions here rather than future world state.
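To make the two signatures concrete, here is a minimal Python sketch of the distinction; all of the type aliases are hypothetical stand-ins, and actions are used in place of future world states per the footnote.

```python
# Minimal sketch of the two signatures being contrasted; all names are
# hypothetical stand-ins, with actions used per the footnote above.
from typing import Callable

WorldState = str   # placeholder for a full description of a world state
Action = str       # placeholder for an action/decision

# Fixed utility function: U : <World State> -> number
U = Callable[[WorldState], float]

# F : <Current World State> -> (U : <Action> -> number), i.e. F reads off
# (e.g.) human values from the current world state and returns the utility
# function that those values determine.
F = Callable[[WorldState], Callable[[Action], float]]
```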
I think of this as a corrigible agent for decision theory purposes (it doesn’t match the meaning that’s more centrally about alignment): an agent that doesn’t know its own goals, but instead looks for them in the world. Literally, an agent like this is not an expected utility maximizer; it can’t do the utility-maximization cognition inside its head. Only the world as a whole could be considered an expected utility maximizer, if the agent eventually gets its manipulators on enough goal content to start doing actual expected utility maximization.
F: <Current World State> → (U: <Future World State>* → number)
I don’t understand such agents. What is their decision rule? How do they use the F that they know to make decisions? Depending on that, these might still be maximizers of something else, and a result suggesting that they possibly aren’t would be interesting.
correct some minor errors in F
The possibility of correcting mistakes in F is interesting; it suggests trying to consider proxy everything, possibly even a proxy algorithm. This fits well with how the goodhart boundary is possibly a robustness threshold, indicating where a model extrapolates its trained behavior correctly, where it certainly shouldn’t yet run the risk of undergoing the phase transition of deceptive alignment (suddenly and systematically changing behavior somewhere off the training distribution).
After all, an algorithm is a coarse-grained description of a model’s behavior, and if that behavior can be incorrect, then the actual behavior is proxy behavior, described by a proxy algorithm. We could then ask how robust the proxy algorithm (as given by a model) is to certain inputs (observations) it might encounter, indicate the goodhart boundary where the algorithm risks starting to act very incorrectly, and point to central examples of the concept of correct/aligned behavior (which the model is intended to capture): situations/inputs/observations where the proxy algorithm is doing fine.
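A very loose toy sketch of this framing, entirely hypothetical (the distance measure, radius, and set of central examples are arbitrary stand-ins, not a proposal for how to actually locate the boundary):

```python
# Loose, hypothetical sketch: treat the goodhart boundary as a robustness
# threshold by flagging inputs that are far from every central example of
# correct/aligned behavior. Distance measure and radius are stand-ins.
from typing import List

def inside_goodhart_boundary(x: List[float],
                             central_examples: List[List[float]],
                             radius: float) -> bool:
    """Is input x close to some situation where the proxy algorithm does fine?"""
    def dist(a: List[float], b: List[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    return any(dist(x, c) <= radius for c in central_examples)

# Outside the boundary, the proxy algorithm risks acting very incorrectly,
# so its behavior shouldn't be trusted or extrapolated there.
```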
What is their decision rule? How do they use F that they know to make decisions?
When choosing among decisions, the agent picks the one that maximizes the expected value of the number under its current F and its current uncertainty about the current world state. Note that I prefer not to say it maximizes the number, since it wouldn’t, for instance, change F in a way that would increase the number returned: that decision doesn’t return a higher number under its current F.
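As a rough sketch of that rule (again with hypothetical stand-in types, and with actions in place of future world states):

```python
# Rough sketch of the decision rule described above: maximize the expected
# value of the number under the agent's *current* F and its current
# uncertainty over the current world state. F is held fixed inside the
# maximization, so an action that would rewrite F into something easier to
# score gets no credit from this rule.
from typing import Callable, Dict, List

WorldState = str
Action = str
F = Callable[[WorldState], Callable[[Action], float]]

def choose(f: F, belief: Dict[WorldState, float], actions: List[Action]) -> Action:
    """belief maps candidate current world states to the agent's probabilities."""
    def expected_value(action: Action) -> float:
        # E_w[ F(w)(action) ]: the number returned by the utility function that
        # F would output in world state w, averaged over the agent's belief.
        return sum(p * f(w)(action) for w, p in belief.items())
    return max(actions, key=expected_value)
```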