Human values are eventually the only important thing, but don’t help with the immediate issue of goodharting. Doing expected utility maximization with any proxy of humanity’s values, no matter how implausibly well-selected this proxy is, is still misaligned. Even if in principle there exists a goal such that maximizing towards it is not misaligned, this goal can’t be quickly, or possibly ever, found.
So for practical purposes, any expected utility maximization is always catastrophically misaligned, and there is no point in looking into supplying correct goals for it. This applies more generally to other ways of being a mature agent that knows what it wants, as opposed to being actively confused and trying not to break things in the meantime by staying within the goodhart boundary.
I think encountering strong optimization in this sense is unlikely, as AGIs are going to have mostly opaque values, in a way similar to how humans do (unless a very clever alignment project makes it not be so, and then we’re goodharted). So they would also be wary of goodharting their own goals and only pursue mild optimization. This makes what AGIs do determined by the process of extrapolating their values from the complicated initial pointers to value they embody at the time. And processes of value extrapolation from an initial state vaguely inspired by human culture might lead to outcomes with convergent regularities that mitigate the relative arbitrariness of the initial content of those pointers to value.
These convergent regularities in values arrived at by extrapolation are generic values. If values are mostly generic, then the alignment problem solves itself (so long as a clever alignment project doesn’t build a paperclip maximizer that knows what it wants and doesn’t need the extrapolation process). I think this is unlikely. If mere sympathy/compassion towards existing people (such as humans) is one of the generic values, then humanity survives, but loses the cosmic endowment. This seems more plausible, but far from assured.
I strongly agree with your first paragraph.
Whether I agree or disagree with the first sentence of your second paragraph, however, depends on what you mean by “expected utility maximization”.
This reply does not address the rest of your comment.
If you maximize U: <World State> → number, for any fixed U this almost certainly leads to doom for the reasons you give.
But suppose instead you define F: <Current World State> → (U: <Future World State>* → number), where F specifies how to determine a utility function U from, e.g., what human values are in the current world state. The AI then chooses the future world state according to the number it expects would be returned by the utility function that F outputs when fed the unknown actual current world state, with the expectation taken over the AI’s uncertain knowledge of that current world state (a formalization sketch follows the footnote below). I think this might not lead to doom, since the AI will correct U, and may correct some minor errors in F (provided actual human values are such that the AI should correct mistakes in F, and F is sufficiently close to correct that the improperly determined human values retain this property).
* I actually prefer actions/decisions here, rather than future world states.
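As a minimal formalization sketch of this proposal (the notation here is mine, not from the thread): write b for the AI’s belief distribution over the unknown actual current world state w, and let the choice variable be a candidate future world state s (or an action, per the footnote above). The rule is then roughly

\[
s^{*} \;=\; \operatorname*{arg\,max}_{s} \; \mathbb{E}_{w \sim b}\big[\, F(w)(s) \big],
\]

i.e. the utility function being applied, F(w), is itself uncertain because it depends on which world the AI is actually in, and only its expectation under b enters the choice. The fixed-U case above is the special case where F(w) is the same U for every w.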
I think of this as a corrigible agent for decision theory purposes (it doesn’t match the meaning that’s more centrally about alignment): an agent that doesn’t know its own goals, but instead looks for them in the world. Literally, an agent like this is not an expected utility maximizer; it can’t do the utility-maximization cognition inside its head. Only the world as a whole could be considered an expected utility maximizer, if the agent eventually gets its manipulators on enough goal content to start doing actual expected utility maximization.
F: <Current World State> → (U: <Future World State>* → number)
I don’t understand such agents. What is their decision rule? How do they use F that they know to make decisions? Depending on that, these might still be maximizers of something else, and a result suggesting that possibly they aren’t would be interesting.
correct some minor errors in F
The possibility of correcting mistakes in F is interesting; it suggests treating everything as a proxy, possibly even the algorithm itself (a proxy algorithm). This fits well with how the goodhart boundary is possibly a robustness threshold, indicating where a model extrapolates its trained behavior correctly and where it certainly shouldn’t yet run the risk of undergoing the phase transition of deceptive alignment (suddenly and systematically changing behavior somewhere off the training distribution).
After all, an algorithm is a coarse-grained description of a model’s behavior, and if that behavior can be incorrect, then the actual behavior is proxy behavior, described by a proxy algorithm. We could then ask how robust the proxy algorithm (as given by a model) is to certain inputs (observations) it might encounter, indicate the goodhart boundary where the algorithm risks starting to act very incorrectly, and point to central examples of the concept of correct/aligned behavior (which the model is intended to capture): situations/inputs/observations where the proxy algorithm is doing fine.
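A toy illustration of that picture, entirely my own construction (the distance measure, radius, and numbers below are stand-ins, not a proposal for how to actually locate the boundary): the proxy algorithm is trusted only on observations close to the central examples, and the agent declines to act outside that region.

```python
import math

# Central examples of the intended concept of correct/aligned behavior:
# observations where the proxy algorithm is known to be doing fine.
central_examples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3)]

# Stand-in for the goodhart boundary: a crude radius around the central examples.
BOUNDARY_RADIUS = 0.5

def proxy_algorithm(observation):
    # Stand-in for the model's trained behavior.
    return sum(observation)

def within_goodhart_boundary(observation):
    # Crude robustness check: is the observation close to some central example?
    return any(math.dist(observation, ex) < BOUNDARY_RADIUS for ex in central_examples)

def act(observation):
    if within_goodhart_boundary(observation):
        return proxy_algorithm(observation)  # trained behavior is trusted here
    return None  # decline to act where the proxy algorithm may extrapolate incorrectly

print(act((0.25, 0.0)))  # close to a central example: the proxy is used, prints 0.25
print(act((5.0, 5.0)))   # far outside the boundary: the agent defers, prints None
```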
What is their decision rule? How do they use F that they know to make decisions?
When making a decision, the agent chooses the option that maximizes the expected value of the number, given its current F and its current uncertainty about the current world state. Note that I prefer not to say it maximizes the number, since it wouldn’t, for instance, change F in a way that would increase the number returned: under its current F, that decision doesn’t score higher.
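Here is a toy sketch of that rule, with made-up worlds, decisions, and F (my own illustration, not from the thread): each candidate decision is scored by the expected value, over the agent’s belief about the unknown current world state w, of the number F(w) assigns to it, and the “rewrite F” decision is evaluated under the current F like any other.

```python
# Belief over which world the agent is actually in.
belief = {"world_A": 0.7, "world_B": 0.3}

# F maps a hypothesized current world state to a utility function over decisions.
def F(world):
    if world == "world_A":
        return lambda decision: {"help_humans": 10, "rewrite_F": 3}.get(decision, 0)
    else:  # world_B
        return lambda decision: {"help_humans": 4, "rewrite_F": 6}.get(decision, 0)

def choose(decisions, belief, F):
    # Score each decision by the expected value of F(w)(decision) over uncertainty about w.
    def score(d):
        return sum(p * F(w)(d) for w, p in belief.items())
    return max(decisions, key=score)

print(choose(["help_humans", "rewrite_F"], belief, F))
# Prints "help_humans": even the decision to rewrite F into something that would
# return bigger numbers is scored by the *current* F, so it only wins if the
# current F itself assigns it a high number.
```

Scoring decisions directly, rather than future world states, follows the earlier footnote.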