In your proposal, I think the mistake model is “everything that humans do could be mistaken, but when they talk about how they want their life to go, they are not mistaken about that”.
Ok, this is helpful for making a connection between my way of thinking and the “mistake model” way, but it seems a bit of a stretch, since I almost certainly am mistaken (or suboptimal) about how I want my life to go. I only want autonomy and freedom from manipulation because I don’t know how to let an AI manipulate me (i.e., make arguments to me about my values) in a way that would be safe and lead to good results. If I did, I might well let the AI do that and save myself the trouble and risk of trying to figure out my values on my own.
Yeah, I agree that the mistake model implied by your proposal isn’t correct, and as a result you would not infer the true utility function. Of course, you might still infer one that is sufficiently close that we get a great future.
Tbc, I do think there are lots of other useful ways of thinking about the problem that are not captured by the “mistake model” way of thinking. I use the “mistake model” framing because it often shows a different perspective on a proposal, and helps pinpoint what you’re relying on in your alignment proposal.
Of course this is all assuming that there does exist a true utility function, but I think we can replace “true utility function” with “utility function that encodes the optimal actions to take for the best possible universe” and everything still follows through. But of course, not hitting this target just means that we don’t do the perfectly optimal thing—it’s totally possible that we end up doing something that is only very slightly suboptimal.
Of course this is all assuming that there does exist a true utility function, but I think we can replace “true utility function” with “utility function that encodes the optimal actions to take for the best possible universe” and everything still follows through.
The replacement feels just as obscure to me as the original.
What do you mean by “obscure”?

People often argue “there is no true utility function for humans” because we often do contradictory things, which implies that we violate the VNM axioms. However, in theory you could look at all action sequences, rank them, take the best one, and find a utility function for which that action sequence is optimal; you could call that the utility function that you want. That utility function exists as long as you agree that an ordering over action sequences exists, which seems very reasonable.
TL;DR the point of that reframing is to overcome objections that the “true utility function” doesn’t exist.
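A minimal sketch of that construction, assuming a finite set of candidate action sequences and an already-given ranking over them (the sequences and names below are hypothetical, purely for illustration):

```python
# Illustrative sketch only: given a ranking over a finite set of action
# sequences, build a utility function under which the top-ranked sequence
# is exactly the optimal one.
from typing import Dict, List, Tuple

ActionSequence = Tuple[str, ...]

def utility_from_ranking(ranked: List[ActionSequence]) -> Dict[ActionSequence, float]:
    """Assign strictly higher utility to better-ranked sequences (index 0 = best)."""
    n = len(ranked)
    return {seq: float(n - i) for i, seq in enumerate(ranked)}

# Hypothetical toy ranking, best to worst, produced by some unspecified
# judgment process (which is where all the hard work actually lives).
ranked_sequences: List[ActionSequence] = [
    ("reflect", "deliberate", "act"),
    ("act", "deliberate"),
    ("act",),
]

utility = utility_from_ranking(ranked_sequences)
best = max(utility, key=utility.get)
assert best == ranked_sequences[0]  # the top-ranked sequence maximizes this utility
```

The construction only needs the ordering to exist; everything hard is hidden in producing the ranking in the first place.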
Thanks! I think I understand the intent of the rephrasing now.
What I meant by “obscure” is that both “true utility function” and “utility function that encodes the optimal actions to take for the best possible universe” have normative terminology in them that I don’t know how to reduce or operationalize.
For instance, imagine I am looking at action sequences and ranking them. Presumably large portions of that process would feel like difficult judgment calls where I’d feel nervous about still making some kind of mistake. Both your phrasings (to my ears) carry the connotation that there is a “best” mistake model, one which is in a relevant sense independent from our own judgment, and that we can learn things that will make us more and more confident that we’re probably not making mistakes anymore because of progress in finding the correct way of thinking about our values. That’s the part that feels obscure to me, because I think we’ll always be in this unsatisfying epistemic situation where we’re nervous about making some kind of mistake by the lights of a standard that we cannot properly describe.
I do get the intuition for thinking in these terms, though. It feels conceivable that another discovery could improve our thinking the way the discovery of cognitive biases did, and I definitely agree that we want a concept for staying open to this possibility. I’m just pointing out that non-operationalized normative concepts seem obscure. (Though maybe that’s fine if we’re treating them in the same way Yudkowsky treats “magic reality fluid” – as a placeholder for whatever comes once we’re less confused about “measure”.)
What I meant by “obscure” is that both “true utility function” and “utility function that encodes the optimal actions to take for the best possible universe” have normative terminology in them that I don’t know how to reduce or operationalize.
Oh yeah, I was definitely speaking normatively there.
For instance, imagine I am looking at action sequences and ranking them. Presumably large portions of that process would feel like difficult judgment calls where I’d feel nervous about still making some kind of mistake.
Agreed, I’m just saying that in principle there exists some “best” way of making those calls.
Both your phrasings (to my ears) carry the connotation that there is a “best” mistake model, one which is in a relevant sense independent from our own judgment
Agreed that I’m assuming that there is a “best” mistake model; I wouldn’t say that it has to be independent from our own judgment.