Questions:

When you condition on $\pi^*_u$, do you expect every other agent to also implement an optimal policy for $u$, or do they keep doing what they’re doing?
Is $\pi^*_u$ a humanly realistic policy, or an unboundedly optimal policy? For example, conditional on $\pi^*_u$, should I expect to quickly reduce all ($u$-relevant) x-risks to 0?
In the future we’re likely to have much better knowledge about our universe and about logical facts that go into the expected utility computation. Do we keep redoing this normalization as time goes on, or fix it to the current time, or maybe do this normalization while pretending to know less than we actually do?
The way I’m imagining it:

Generally, consider all other agents to do what they would have done anyway. This agent follows some $\pi$ with probability $1-2\epsilon$, and follows $\pi^*_u$ and $\pi^*_{-u}$ with probability $\epsilon$ each. The conditionals condition on one of the two unlikely policies being chosen.
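In symbols (a minimal sketch in my own notation; the mixture policy $\pi_{\text{mix}}$ is just shorthand for the setup above, not something defined in the post), the agent’s overall behaviour is

\[
\pi_{\text{mix}} \;=\; (1-2\epsilon)\,\pi \;+\; \epsilon\,\pi^*_u \;+\; \epsilon\,\pi^*_{-u},
\]

and the normalisation uses the expectations on the two unlikely branches, $E(u \mid \pi^*_u)$ and $E(u \mid \pi^*_{-u})$, with all other agents’ policies held fixed in both conditionals.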
I think either works for normalisation purposes, so I’d assume human-realistic. EDIT: “Either works” is wrong (see my next answer); we should use “human-realistic”.
The normalisation process is completely time-inconsistent, and so is done once, at a specific time, and not repeated.
Do you have ideas on how the normalisation process can be improved? Because it’s very much a “better than all the alternatives I know” at the moment.
I think either works for normalisation purposes, so I’d assume human-realistic.
But they lead to very different normalization outcomes, don’t they? Say $u$ represents total hedonic utilitarianism. If $\pi^*_u$ (and $\pi^*_{-u}$) are unboundedly optimal, then conditional on that, I’d take over the universe and convert everything into hedonium (respectively dolorium). But if $\pi^*_u$ is just human-realistic, then the difference between $E(u \mid \pi^*_u)$ and $E(u \mid \pi^*_{-u})$ is much smaller. (In one scenario, I get a 1/8-billionth share of the universe and turn that into hedonium/dolorium, so the ratio between human-realistic and unboundedly optimal is 8 billion.) On the other hand, if $u$ has strongly diminishing marginal utilities, then taking over the universe isn’t such a huge improvement over a human-realistic policy. The ratio between human-realistic and unboundedly optimal might be only, say, 2 or 100 for this $u$. So this leads to different ways to normalize the two utility functions depending on “human-realistic” or “unboundedly optimal”.
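To put rough numbers on that (a toy calculation using the figures above; writing $S_{\text{HR}}$ and $S_{\text{UB}}$ for the human-realistic and unboundedly-optimal spreads is my own notation): if each utility is divided by its spread, then switching conventions rescales total hedonic utilitarianism $u$ by a factor of about $8 \times 10^{9}$ but a diminishing-returns utility $v$ by only a factor of 2 to 100, so $u$’s weight relative to $v$ changes by roughly

\[
\frac{S_{\text{UB}}(v)/S_{\text{HR}}(v)}{S_{\text{UB}}(u)/S_{\text{HR}}(u)} \;\approx\; \frac{2\text{ to }100}{8\times 10^{9}} \;\approx\; 10^{-10}\text{ to }10^{-8},
\]

i.e. the choice of convention swings the aggregation by many orders of magnitude.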
For human-realistic, there’s also a question of: realistic for whom? For the person whose values we’re trying to aggregate? For a typical human? For the most capable human who currently exists? Each of these leads to different weights amongst the utility functions.
The normalisation process is completely time-inconsistent, and so is done once, at a specific time, and not repeated.
Since this means we’ll almost certainly regret doing this, it strongly suggests that something is wrong with the idea.
Do you have ideas on how the normalisation process can be improved? Because it’s very much a “better than all the alternatives I know” at the moment.
Maybe normalization won’t be needed if we eventually just figure out what our true/normative values are. I think in the meantime the ideal solution would be keeping our options open rather than committing to a specific process. Perhaps you could argue for considering this idea as a “second best” option (i.e., if we’re forced to pick something due to time/competitive pressure), in which case I think it would be good to state that clearly.
But they lead to very different normalization outcomes, don’t they?
Apologies, I was wrong in my answer. The normalisation is “human-realistic”, in that the agent is estimating “the best they themselves could do” vs “the worst they themselves could do”.
Since this means we’ll almost certainly regret doing this, it strongly suggests that something is wrong with the idea.
This is an inevitable feature of any normalisation process that depends on the difference in future expected values. Suppose $u$ is a utility that can become 1 or 0 within the next day; after that, any action or observation will only increase or decrease $u$ by at most $10^{-10}$. The utility $v$, in contrast, is 0 unless the human does the same action every day for ten years, when it will become 1. The normalisation of $u$ and $v$ will be very different depending on whether you normalise now or in two days’ time.
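Concretely (a back-of-the-envelope reading of that example; the exact spreads depend on what the human can realistically achieve): normalising today, the best and worst policies for $u$ differ by roughly 1 in expected $u$, while in two days, once $u$ has resolved, each remaining action shifts $u$ by at most $10^{-10}$; the spread of $v$ is essentially the same at both times. Schematically,

\[
\text{spread}_{\text{now}}(u) \;\approx\; 1,
\qquad
\text{spread}_{+2\text{ days}}(u) \;\ll\; 1,
\qquad
\text{spread}_{+2\text{ days}}(v) \;\approx\; \text{spread}_{\text{now}}(v),
\]

so the relative weight of $u$ collapses by many orders of magnitude just from waiting two days.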
You might say that there’s an idealised time in the past where we should normalise it (a sort of veil of ignorance), but that just involves picking a time, or a counterfactual time.
Lastly, “regret” doesn’t quite mean the same thing as usual, since this is regret over the relative weights of preferences which we ourselves hold.
Now, there is another, maybe more natural, way of normalising things: cash out the utilities as examples, and see how intense our approval/disapproval of these examples is. But that approach doesn’t allow us to overcome, e.g., scope insensitivity.
if we eventually just figure out what our true/normative values are.
I am entirely convinced that there are no such things. There are maps from {lists of assumptions + human behaviour + elements of the human internal process} to sets of values, but different assumptions will give different values, and we have no principled way to distinguish between them, except for using our own contradictory and underdefined meta-preferences.
The normalisation is “human-realistic”, in that the agent is estimating “the best they themselves could do” vs “the worst they themselves could do”.
But this means the normalization depends on how capable the human is, which seems strange, especially in the context of AI. In other words, it doesn’t make sense that an AI would obtain different values from two otherwise identical humans who differ only in how capable they are.
I am entirely convinced that there are no such things.
In a previous post, you didn’t seem this certain about moral anti-realism:
Even if the moral realists are right, and there is a true R, thinking about it is still misleading. Because there is, as yet, no satisfactory definition of this true R, and it’s very hard to make something converge better onto something you haven’t defined. Shifting the focus from the unknown (and maybe unknowable, or maybe even non-existent) R, to the actual P, is important.
Did you move further in the anti-realist direction since then? If so, why?
There are maps from {lists of assumptions + human behaviour + elements of the human internal process} to sets of values, but different assumptions will give different values, and we have no principled way to distinguish between them, except for using our own contradictory and underdefined meta-preferences.
I agree this is the situation today, but I don’t see how we can be so sure that it won’t get better in the future. Philosophical progress is a thing, right?
But this means the normalization depends on how capable the human is, which seems strange, especially in the context of AI.
The min-max normalisation is supposed to measure how much a particular utility function “values” the human moving from being a $u$-antagonist to a $u$-maximiser. The full impact of that change is included; so if the human is about to program an AI, the effect is huge. You might see it as the AI asking “utility $u$: maximise, yes or no?”, and the spread between “yes” and “no” is normalised.
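Written out (this is just the standard min-max formula, on the assumption that the spread between those two conditionals is used as the normalising constant):

\[
\hat u \;=\; \frac{u - E(u \mid \pi^*_{-u})}{E(u \mid \pi^*_u) - E(u \mid \pi^*_{-u})},
\]

so that $\hat u$ has expected value 0 if the human is a $u$-antagonist and 1 if they are a $u$-maximiser, and the denominator is exactly the “yes”/“no” spread being normalised.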
Did you move further in the anti-realist direction since then? If so, why?
How I describe my position can vary a lot. Essentially, I think that there might be a partial order among sets of moral axioms, in that it seems plausible to me that you could say that set $A$ is almost-objectively better than set $B$ (more rigorously: according to criterion $c$, $A > B$, and criterion $c$ seems a very strong candidate for an “objectively true” axiom; something comparable to the basic properties of equality, https://en.wikipedia.org/wiki/Equality_(mathematics)#Basic_properties ).
But it seems clear there is not going to be a total order, nor a maximum element.
I agree this is the situation today, but I don’t see how we can be so sure that it won’t get better in the future. Philosophical progress is a thing, right?
Progress in philosophy involves uncovering true things, not making things easier; mathematics is a close analogue. For example, computational logic would have been a lot simpler if in fact there existed an algorithm that figured out if a given Turing machine would halt. The fact that Turing’s result made everything more complicated didn’t mean that it was wrong.
Similarly, the only reason to expect that philosophy would discover moral realism to be true is if we currently had strong reasons to suppose that moral realism is true.