Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
If one is a consequentialist (of some flavor), one can still construct a “desirability tree” over various possible future states. Sure, the uncertainty makes the problem more complex in practice, but the algorithm is still very simple. So I don’t think that a more complex universe intrinsically has anything to do with alignment per se.
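To make the “very simple algorithm” concrete, here is a minimal expectimax-style sketch of such a desirability tree; the names and structure are my own illustration, and everything hard lives in the inputs assumed as given: a world model (`transition`), an action set (`actions`), and an aligned valuation of terminal states (`value`).

```python
from typing import Callable, List, Tuple

State = str
Action = str

def desirability(
    state: State,
    depth: int,
    actions: Callable[[State], List[Action]],                          # available actions (assumed given)
    transition: Callable[[State, Action], List[Tuple[State, float]]],  # successor states with probabilities
    value: Callable[[State], float],                                   # aligned valuation of terminal states
) -> float:
    """Expected desirability of `state`, looking `depth` steps ahead (expectimax-style)."""
    if depth == 0 or not actions(state):
        # At the horizon (or a true terminal state) we just apply the valuation.
        return value(state)
    # Otherwise, pick the action with the highest expected downstream desirability.
    return max(
        sum(p * desirability(nxt, depth - 1, actions, transition, value)
            for nxt, p in transition(state, a))
        for a in actions(state)
    )
```

The recursion itself is trivial; an unpredictable universe only makes `transition` harder to get right, not the algorithm more complicated.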
Arguably, machines will have greater computational capacity to reason over a vast number of future states. In this sense, they will be more ethical according to consequentialism, provided their valuation of terminal states is aligned.
To be clear, of course, alignment w.r.t. the valuation of terminal states is important. But I don’t think this has anything to do with a harder-to-predict universe. All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn’t matter.
(If you are detecting that I have doubts about the goodness and practicality of consequentialism, you would be right, but I don’t think this is central to the argument here.)
If humans don’t really carry out consequentialism like we hope they would (and surely humans are not rational enough to adhere to consequentialist ethics—perhaps not even in principle!), we can’t blame this on outer alignment, can we? This would be better described as goal misspecification.
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn’t have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
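A toy contrast (the rule set here is purely my own illustration, not anything from the thread): a deontological judgment needs no world model and no forecast at all; it just inspects the action.

```python
# Purely illustrative rule set, chosen for the example.
FORBIDDEN_ACTIONS = {"deceive", "steal", "coerce"}

def permissible(action: str) -> bool:
    """Deontological judgment: depends only on the action itself, not on any forecast."""
    return action not in FORBIDDEN_ACTIONS
```

No transition model, no tree; the unpredictability of the universe never enters the evaluation.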
Do you want to discuss some other kind of ethics? Is there some other flavor that would operate differentially w.r.t. outer alignment in a more versus less predictable universe?
Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
By outer alignment I was referring to “providing well-specified rewards” (https://arxiv.org/abs/2209.00626). Following this definition, I still think that if one is unable to disentangle what’s relevant for predicting the future, one cannot carefully tailor a reward function that teaches an agent how to predict the future. Thus, the agent cannot be consequentialist, or at least it will have to deal with a large amount of uncertainty when forecasting on timescales longer than the predictable horizon. I think this reasoning rests on the basic premise you mentioned (“one can construct a desirability tree over various possible future states”).
All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn’t matter.
Oh, but it does matter! If your desirability tree consists of weak branches (i.e., wrong predictions), what’s it good for?
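To put a rough number on this worry (a back-of-the-envelope assumption of my own, treating the errors as independent): if each one-step prediction along a branch is correct with probability q, then a depth-d branch is sound with probability about q^d, so deep branches quickly stop carrying useful information.

```python
# Back-of-the-envelope: reliability of a branch whose d one-step predictions
# are each correct with (assumed) independent probability q.
def branch_reliability(q: float, depth: int) -> float:
    return q ** depth

for d in (1, 5, 20, 50):
    print(d, round(branch_reliability(0.9, d), 3))
# 1 0.9
# 5 0.59
# 20 0.122
# 50 0.005
```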
we can’t blame this on outer alignment, can we? This would be better described as goal misspecification.
I believe it may have been a mistake on my side; I had assumed that the definition I was using for outer alignment was the standard/default one! I think this would match goal misspecification, yes! (And my working definition, as stated above.)
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn’t have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
Completely agreed!
On a related note, you may find this interesting: https://arxiv.org/abs/1607.00913