Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
By outer alignment I was referring to "providing well-specified rewards" (https://arxiv.org/abs/2209.00626). Following this definition, I still think that if one is unable to disentangle what's relevant to predicting the future, one cannot carefully tailor a reward function that teaches an agent how to predict the future. Thus, it cannot be consequentialist, or at least it will have to deal with a large amount of uncertainty when forecasting over timescales longer than the predictable horizon. I think this reasoning rests on the basic premise you mentioned ("one can construct a desirability tree over various possible future states").
All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn’t matter.
Oh, but it does matter! If your desirability tree consists of weak branches (i.e., wrong predictions), what’s it good for?
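To put a number on this (a toy sketch of my own, not something from the paper linked above): if each step of a branch is forecast correctly with some independent per-step accuracy, then the chance that the branch even ends on the terminal state you are evaluating decays exponentially with depth. The function name and the numbers below are hypothetical, chosen only for illustration.

```python
# Toy illustration: how per-step prediction error weakens a desirability tree.
# All names and numbers here are hypothetical.

def branch_reliability(step_accuracy: float, depth: int) -> float:
    """Probability that a length-`depth` branch ends on the right terminal state,
    assuming each step is forecast correctly with independent probability
    `step_accuracy`."""
    return step_accuracy ** depth

if __name__ == "__main__":
    step_accuracy = 0.9  # hypothetical per-step forecast accuracy
    for depth in (1, 3, 5, 10, 20):
        r = branch_reliability(step_accuracy, depth)
        print(f"horizon {depth:2d}: branch reliability ≈ {r:.3f}")
    # Beyond the "predictable horizon", reliability collapses, so evaluating the
    # terminal state of that branch tells you little about the action taken now.
```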
We can’t blame this on outer alignment, can we? This would be better described as goal misspecification.
I believe it may have been a mistake on my side: I had assumed that the definition I was using for outer alignment was the standard/default one! I think this would match goal misspecification, yes! (And my working definition, as stated above.)
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn’t have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
Completely agreed!
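To make the contrast concrete, here is a minimal sketch (my own illustration; `predict_futures`, `state_value`, and `action_value` are hypothetical stand-ins, not anything from the post): a consequentialist evaluator has to take an expectation over predicted terminal states, so it inherits all the forecasting uncertainty, while a deontological one only looks at the action itself.

```python
# Hypothetical contrast between the two evaluation styles discussed above.
# `predict_futures` stands in for a forecasting model, `state_value` for a
# terminal-state evaluator, and `action_value` for a rule that judges the
# action by its own nature.

from typing import Callable, Iterable, Tuple

def consequentialist_score(
    action: str,
    predict_futures: Callable[[str], Iterable[Tuple[float, object]]],
    state_value: Callable[[object], float],
) -> float:
    """Expected value over predicted terminal states -- depends entirely on
    how good the forecasts (probabilities and states) are."""
    return sum(p * state_value(s) for p, s in predict_futures(action))

def deontological_score(action: str, action_value: Callable[[str], float]) -> float:
    """Judges the action by its own nature; no forecasting of futures needed."""
    return action_value(action)

if __name__ == "__main__":
    # Toy usage with made-up values.
    futures = lambda a: [(0.7, "good outcome"), (0.3, "bad outcome")]
    value = lambda s: 1.0 if s == "good outcome" else -1.0
    rule = lambda a: 1.0 if a == "tell the truth" else -1.0
    print(consequentialist_score("tell the truth", futures, value))  # 0.4
    print(deontological_score("tell the truth", rule))               # 1.0
```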
On a related note, you may find this interesting: https://arxiv.org/abs/1607.00913
Hmm, I see what you mean. However, that person’s lack of clarity would in fact also be called a “bad prediction”, which is something I’m trying to point out in the post! These bad predictions can happen due to a number of different factors (missing relevant variables, misspecified initial conditions...). I believe the only reason we don’t call it “misaligned behaviour” is that we’re assuming people do not (usually) act according to an explicitly stated reward function!
What do you think?
Thanks for this pointer!