_Let’s say we define an aligned agent as one doing what we would want, provided that we were in its shoes (i.e. knowing what it knew). Under this definition, it is indeed possible to specify an agent’s decision rule in a way that doesn’t rely on long-range predictions (where predictive power gets fuzzy, like Alejandro says, due to measurement error and complexity)._
This makes intuitive sense to me! However, for concreteness, I’d push back with an example and some questions.
Let’s assume that we want to train an AI system that autonomously operates in the financial market. Arguably, a good objective for this agent is to maximize benefits. However, due to the chaotic nature of financial markets and the unpredictability of initial conditions, the agent might develop strategies that lead to unintended and potentially harmful behaviours.
Questions:
Would the short-term strategy be useful in this case? I don’t think it would, because of the strong coupling between actors in the market.
If we were to use the definition of “doing what we would want, provided that we were in its shoes”, I’d argue this agent would basically be incapable of operating, because we do not have examples in which humans can factor in so much potentially relevant information to make up their minds.
As I understand it, the argument above doesn’t account for the agent using the best information available at the time (in the future, relative to its goal specification).
Hmm, I think my argument also applies to this case, because the “best information available at the time” might not be enough (e.g., because we cannot know whether there are missing variables, imprecise initial conditions, etc.). I’d say the only case in which this is good enough is when the course of action is within the forecastable horizon. But, in that case, all long-term goals have to be able to be split into much smaller pieces, which is something I am honestly not sure can be done.
I’d be interested in hearing why these expectations might not be well calibrated, of course!
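To make the “forecastable horizon” idea concrete, here is a minimal sketch. It assumes the logistic map as a toy chaotic system (a stand-in, not a model of financial markets), and the names `logistic_trajectory` and `forecast_horizon` are hypothetical. It counts how many steps a forecast stays within a tolerance of the true trajectory when the initial condition is measured with a tiny error.

```python
def logistic_trajectory(x0, r=4.0, steps=200):
    """Iterate the logistic map x -> r * x * (1 - x), a standard toy chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def forecast_horizon(x0, error=1e-10, tolerance=0.1, r=4.0, steps=200):
    """Return the first step at which a forecast started from a slightly
    mismeasured initial condition differs from the true trajectory by more
    than `tolerance`, or None if that never happens within `steps`."""
    true_path = logistic_trajectory(x0, r, steps)
    forecast = logistic_trajectory(x0 + error, r, steps)
    for t, (a, b) in enumerate(zip(true_path, forecast)):
        if abs(a - b) > tolerance:
            return t
    return None


if __name__ == "__main__":
    # Shrinking the measurement error buys only a modest number of extra steps.
    for err in (1e-4, 1e-8, 1e-12):
        print(err, forecast_horizon(0.2, error=err))
```

In this toy setting the measurement error roughly doubles at every step, so each large improvement in measurement precision extends the usable horizon by only a few steps. Plans longer than that horizon cannot rely on the forecast.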
Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
If one is a consequentialist (of some flavor), one can still construct a “desirability tree” over various possible future states. Sure, the uncertainty makes the problem more complex in practice, but the algorithm is still very simple. So I don’t think that a more complex universe intrinsically has anything to do with alignment per se.
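As a rough illustration of how simple that algorithm is, here is a minimal sketch of a “desirability tree”. Everything in it is an assumption for illustration: the `Node` class, the function names, and the probabilities and scores in the example are hypothetical. Leaves hold the valuation of a terminal state, inner nodes hold probability-weighted branches, and the agent picks the action with the highest expected desirability.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A future state. A leaf carries a desirability score in `value`;
    an inner node carries (probability, child) pairs in `branches`."""
    value: float = 0.0
    branches: list = field(default_factory=list)


def expected_desirability(node):
    """Probability-weighted average of terminal-state valuations below `node`."""
    if not node.branches:
        return node.value
    return sum(p * expected_desirability(child) for p, child in node.branches)


def best_action(actions):
    """Pick the action whose subtree has the highest expected desirability."""
    return max(actions, key=lambda name: expected_desirability(actions[name]))


# Hypothetical two-action example with made-up probabilities and scores.
actions = {
    "hold": Node(branches=[(1.0, Node(value=1.0))]),
    "trade": Node(branches=[(0.7, Node(value=3.0)), (0.3, Node(value=-2.0))]),
}

if __name__ == "__main__":
    print(best_action(actions))
```

The recursion itself does not change as the universe gets harder to predict. What changes is how trustworthy the branch probabilities are, and the sketch takes them as given.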
Arguably, machines will have better computational ability to reason over a vast number of future states. In this sense, they will be more ethical according to consequentialism, provided their valuation of terminal states is aligned.
To be clear, of course, alignment w.r.t. the valuation of terminal states is important. But I don’t think this has anything to do with a harder-to-predict universe. All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn’t matter.
(If you are detecting that I have doubts about the goodness and practicality of consequentialism, you would be right, but I don’t think this is central to the argument here.)
If humans don’t really carry out consequentialism like we hope they would (and surely humans are not rational enough to adhere to consequentialist ethics—perhaps not even in principle!), we can’t blame this on outer alignment, can we? This would be better described as goal misspecification.
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn’t have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
Do you want to discuss some other kind of ethics? Is there some other flavor that would operate differentially w.r.t. outer alignment in a more versus less predictable universe?
Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
With outer alignment I was referring to “providing well-specified rewards” (https://arxiv.org/abs/2209.00626). Following this definition, I still think that if one is unable to disentangle what’s relevant to predict the future, one cannot carefully tailor a reward function that teaches an agent how to predict the future. Thus, it cannot be consequentialist, or at least it will have to deal with a large amount of uncertainty when forecasting on timescales that are longer than the predictable horizon. I think this reasoning is based on the basic premise that you mentioned (“one can construct a desirability tree over various possible future states”).
All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn’t matter.
Oh, but it does matter! If your desirability tree consists of weak branches (i.e., wrong predictions), what’s it good for?
we can’t blame this on outer alignment, can we? This would be better described as goal misspecification.
I believe this may have been a mistake on my side: I had assumed that the definition I was using for outer alignment was the standard/default one! I think this would match goal misspecification, yes (and my working definition, as stated above)!
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn’t have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
Want to try out a thought experiment? Put that same particular human (who wanted to specify goals for an agent) in the financial scenario you mention. Then ask: how well would they do? Compare the quality of how the person would act versus how well the agent might act.
This raises related questions:
If the human doesn’t know what they would want, it doesn’t seem fair to blame the problem on alignment failure. In such a case, the problem would be a person’s lack of clarity.
Humans are notoriously good rationalizers and may downplay their own bad decisions. Making a fair comparison between “what the human would have done” versus “what the AI agent would have done” may be quite tricky. (See the Fundamental Attribution Error, a.k.a. correspondence bias.)
If the human doesn’t know what they would want, it doesn’t seem fair to blame the problem on alignment failure. In such a case, the problem would be a person’s lack of clarity.
Hmm, I see what you mean. However, that person’s lack of clarity would in fact also be called “bad prediction”, which is something I’m trying to point out in the post! These bad predictions can happen due to a number of different factors (missing relevant variables, misspecified initial conditions...). I believe the only reason we don’t call it “misaligned behaviour” is that we’re assuming that people do not (usually) act according to an explicitly stated reward function!
What do you think?
Humans are notoriously good rationalizers and may downplay their own bad decisions. Making a fair comparison between “what the human would have done” versus “what the AI agent would have done” may be quite tricky. (See the Fundamental Attribution Error, a.k.a. correspondence bias.)
Completely agreed!
On a related note, you may find this interesting: https://arxiv.org/abs/1607.00913
Thanks for this pointer!