The capability amplification section also seems under-motivated to me. Paul writes: “If we start with a human, then RL will only ever produce human-level reasoning about long-term consequences or about “what is good.”” But absent problems like those you describe in this post, I’m inclined to agree with Eliezer that
If arguendo you can construct an exact imitation of a human, it possesses exactly the same alignment properties as the human; and this is true in a way that is not true if we take a reinforcement learner and ask it to maximize an approval signal originating from the human. (If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.)
In other words, if we are aiming for Bostrom’s maxipok (maximum probability of an OK outcome), it seems plausible to me that “merely” Paul’s level of moral reasoning is sufficient to get us there, especially if the keys to the universe get handed back. If this is our biggest alignment-specific problem, I might sooner allocate marginal research hours towards improving formal methods or something like that.
plausible to me that “merely” Paul’s level of moral reasoning is sufficient to get us there
The hard part of “what is good” isn’t the moral part, it’s understanding things like “in this world, do humans actually have adequate control of and understanding of the situation?” Using powerful optimization to produce outcomes that look great to Paul-level reasoning doesn’t seem wise, regardless of your views on moral questions.
I agree that Paul-level reasoning is fine if no one else is building AI systems with more powerful reasoning.
Using powerful optimization to produce outcomes that look great to Paul-level reasoning doesn’t seem wise, regardless of your views on moral questions.
Interesting. I think there are some important but subtle distinctions here.
In the standard supervised learning setup, we provide a machine learning algorithm with some X (in this case, courses of action an AI could take) and some Y (in this case, essentially real numbers representing the degree to which we approve of the courses of action). The core challenge of machine learning is to develop a model which extrapolates well beyond this data. So then the question becomes… does it extrapolate well in the sense of accurately predicting Paul-level reasoning, including deficiencies Paul would exhibit when examining complex or deceptive scenarios that are at the limit of Paul’s ability to understand? Or does it extrapolate well in the sense of accurately predicting what Paul would desire on reflection, given access to all of the AI’s knowledge, cognitive resources, etc.?
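As a purely illustrative sketch of that setup, here is what the supervised problem looks like mechanically. The action features, approval scores, and model choice below are all invented by me; the point is that nothing in the fitting step itself tells us which of the two extrapolations we get.

```python
# Minimal sketch of the supervised setup described above; features, approval
# scores, and the model choice are invented for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                                   # hypothetical encodings of courses of action
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)    # stand-in approval signal

model = GradientBoostingRegressor().fit(X, y)

# The crux is off-distribution behavior: does the fitted model reproduce
# Paul-level judgment (deficiencies included), or something closer to
# Paul-on-reflection?  Nothing in the fit itself settles that question.
novel_action = rng.normal(size=(1, 16)) * 5.0                    # far outside the training distribution
print(model.predict(novel_action))
```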
Let’s assume for the sake of argument that all of the X and Y data is “good”, i.e. it doesn’t contain a mistake that only the first Paul (the one with deficiencies) would make, which would cue the algorithm that the first Paul is the one it’s supposed to extrapolate. I’ll talk about the case where we have some bad data at the end.
The standard way to measure the effectiveness of extrapolation in machine learning is to use a dev set. Unfortunately, that doesn’t help in this case, because we don’t have access to labeled data from “Paul who has reflected a bunch, given access to all of the AI’s knowledge, cognitive resources, etc.” If we did have access to such data, we could find a data point that the two Pauls label differently and test the model on that. (However, we might do a similar sort of test by asking a child to provide some labeled data, then checking whether the model assigns nontrivial credence to the answers an adult gives on data points where the child and the adult disagree.)
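Here is a hedged sketch of that child/adult analogue, with synthetic data standing in for the child’s and adult’s labels; the two labeling rules and the 0.2 credence threshold are arbitrary choices of mine, not part of the proposal.

```python
# Toy version of the child/adult test: train only on the "child" labels, then
# check how much credence the model puts on the "adult" answer at held-out
# points where the two disagree.  All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
child = (X[:, 0] > 0).astype(int)                      # crude "child" labeling rule
adult = ((X[:, 0] + 0.7 * X[:, 1]) > 0).astype(int)    # more discerning "adult" rule

X_train, child_train = X[:300], child[:300]
X_test, child_test, adult_test = X[300:], child[300:], adult[300:]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, child_train)

disagree = child_test != adult_test
proba = model.predict_proba(X_test[disagree])
credence_in_adult_answer = proba[np.arange(disagree.sum()), adult_test[disagree]]

# Fraction of disagreement points where the child-trained model still assigns
# nontrivial credence (here, arbitrarily, > 0.2) to the adult's answer.
print((credence_in_adult_answer > 0.2).mean())
```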
In poetic terms, we want the system to be asking itself:
Is there a plausible model that fits the labeled data I’ve been given which leads me to believe this world is not one in which humans actually have adequate control and understanding of the situation? Does there exist some model for the user’s preferences such that I assign a decently high prior to this model, the model fits the labeled data I’ve been given, and when this model is extrapolated to this [malign] clever scheme I’ve dreamed up, it returns either “this scheme is too complicated for me to evaluate and it should be penalized on that basis” or “this scheme is just bad”?
In the absence of data which distinguishes between two hypotheses, belief in one hypothesis or the other comes down to the choice of prior. So you want the AI’s cognitive architecture to be structured so that whatever concepts, learning capabilities, prediction capabilities, etc. which make it cognitively powerful also get re-used in the service of generating plausible extrapolations from the labeled data the user has provided. Then, if any of those extrapolations assign nontrivial credence to some plan being malign, that’s a strike against it.
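To make that concrete, here is a deliberately tiny sketch: a handful of hand-written candidate preference models, a prior over them, a check for which ones fit the labeled data, and a strike against any scheme that some surviving model rates badly. The specific models, numbers, and thresholds are all mine, invented for illustration.

```python
# Toy sketch: keep every candidate model of the user's preferences that has
# nonnegligible prior and fits the labeled data, and treat "some surviving
# model rates this plan badly" as a strike against the plan.
import numpy as np

labeled_plans = np.array([[1.0, 0.0], [0.8, 0.1], [0.2, 0.3]])  # hypothetical plan features
approvals     = np.array([1.0, 0.9, 0.1])                        # user's labels for those plans

candidate_models = {
    "cares_about_feature_0": lambda p: p[0],
    "cares_about_both":      lambda p: 0.7 * p[0] + 0.3 * (1 - p[1]),
    "penalizes_complexity":  lambda p: p[0] - 0.5 * p[1],   # feature 1 as a crude "complexity" proxy
}
prior = {name: 1 / 3 for name in candidate_models}

def fits_data(model, tol=0.3):
    """Does this candidate extrapolation roughly reproduce the user's labels?"""
    return all(abs(model(p) - a) < tol for p, a in zip(labeled_plans, approvals))

surviving = {name: m for name, m in candidate_models.items()
             if prior[name] > 0.05 and fits_data(m)}

clever_scheme = np.array([0.9, 0.95])   # scores well on feature 0, very "complicated" on feature 1
if any(m(clever_scheme) < 0.5 for m in surviving.values()):
    print("strike against the plan: a plausible extrapolation of the data rates it badly")
```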
Re: the bad data case, you might handle this using the same sorts of techniques normally used for mislabeled or noisy data. For example, split the data into 30 folds, train an ensemble with one model per combination of 10 folds, and if any one of the resulting models objects to some action, nix it. Now we’re resilient to up to 20 mislabeled data points: the bad points can touch at most 20 folds, so at least one 10-fold combination is trained entirely on clean data, and that model’s veto still applies. Not saying this is a good scheme, just trying to offer a concrete illustration of how this problem seems tractable.
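Here is a runnable sketch of that scheme with the numbers shrunk (6 folds, one model per pair of folds, so 15 models instead of 30-choose-10); the data and the logistic-regression choice are placeholders of mine.

```python
# Shrunk version of the fold-ensemble veto: 6 folds, a model per pair of folds
# (15 models).  By the same counting argument as above, this variant tolerates
# up to 4 mislabeled points.  All data here is synthetic.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5))
y = (X[:, 0] > 0).astype(int)    # stand-in for "does the user approve of this action?"
y[:4] = 1 - y[:4]                # corrupt 4 labels

folds = np.array_split(rng.permutation(len(X)), 6)
models = []
for pair in combinations(range(6), 2):
    idx = np.concatenate([folds[i] for i in pair])
    models.append(LogisticRegression().fit(X[idx], y[idx]))

def approved(action):
    """An action goes through only if no model in the ensemble objects to it."""
    return all(m.predict(action.reshape(1, -1))[0] == 1 for m in models)

print(approved(np.array([1.5, 0.0, 0.0, 0.0, 0.0])))   # clearly good by the true rule
print(approved(np.array([-1.5, 0.0, 0.0, 0.0, 0.0])))  # clearly bad by the true rule
```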