Using powerful optimization to produce outcomes that look great to Paul-level reasoning doesn’t seem wise, regardless of your views on moral questions.
Interesting. I think there are some important but subtle distinctions here.
In the standard supervised learning setup, we provide a machine learning algorithm with some X (in this case, courses of action an AI could take) and some Y (in this case, essentially real numbers representing the degree to which we approve of the courses of action). The core challenge of machine learning is to develop a model which extrapolates well beyond this data. So then the question becomes… does it extrapolate well in the sense of accurately predicting Paul-level reasoning, including deficiencies Paul would exhibit when examining complex or deceptive scenarios that are at the limit of Paul’s ability to understand? Or does it extrapolate well in the sense of accurately predicting what Paul would desire on reflection, given access to all of the AI’s knowledge, cognitive resources, etc.?
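The ambiguity here can be made concrete with a toy sketch: two models that fit the labeled data identically can still diverge badly off-distribution, and nothing in the data itself tells you which extrapolation is intended. (The models below are invented for illustration.)

```python
# Two hypothetical "extrapolations" that agree on all the labeled data.
train_X = [0.0, 0.5, 1.0]
model_a = lambda x: x                           # plain linear fit
model_b = lambda x: x + x*(x - 0.5)*(x - 1.0)   # also passes through all three points

for x in train_X:
    # Indistinguishable on the data we actually have.
    assert abs(model_a(x) - model_b(x)) < 1e-9

# But off-distribution they disagree substantially.
print(model_a(2.0), model_b(2.0))  # → 2.0 5.0
```

Both hypotheses "extrapolate well" by the usual training-loss standard; the question of which one is *right* is exactly the question of which Paul we meant.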
Let’s assume for the sake of argument that all of the X and Y data is “good”, i.e. it doesn’t make the algorithm think that the first sort of Paul is the one who’s supposed to get extrapolated, say by including a mistake that only the first Paul would make. I’ll talk about the case where we have some bad data at the end.
The standard way to measure the effectiveness of extrapolation in machine learning is to make use of a dev set. Unfortunately, that doesn’t help in this case because we don’t have access to labeled data from “Paul who has reflected a bunch given access to all of the AI’s knowledge, cognitive resources, etc.” If we did have access to such data, we could find a data point that the two Pauls label differently and test the model on that. (However, we might do a similar sort of test by asking a child to provide some labeled data, then checking to see whether the model assigns nontrivial credence to the answers an adult gives on data points where the child and the adult disagree.)
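The child/adult test might look something like the following sketch. The model, the smoothing scheme, and the 0.1 credence threshold are all stand-ins I've invented; the point is just that a model trained on the child's labels can be probed for whether it leaves room for the adult's answers.

```python
from collections import Counter

def fit(data, n_buckets=4):
    # Toy model: Laplace-smoothed label frequencies per feature bucket,
    # so the model never assigns exactly zero credence to an unseen answer.
    buckets = {}
    for x, y in data:
        key = min(int(x * n_buckets), n_buckets - 1)
        buckets.setdefault(key, Counter())[y] += 1
    def credence(x, label):
        c = buckets.get(min(int(x * n_buckets), n_buckets - 1), Counter())
        return (c[label] + 1) / (sum(c.values()) + 2)  # two possible labels
    return credence

# Labeled data from the "child" (approves everything it has seen).
child_data = [(0.1, "ok"), (0.2, "ok"), (0.8, "ok"), (0.9, "ok")]
model = fit(child_data)

# Points where the child ("ok") and the adult disagree ("bad").
disagreements = [(0.85, "bad"), (0.95, "bad")]

# The check: does the child-trained model retain nontrivial credence
# on the adult's answers at the disagreement points?
print(all(model(x, adult_y) >= 0.1 for x, adult_y in disagreements))  # → True
```

A model that passed this check would at least not be *confidently* locked into the child's extrapolation.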
In poetic terms, we want the system to be asking itself:
Is there a plausible model that fits the labeled data I’ve been given which leads me to believe this world is not one in which humans actually have adequate control and understanding of the situation? Does there exist some model for the user’s preferences such that I assign a decently high prior to this model, the model fits the labeled data I’ve been given, and when this model is extrapolated to this [malign] clever scheme I’ve dreamed up, it returns either “this scheme is too complicated for me to evaluate and it should be penalized on that basis” or “this scheme is just bad”?
In the absence of data which distinguishes between two hypotheses, belief in one hypothesis or the other comes down to the choice of prior. So you want the AI’s cognitive architecture to be structured so that whatever concepts, learning capabilities, prediction capabilities, etc. make it cognitively powerful also get re-used in the service of generating plausible extrapolations from the labeled data the user has provided. Then, if any of those extrapolations assign nontrivial credence to some plan being malign, that’s a strike against it.
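One minimal way to operationalize "a strike against it": keep every preference model that fits the labeled data, and flag a plan if any of them objects. The specific hypotheses and the scoring convention below are invented for illustration.

```python
# Approved actions (+1) all lie near 0; nothing in the data distinguishes
# the two candidate preference models below.
labeled = [(0.1, +1), (0.2, +1), (0.3, +1)]

hypotheses = [
    ("approve everything",             lambda x: +1),
    ("approve only familiar actions",  lambda x: +1 if x < 0.5 else -1),
]

def fits(h, data):
    # A hypothesis is a "plausible extrapolation" if it matches every label.
    return all(h(x) == y for x, y in data)

def flagged(plan):
    # Strike against the plan if ANY data-consistent model objects to it.
    return any(h(plan) < 0 for _, h in hypotheses if fits(h, labeled))

print(flagged(0.15))  # → False: every consistent model approves
print(flagged(0.9))   # → True: one consistent extrapolation objects
```

In a real system the hypothesis set would come from the AI's own learning machinery rather than a hand-written list, and "fits" would be a prior-weighted likelihood rather than exact agreement; this is only the skeleton of the idea.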
Re: the bad data case, you might handle this using the same sort of techniques which are normally used for mislabeled or noisy data. For example, split the data into 30 folds, train one model on every possible combination of 10 folds, and if any one of the resulting models objects to some action, nix it. Since 20 mislabeled points can contaminate at most 20 of the 30 folds, at least 10 folds stay clean, so at least one model is trained entirely on good data and retains its veto. Now we’re resilient to up to 20 mislabeled data points. Not saying this is a good scheme, just trying to offer a concrete illustration of how this problem seems tractable.
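The veto scheme can be sketched as follows, scaled down so it actually runs (6 folds, combinations of 2, rather than 30 choose 10, which is about 30 million models). The 1-nearest-neighbor "model" and all the numbers are placeholders of my own.

```python
from itertools import combinations

def predict(train, x):
    # Toy model: return the approval label of the nearest training point.
    nearest = min(train, key=lambda pt: abs(pt[0] - x))
    return nearest[1]

def vetoed(folds, m, x):
    # Train one model per m-fold combination; veto action x if ANY
    # resulting model disapproves of it.
    for combo in combinations(folds, m):
        train = [pt for fold in combo for pt in fold]
        if predict(train, x) < 0:
            return True
    return False

# Five clean folds: approve small x (+1), disapprove large x (-1) ...
clean = [[(0.1, +1), (0.9, -1)] for _ in range(5)]
# ... plus one fold containing a mislabeled point (0.85 approved).
folds = clean + [[(0.85, +1)]]

# Any combination that excludes the bad fold disapproves of x = 0.84,
# so the action gets vetoed despite the mislabeled point.
print(vetoed(folds, 2, 0.84))  # → True
print(vetoed(folds, 2, 0.1))   # → False: no model objects
```

The resilience argument carries over directly: as long as fewer folds are contaminated than the combination size leaves clean, some model's training set is untouched by the bad data.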