I’ve heard it claimed that better calibration is not the way to solve AI safety, but it seems like a promising solution to the transit design problem. Suppose we have a brilliant Bayesian machine learning system. Given a labeled dataset of transit system designs we approve or disapprove of, the system estimates, for any candidate model, the probability that it is the “correct” model separating good designs from bad ones. Now consider two models chosen for the sake of argument: a “human approval” model and an “actual preferences” model. The “human approval” model will be assigned a very high probability. But I’d argue that the “actual preferences” model will also be assigned a fairly high probability, because the labeled dataset we provide will be broadly compatible with our actual preferences. As long as the system assigns a reasonably high prior probability to our actual preferences, and the likelihood of the labels given our actual preferences is reasonably high, we should be OK.
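To make that picture concrete, here’s a minimal sketch of the posterior computation, under my own illustrative assumptions: each candidate model is just a function giving the probability we’d label a design good, and names like `posterior_over_models` are mine, not any particular library’s API.

```python
def posterior_over_models(models, priors, labeled_designs):
    """models: dict mapping name -> function(design) -> P(we label the design good | model)
    priors: dict mapping name -> prior probability of that model
    labeled_designs: list of (design, label) pairs, label 1 = approve, 0 = disapprove."""
    unnormalized = {}
    for name, predict in models.items():
        likelihood = 1.0
        for design, label in labeled_designs:
            p_good = predict(design)
            # Likelihood of this particular label under this model.
            likelihood *= p_good if label == 1 else (1.0 - p_good)
        unnormalized[name] = priors[name] * likelihood
    total = sum(unnormalized.values())
    # Normalize to get a posterior distribution over the candidate models.
    return {name: weight / total for name, weight in unnormalized.items()}
```

Under this kind of scoring, both the “human approval” model and the “actual preferences” model keep non-negligible posterior weight, so long as each has a reasonable prior and explains the labels well.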
Then, instead of aiming for the design that looks best according to the single most probable model, we aim for a design whose probability of being good is maximal when the model gets summed out. This means we’re maximizing an objective which averages over a wide variety of models that are broadly compatible with the labeled data… including, in particular, our “actual preferences”.
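Continuing the sketch (and reusing the hypothetical posterior from above), “summing out the model” just means weighting every model’s verdict by its posterior probability, rather than trusting the single most probable model:

```python
def marginal_goodness(design, models, posterior):
    """P(design is good) with the model summed out (Bayesian model averaging)."""
    return sum(posterior[name] * predict(design) for name, predict in models.items())

def best_design(candidate_designs, models, posterior):
    """Pick the design whose marginal probability of being good is highest."""
    return max(candidate_designs, key=lambda d: marginal_goodness(d, models, posterior))
```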
In other words, find many reasonable ways of extrapolating the labeled data, and select a transit system which is OK according to all of them. (Or even select a transit system which is OK according to half of them, then use the other half as a test set. Note that our actual preferences don’t need to be among the ensemble of models, so long as for any veto our actual preferences would make, some model in the ensemble makes that veto too.)
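Here’s a rough sketch of that ensemble framing, with the ensemble as a plain list of model functions and an illustrative acceptability threshold; none of these names come from an existing library.

```python
import random

def passes_all(design, ensemble, threshold=0.5):
    """A design survives only if no model in the ensemble vetoes it."""
    return all(model(design) >= threshold for model in ensemble)

def select_with_holdout(candidate_designs, ensemble, threshold=0.5, seed=0):
    """Select using half the ensemble, then check survivors against the held-out half."""
    rng = random.Random(seed)
    shuffled = list(ensemble)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    selection_half, test_half = shuffled[:half], shuffled[half:]
    survivors = [d for d in candidate_designs if passes_all(d, selection_half, threshold)]
    return [d for d in survivors if passes_all(d, test_half, threshold)]
```

The held-out half acts as a sanity check: if the designs selected by the first half still pass the second half, that’s some evidence the selection isn’t exploiting quirks of any particular extrapolation.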
I’d argue that, from a safety point of view, it’s more important to have an acceptable transit system than an optimal one. Similarly, the goal with our first AGI should be to put the world on an acceptable trajectory, not the optimal trajectory. If the world is on an acceptable trajectory, we can always work to improve things. If the world shifts to an unacceptable trajectory, we may not be able to improve things. So to a first approximation, our first AGI should work to minimize the odds that the world is on an unacceptable trajectory, according to its subjective estimate of what constitutes an unacceptable trajectory.
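As a final sketch, the “acceptable rather than optimal” idea corresponds to minimizing the marginal probability of unacceptability instead of maximizing expected goodness. Again, the names here, and the identification of “unacceptable” with 1 − P(good) under each model, are my illustrative assumptions.

```python
def p_unacceptable(option, models, posterior):
    """Marginal probability that the option is unacceptable, treating each model's
    1 - P(good) as that model's probability of vetoing the option."""
    return sum(posterior[name] * (1.0 - predict(option)) for name, predict in models.items())

def safest_option(options, models, posterior):
    """Pick the option that minimizes the subjective odds of unacceptability."""
    return min(options, key=lambda o: p_unacceptable(o, models, posterior))
```

The design choice here is risk minimization rather than expected-value maximization: an option that every plausible model rates as merely fine beats one that a single high-weight model rates as spectacular but some other plausible model would veto.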