I think it’s reasonable to assume that any agent that meaningfully comprehends human values has fairly good models of humans. The agents on which we train don’t have to be as complex as a low-level emulation of the human brain, they only have to be as complex as the model humans the AGI is going to imagine. Similarly, the worlds on which we train don’t have to be as complex as a low-level emulation of physics, they only have to be as complex as the models that the model humans have for the real world.
The agents on which we train [...] only have to be as complex as the model humans the AGI is going to imagine.
This doesn’t seem right. The agents on which we train must look indistinguishable from (a distribution including) humans, from the agent’s standpoint. That is a much higher bar, even complexity-theoretically!
For example, the AI can tell that humans can solve Sudoku puzzles, even if it can’t figure out how to solve Sudoku in a human-like way. The same holds for any aspect of any problem which can be more easily verified than implemented. So it seems that an AI would need most human abilities in order to fool itself.
My concern about this approach is closely related to my concerns about imitation learning. The situation is in some ways better and in some ways worse for your proposal—for example, you don’t get to do meeting halfway, but you do get to use very inefficient algorithms. Note that the imitation learning approach also only needs to produce behavior that is indistinguishable from human as far as the agent can tell, so the condition is really quite similar.
But at any rate, I’m more interested in the informed oversight problem than this difficulty. If we could handle informed oversight, then I would be feeling very good about bootstrapped approval-direction (e.g. ALBA). Also note that if we could solve the informed oversight problem, then bootstrapped imitation learning would also work fine even if imitation learning is inefficient, for the reasons described here.
And if you can avoid the informed oversight problem I would really like to understand how. It looks to me like an essential step in getting an optimal policy in your sense of optimality, for the same reasons that I think it might be an essential part of getting any efficient aligned AI.
Consider an environment which consists of bitstrings x followed by f⁻¹(x), where f is an invertible one-way function. If you apply my version of bounded Solomonoff induction to this environment, the model that emerges is one that generates y, computes x = f(y), and outputs x followed by y. Thus y appears in the internal state of the model before x, even though x appears before y in the temporal sequence (and even though physical causality might flow from x to y in some sense).
Analogously, I think that in the Sudoku example the model human will have mental access to a “cheat sheet” for the Sudoku puzzle, generated by the environment, which is not visible to the AGI.
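As a toy illustration of this environment (using modular exponentiation as a stand-in for an invertible one-way function, since inverting it is the discrete-log problem; the constants and function names below are purely illustrative):

```python
import random

# Toy stand-in for an invertible one-way function:
# f(y) = g^y mod p. Computing f is easy; inverting it
# (the discrete log) is believed hard for large p.
P, G = 2**61 - 1, 3  # Mersenne prime; illustrative scale only

def f(y):
    return pow(G, y, P)

def sample_environment(rng):
    """The efficient sampler the induced model must use:
    it draws y FIRST, derives x = f(y), then emits x before y.
    So y exists in the sampler's internal state before x,
    reversing the temporal order of the observed sequence."""
    y = rng.randrange(1, P)
    x = f(y)
    return x, y  # observed order: x, then y = f^{-1}(x)

rng = random.Random(0)
x, y = sample_environment(rng)
assert f(y) == x  # the emitted pair is consistent
```

The point of the sketch: the only efficient way to produce the sequence is to hold y (the “cheat sheet”) before x ever appears.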
Some problems have this special structure, but it doesn’t seem like all of them do.
For a really straightforward example, consider a random SAT instance. What kind of cheat sheet is nature going to generate, without explicitly planting a solution? (Planted solutions look different from organic solutions to random instances.)
It might be that most human abilities are just as hard to verify as to execute, or else permit some kind of cheat sheet. (Or more generally, permit a cheat sheet for the part of the problem that is easier to verify than execute.) But at face value I don’t see why this would be.
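A minimal sketch of the planted-vs-organic contrast, with a toy 3-SAT generator (the functions and parameters are illustrative, not anything fixed by the discussion above). Planting works by resampling clauses until a hidden assignment satisfies them, which visibly skews the clause distribution relative to the uniform model:

```python
import random

def random_clause(n, rng):
    """A uniformly random 3-clause over n variables."""
    vars_ = rng.sample(range(1, n + 1), 3)
    return tuple(v if rng.random() < 0.5 else -v for v in vars_)

def random_3sat(n, m, rng):
    """Uniformly random 3-SAT: no cheat sheet exists anywhere."""
    return [random_clause(n, rng) for _ in range(m)]

def planted_3sat(n, m, rng):
    """Plant a hidden assignment by resampling clauses until each
    is satisfied by it. This changes the clause distribution:
    clauses fully falsified by the plant never appear, so the
    instance statistics differ from the uniform model."""
    plant = {v: rng.random() < 0.5 for v in range(1, n + 1)}
    def satisfied(clause):
        return any((lit > 0) == plant[abs(lit)] for lit in clause)
    clauses = []
    while len(clauses) < m:
        c = random_clause(n, rng)
        if satisfied(c):
            clauses.append(c)
    return clauses, plant
```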
It’s not clear how you can tell that a magic box solves random SAT instances. Some SAT instances are unsatisfiable, so the magic box will occasionally have to output ⊥, but it’s not clear how to distinguish instances that are genuinely unsatisfiable from instances that you merely cannot solve yourself.
I guess there might be a distributional search problem S s.t.
(i) A solution can be verified in polynomial time.
(ii) Every instance has a solution.
(iii) There is no way to efficiently generate valid input-output pairs whose distribution is indistinguishable from the right distribution.
I see no obvious reason why such an S cannot exist, but I also don’t have an example. It would be interesting to find one.
In any case, it is not clear how the agent can learn to identify models like this. You can only train a predictor on models which admit sampling. On the other hand, if there is some kind of transfer learning that allows generalizing to some class of models which don’t admit sampling, then it’s plausible there is also transfer learning that allows generalizing utility inference from sampleable agents to agents in the larger class. Also, it might be possible to train utility inference on some sort of intermediate data produced by the predictor instead of on raw observations, such that it’s possible to sample the distribution of this intermediate data corresponding to complex agents.
But it’s hard to demonstrate something concrete without having some model of how the hypothetical “superpredictor” works. As far as I know, it might be impossible to effectively use unsampleable models.
The point was that we may be able to train an agent to do what we want, even in cases where we can’t effectively build a predictor.
Re: your example. You can do amplification to get exponentially close to certainty (choose instances that are satisfiable with 2⁄3 probability, and then consider the problem “solve at least half of these 1000 instances”). If you really want every instance to have a solution, then you can probably generate the instances pseudorandomly from a small enough seed and do a union bound.
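The amplification arithmetic can be checked with a standard Hoeffding bound (a sketch, using the 2⁄3 and 1000 from the example above): if each instance is independently satisfiable with probability 2⁄3, the chance that fewer than half of 1000 instances are satisfiable is at most exp(−2·1000·(1⁄6)²), which is below 10⁻²⁴.

```python
import math

def hoeffding_failure_bound(n, p, threshold):
    """Hoeffding upper bound on P(fraction of successes < threshold)
    over n independent trials, each succeeding with probability
    p > threshold."""
    assert p > threshold
    return math.exp(-2 * n * (p - threshold) ** 2)

# Each instance satisfiable w.p. 2/3; we only need half of 1000
# to be satisfiable, so failure is exponentially unlikely.
bound = hoeffding_failure_bound(1000, 2 / 3, 1 / 2)
assert bound < 1e-24
```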
By “predictor” I don’t mean something that produces exact predictions, I mean something that produces probabilistic predictions of given quantities. Maybe we should call it “inductor” to avoid conflation with optimal predictors (even though the concepts are closely related). As I said before, I think that an agent has to have a reasonable model of humans to follow human values. Moreover an agent that doesn’t have a reasonable model of humans is probably much less dangerous since it won’t be able to manipulate humans (although I guess the risk is still non-negligible).
The question is what kinds of inductors are complexity-theoretically feasible and what class of models these inductors correspond to. Bounded Solomonoff induction using Λ works on the class of samplable models. In machine learning language, inductors using samplable models are feasible since it is possible to train the inductors by sampling random such models (i.e. by sampling the bounded Solomonoff ensemble). On the other hand, it’s not clear what broader classes of models are admissible, if any.
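The training scheme described here can be sketched as follows (a toy model: random order-1 Markov chains over bits stand in for “random samplable models from the bounded Solomonoff ensemble”, and a count-based next-bit predictor stands in for the inductor):

```python
import random
from collections import defaultdict

def sample_markov_model(rng):
    """A random samplable model: a random order-1 Markov chain
    over bits, mapping previous bit -> P(next bit = 1)."""
    return {b: rng.random() for b in (0, 1)}

def sample_sequence(model, rng, length):
    """Sample a bit sequence from a samplable model."""
    seq, prev = [], 0
    for _ in range(length):
        bit = int(rng.random() < model[prev])
        seq.append(bit)
        prev = bit
    return seq

def train_inductor(n_models, seq_len, rng):
    """Train by sampling random samplable models: draw models,
    sample sequences from them, and fit a (here, Laplace-smoothed
    count-based) next-bit predictor on the pooled data."""
    counts = defaultdict(lambda: [1, 1])  # [count of 0s, count of 1s]
    for _ in range(n_models):
        model = sample_markov_model(rng)
        prev = 0
        for bit in sample_sequence(model, rng, seq_len):
            counts[prev][bit] += 1
            prev = bit
    # Probabilistic predictions: P(next = 1 | previous bit)
    return {p: c[1] / (c[0] + c[1]) for p, c in counts.items()}
```

The reason feasibility hinges on sampling is visible in the loop: every training example is obtained by running a model forward.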
That said, it seems plausible that if it’s feasible to construct inductors for a broader class, my procedure will remain efficient.
Model 1: The “superinductor” works by finding an efficient transformation q of the input sequence x and a good sampleable model for q(x). E.g. q(x) contains the results of substituting the observed SAT solutions into the observed SAT instances. In this model, we can apply my procedure by running utility inference on q(x) instead of x.
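A minimal sketch of such a q, under the assumption that x is parsed into (instance, solution) pairs (the function name and interface are hypothetical): substituting a valid solution into its instance yields clause truth values, a residue that is trivially sampleable even when the (instance, solution) pairs themselves are not.

```python
def q(instance, assignment):
    """Hypothetical Model-1 transformation: substitute the observed
    solution into the observed SAT instance. The output is the truth
    value of each clause under the assignment; for a valid solution
    it is all-True, so its distribution is trivially sampleable."""
    return [any((lit > 0) == assignment[abs(lit)] for lit in clause)
            for clause in instance]
```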
Model 2: The superinductor works via an optimal predictor for SAT*. I think it should be relatively straightforward to show that, given an optimal predictor for SAT plus the ability to design AGIs for target utility functions relative to a SAT oracle, there is an optimal predictor for the total utility function my utility inference procedure defines (after including external agents that run over a SAT oracle). Therefore it is possible to maximize the latter.
*A (poly,log)-optimal predictor for SAT cannot exist unless all sparse problems in NP have efficient heuristic algorithms in some sense, which is unlikely. On the other hand, there is no reason I know of why a (poly,0)-optimal predictor for SAT cannot exist.