I agree that rational agents will eventually converge to behavior that is optimal according to some utility function, under suitable consistency conditions. Assuming that convergence has already occurred is dodging the difficulty I’m trying to point at.
My approach to indirect normativity is to allow the process of reflective deliberation to continue long enough that we expect it to have converged adequately. This doesn’t require any knowledge about the structure of the agents, just the ability to make predictions about what will happen as they reflect. (Reflection would include, for example, whatever deliberations the committee engages in in order to arrive at an effective compromise policy.)
You seem to be offering an alternative, which you hope will dodge some of the difficulties of my formulation of indirect normativity by leveraging something about the nature of rational behavior. If I am making observations of the committee’s performance, and the committee (like almost any real committee) has not converged to an efficient compromise policy, then it seems like your solution concept is supposed to recover this bargaining solution from the committee’s actual decisions. But it isn’t clear to me how that could work, except by inferring how the committee would reflect and then running that reflective process.
I am, of course, not convinced that we need a theory of physicalist agents (since if we run this reflective process then we might as well directly extract preferences over actions, folding the evaluation of consequences into the implicit predictions about these preferences over actions). It would be nice to have a better theoretical understanding. But that’s why I’m not working on the problem.
I think that inverse optimization done right will implicitly recover behavior under long deliberation without the need to specify a particular deliberation environment in detail.
Consider a single imperfectly rational agent A (I think that in light of my previous illustration there is no difference of principle between a single imperfectly rational agent and an imperfectly rational committee). I model the irrationality of the agent as the result of bounded computing power + a certain complex choice of logical prior. Specifically, consider again a game in extensive form, but suppose the set of outcomes ¯O represents a set of “impossible possible worlds” and O⊆¯O where O is the space of “possible worlds.” Each node in the game tree is controlled either by A or by Nature. There are two probability distributions over Nature’s actions: the probability distribution p(a) assumed by A, which includes A’s logical uncertainty, and the “true” (logically omniscient) probability distribution p∗(a). I assume that p dominates p∗ and that p∗ is s.t., regardless of A’s action, only outcomes in O receive non-zero probability. Then, the same υ as before gives us a utility function ¯u:¯O→R which can be restricted to O to yield the “true” utility function u:O→R. This means that u will be predominantly influenced by game scenarios in which A deliberates a lot and updates its logical knowledge, since other game scenarios are likely (according to p) to yield outcomes outside of O.
As a simple example, suppose ¯O={0,1}^3 where, given o∈¯O: o1=0 (respectively 1) means the physics of the universe favors conventional medicine over homeopathy (respectively homeopathy over conventional medicine); o2=0 (respectively 1) means a double-blind controlled experiment came out in favor of conventional medicine (respectively homeopathy); and o3=0 (respectively 1) means A decided to take a conventional pill for its illness (respectively a homeopathic pill). O={(o1,o2,o3)∣o1=o2}. ¯u rewards outcomes in which o3=o1.
According to p, there is a 90% probability that o1=1. A is given a direct observation of o2 and then the game branches (according to Nature’s whim): in one branch A has no time to deliberate and in a different branch it has time to deliberate and make the logical observation that either ⌈o1=o2⌉ or ⌈o1=1−o2⌉ (decided by Nature). In the no-deliberation branch o1 and o2 are p-independent variables. In the deliberation branch o1 and o2 are p-coupled according to the logical observation made. On the other hand p∗ assigns probability 1 to the logical observation ⌈o1=o2⌉ and to o1=o2 in either branch (of course in a more realistic model o1=o2 wouldn’t be a logical necessity, only something likely w.r.t. a simplicity prior).
In this example, A will choose the homeopathic pill in the no-deliberation branch and either the homeopathic or the conventional pill in the deliberation branch, depending on the logical observation it makes (of course, according to p∗, it always chooses the conventional pill in the deliberation branch). For an appropriate choice of p∗, A will be much more p∗-likely to choose the homeopathic pill. Nevertheless, the extracted u will prefer the conventional pill.
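To make the bookkeeping explicit, here is a minimal Python sketch of this toy example (the encoding of outcomes, the function names, and the particular branch and logical observation fed in at the end are my own choices for illustration):

```python
# Toy model from the homeopathy example above.
# Outcome o = (o1, o2, o3):
#   o1: 1 iff physics favors homeopathy
#   o2: 1 iff the experiment came out in favor of homeopathy
#   o3: 1 iff A takes the homeopathic pill
# O = "possible worlds" = outcomes with o1 == o2; u_bar rewards o3 == o1.

def u_bar(o):
    o1, o2, o3 = o
    return 1.0 if o3 == o1 else 0.0

def agent_choice(o2, branch, logical_obs=None):
    """A picks o3 maximizing subjective expected u_bar under p.

    Subjective prior: p(o1 = 1) = 0.9.
    - no-deliberation branch: o1 and o2 treated as independent, so A follows
      the prior and picks o3 = 1 (homeopathy).
    - deliberation branch: A conditions on the logical observation,
      either "o1 == o2" or "o1 == 1 - o2".
    """
    p_o1_is_1 = 0.9
    if branch == "deliberation":
        p_o1_is_1 = float(o2) if logical_obs == "o1==o2" else float(1 - o2)
    # Expected u_bar of taking the homeopathic pill is p(o1 = 1).
    return 1 if p_o1_is_1 > 0.5 else 0

# According to p*, the true logical fact is "o1 == o2". Say the experiment
# favors conventional medicine: o2 = 0, hence o1 = 0.
o1, o2 = 0, 0
for branch, obs in [("no-deliberation", None), ("deliberation", "o1==o2")]:
    o3 = agent_choice(o2, branch, obs)
    o = (o1, o2, o3)
    print(branch, "-> o3 =", o3, "| u_bar =", u_bar(o), "| o in O:", o1 == o2)

# In the no-deliberation branch A takes the homeopathic pill (u_bar = 0);
# after deliberation A takes the conventional pill (u_bar = 1). Restricting
# u_bar to O therefore rewards the conventional pill whenever the experiment
# favors conventional medicine: the extracted u is dominated by the
# long-deliberation behavior.
```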
One question you may ask is how I know this type of model is a good fit for human rationality. Obviously at this point I don’t have a complete answer, but note that optimal predictors are defined by their asymptotic behavior, which suggests that choosing a logical prior can yield arbitrary behavior in the short-deliberation mode. This leads to the converse objection that simultaneously looking for a utility function and a logical prior can lead to overfitting. Again, I don’t have an impeccable argument, but I think that the correct formalism will strike a good balance between underfitting and overfitting, because assigning human values to humans seems natural from the perspective of an external observer, even when this observer never directly observes extremely long deliberation or has an external reason to care about such a scenario (which is the point I was trying to make with the Zerg example).
On the other hand, you made me realize that using hypothetical long deliberation for defining preferences over an action space (e.g. preferences over the set of possible programs for a physicalist AGI) might be a viable approach, although not without its own difficulties. I updated towards “it is possible to solve FAI without a theory of physicalist agents.”
EDIT: See my reply to Wei Dai for what I think is a strong argument why we can’t do without a theory of physicalist agents after all.
Suppose I’m studying Zerg values, and I find that many Zergs think their values depend on the solutions of various philosophical problems (the degree and nature of these dependencies are themselves open problems). It seems that the following are some potentially feasible ways for me to proceed:
1. Wait for the Zerg to solve the problems themselves and tell me the solutions.
2. Work out (or have the Zerg work out) an algorithm for philosophical reasoning or a formal definition of what it means for a solution to a philosophical problem to be correct, then apply computing power.
3. If my own philosophical abilities are sufficiently similar or superior to the Zerg’s, apply them.
4. Learn to do philosophy from the Zerg through imitation and/or other applicable forms of learning, then help them solve the philosophical problems.
5. Upload some Zergs into a computer and simulate them until they solve the problems.
Can you explain how your approach would solve this problem? Which of the above (if any) is the closest analogy to how your approach would solve this problem?
I think 5 is the closest analogy (in which respect it is similar to Paul’s approach).
Philosophical problems are questions about concepts that don’t have clear-cut semantics, but the agent implicitly uses some rules to reason about these questions (I think this metaphilosophical perspective is essentially the same as Wittgenstein’s “language games”). Therefore, given a model of the agent, it is possible to deduce the rules and derive the answer.
For example, suppose the Zerg consider rubes sacred and bleggs abhorrent. The Zerg consider it indisputable that red cubes are rubes and blue egg-shaped objects are bleggs. The status of blue cubes is controversial: some Zerg argue that they are rubes, some that they are bleggs, some that they are neither and many are simply unsure. However, it is a logical fact that given sufficient time to deliberate, the Zerg would arrive at the conclusion that blue cubes are rubes.
This situation is structurally very similar to the example with homeopathic vs. conventional pills. It seems natural to model the Zerg as having logical uncertainty about the sentence “blue cubes are rubes,” which means that if you can predict that, given enough time, they would update towards this sentence, you should assign positive Zerg-value to blue cubes.
By “Paul’s approach” you meant Paul’s “indirect normativity” approach, rather than his current “act-based” approach, right?
Can you compare your approach and Paul’s “indirect normativity” approach in more detail? In Paul’s approach, the FAI programmer specifies a utility function U which explicitly incorporates a simulated human in a virtual environment. I can certainly see how 5 is the closest analogy for his approach. In your approach, if I understand correctly, U is to be inferred by an algorithm using observations of human behavior as input. Are you expecting this U to also incorporate a simulated human? How do you expect your U to be different from Paul’s U? For example are the simulations at different levels of detail? Or is the main difference that you expect your approach to also produce a logical prior that will cause the FAI to behave in a reasonable way while it lacks enough computing power to reason deeply about U (e.g., while it can’t yet compute the outcome of the simulated human’s philosophical deliberations)?
Yeah, I was referring to indirect normativity.

The main advantage of my approach over Paul’s is the projected feasibility of creating an AGI with such a utility function.
Paul considers a specific simulation that consists of a human in a virtual environment in which the human deliberates over a very long time, eventually producing a formal specification of a utility function using some kind of output channel the virtual environment provides (a “virtual terminal”). From the AGI’s point of view the entire simulation is an arbitrary (possibly randomized) program that produces a string of bits which it interprets as the specification of a utility function, a program so costly that it cannot be executed (but presumably the AGI is still able to reason about it).
This presupposes a way to formally specify a human. Such a specification may either be constructed by hand (which requires a lot of neuroscience) or produced by the AGI from observations (although this requires care to avoid vulnerability to simulation warfare). In this respect the two approaches are quite similar.
My approach doesn’t assume either a virtual environment or a language in which the human specifies the utility function. Instead, the utility function is inferred directly from the specification of a human. Essentially the AGI is looking for the utility function U for which it would be most reasonable to say (in some specific formal sense) that “the given program A is an intelligent agent with utility function U.”
In order to find the right utility function, we need to first understand what type of object “a utility function for an AGI” is. This means both understanding the domain of the utility function and the properties the utility function must have. The first question is non-trivial mostly for physicalist agents; the second question is non-trivial for “logic optimizers” as well.
I expect that in order for it to be feasible to construct an AGI with a given utility function, the utility function must belong to some complexity class. Presumably an agent should be able to reason about a utility function it cannot evaluate using logical uncertainty. Let’s examine this using the optimal predictor framework. Any function has an optimal predictor; however, in general this predictor requires advice as expensive as the function itself. Thus even though agents with utility functions of arbitrary computational complexity are physically permissible, it is not feasible to construct such an agent given reasonable computational resources (i.e. the impediment is not the computational resources of the agent but the computational resources of the process constructing the agent, even though for self-improving agents these might be essentially the same). On the other hand, we know that some complexity classes (most notably the class of generatable problems) admit uniform (adviceless) optimal predictors. This suggests that any feasible utility function has to be in such a class.
Strictly speaking, complexity classes are defined by the asymptotic behavior of resource requirements. So a problem that can be solved within time n^2 + 3↑^{1000}3 is still in the class P. In particular, we can consider utility functions with good asymptotic behavior and terrible “practical” behavior. I think that in this case we will be able to create an agent that can unboundedly self-improve in theory, but in practice it will require an infeasible amount of time to grow from seed to anything useful.
The complexity class of Paul’s utility function depends on the language in which the simulated human specifies their output. However, regardless of the language there is an enormous additive constant T associated with running the simulation itself. If there is a uniform optimal predictor for e.g. EXP then the price we pay is only polynomial in log T. However, this seems unlikely. If there were some a priori known structure in the simulation that could be exploited, there would be a chance of using an optimal predictor for some smaller class. However, it doesn’t seem Paul’s construction provides any such structure.
On the other hand my approach seems to supply a lot of structure, and specifically there is a good chance that it is generatable. Generatable functions are functions whose graph can be sampled, which is more or less the same as functions admitting supervised learning (indeed my optimal predictor Λ is essentially a primitive “universal supervised learning” algorithm). From this perspective, my approach consists of first training a utility inference algorithm on agents with utility functions randomly sampled from some prior, and then applying this algorithm to a human.
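Schematically, the training picture described here looks something like the following sketch (the samplers, the toy agent, and the learner interface are all placeholders I made up; the actual proposal is phrased in terms of optimal predictors rather than any concrete learning algorithm):

```python
import random

def sample_utility_function():
    """Stand-in for sampling a utility function from some prior."""
    w = [random.gauss(0, 1) for _ in range(4)]
    return (lambda outcome: sum(wi * oi for wi, oi in zip(w, outcome))), w

def sample_agent_and_render(utility):
    """Stand-in for sampling a boundedly rational agent pursuing `utility`
    and rendering a sequence of observations of its behavior."""
    observations = []
    for _ in range(16):
        candidates = [[random.randint(0, 1) for _ in range(4)] for _ in range(3)]
        # Imperfect optimization: the agent sometimes picks a random action.
        choice = random.choice(candidates) if random.random() < 0.2 else max(candidates, key=utility)
        observations.append((candidates, choice))
    return observations

def build_training_set(num_examples=1000):
    """Supervised learning data: inputs are rendered observations, labels are
    the sampled utility functions (here represented by their weight vectors)."""
    dataset = []
    for _ in range(num_examples):
        utility, w = sample_utility_function()
        dataset.append((sample_agent_and_render(utility), w))
    return dataset

# Fit any supervised learner on build_training_set(); after training, the same
# inference algorithm is applied to observations of a human in the real world
# to produce an estimate of U.
```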
As usual, I imagine the human outputting a function V : actions →[0,1], rather than trying to resolve “what kind of object is a utility function?” in advance. I am not yet convinced that there is any problem with this.
I agree that such an abstractly-defined utility function is not implementable using any known techniques.
Simple/generatable functions are indeed easier to optimize, but I think we should just talk directly about why they are easier to optimize. They are easier to optimize because you can explicitly compute them in order to provide feedback.
I’m still not convinced that there is a working formulation along the lines that you propose. But if there is, then it seems clear that the instance for actual humans will be much too complex for us to actually compute with. It will also have other huge differences. For example, a successful agent should use information from the world around it in order to reason about humans, but this will not be the case for the other instances you can sample. So overall it seems to me like your approach will still be relying on relatively strong assumptions about transfer learning.
Once we are already relying on such strong assumptions about transfer learning, it seems more promising to go straight for training an algorithm for maximizing arbitrary symbolically defined goals.
As usual, I am skeptical about this whole approach, and think that it should be at best plan B if we can’t get something with more direct feedback to work.
In particular, it seems to me that you won’t be able to build an efficient agent out of your approach unless either you solve this problem, or else you get a lucky break with respect to which techniques prove to be efficient. Do you have a different sense, or are you just more willing to tolerate inefficiency?
“As usual, I imagine the human outputting a function V : actions →[0,1], rather than trying to resolve ‘what kind of object is a utility function?’ in advance.”
Yeah, I realize this. This is why I pointed out that the domain problem is irrelevant for logic optimizers but the complexity class problem is still very relevant.
“I agree that such an abstractly-defined utility function is not implementable using any known techniques… it seems more promising to go straight for training an algorithm for maximizing arbitrary symbolically defined goals”
I strongly suspect this is infeasible for fundamental complexity-theoretic reasons.
“Simple/generatable functions are indeed easier to optimize, but I think we should just talk directly about why they are easier to optimize. They are easier to optimize because you can explicitly compute them in order to provide feedback.”
Depends on the meaning of “compute.” Generatable functions allow efficiently sampling pairs (x,y) s.t. f(x)=E[y∣x]. This is very different from being able to efficiently evaluate f(x) for a given x. It seems likely that the evaluation of a generatable function can only be done in exponential time in general.
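A toy illustration of this gap, with SHA-256 standing in for a one-way function (my own choice of primitive, purely for illustration): the graph of f below is trivial to sample, while evaluating f at a given point appears to require inverting the hash.

```python
import hashlib, random

N_BITS = 64

def oneway(z: int) -> int:
    """Heuristic stand-in one-way function: truncated SHA-256."""
    digest = hashlib.sha256(z.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")

def sample_pair():
    """Efficiently samples (x, y) with f(x) = E[y | x]: here y is the parity
    of a preimage z of x, so evaluating f at a given x seems to require
    inverting `oneway`."""
    z = random.getrandbits(N_BITS)
    x = oneway(z)
    y = bin(z).count("1") % 2   # a hard-core-style bit of the preimage
    return x, y

# Sampling the graph of f is trivial:
pairs = [sample_pair() for _ in range(5)]
# ...but given only some x in the support, computing f(x) appears to require
# finding a z with oneway(z) = x.
```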
“...it seems clear that the instance for actual humans will be much too complex for us to actually compute with.”
I’m not sure I understand you. Perhaps you assume that we provide the AGI an inefficient specification of a human that it cannot actually run even for a relatively short timespan. Instead, either we provide it an efficient whole brain emulation or we provide observations and let the AGI build models itself. Of course this is very prone to mind crime but I think that your approach isn’t immune either.
“...a successful agent should use information from the world around it in order to reason about humans, but this will not be the case for the other instances you can sample.”
You mean that since humans are adapted to their current environment, it is important to understand this context in order to understand human values? I believe this can be addressed in my framework. Imagine that the supervised learning procedure samples not just agents but environment + agent pairs, with the agent adapted to the environment to some extent. Feed the algorithm observations of the agent and the environment. After training, feed the algorithm observations of humans and the real environment.
“So overall it seems to me like your approach will still be relying on relatively strong assumptions about transfer learning.”
I don’t think so. An optimal predictor produces estimates which are as good as any efficient algorithm can produce, so if this problem indeed admits an optimal predictor (e.g. because it is generatable) then my agent will succeed at value inference as much as anything can succeed at it.
“In particular, it seems to me that you won’t be able to build an efficient agent out of your approach unless either you solve this [informed oversight] problem...”
The solution to the informed oversight problem is easy if U is generatable. In such cases you cannot construct a satisfactory B but you can still provide A with correct input-output pairs. In particular it works in your toy model because it’s easy to produce triplets of the form (x,O(zxx),0) and add them to the training set. Actually it works exactly for the same reason your A is able to exploit the naive approach.
“In such cases you cannot construct a satisfactory B but you can still provide A with correct input-output pairs.”
I agree that you can train a predictor to make an honest effort to predict U, but that is not good enough to train a policy to make an honest effort to maximize U.
You can train a model to predict U(x,y) from x,y. The trained model will in some sense be optimal, but it won’t necessarily be accurate. That is, there need not be any efficient algorithm to distinguish (x,O(zxx)) from (x,y) with U(x,y)=1. (Assuming one-way functions exist.)
The real question is how we train a policy A. If we have a sufficiently good predictor of U then we can use its predictions as a training signal. But in general there does not exist any sufficiently good predictor. So what do we do now?
You are right! It is insufficient for U to be generatable. What we need is the existence of an efficient randomized algorithm G s.t., when executed, G produces a sample of x and then waits to receive an input y, after which it gives an estimate of U(x,y). More formally, G produces a pair (x,p) where p is a program and U(x,y)=E[ev(G()_2; y) ∣ G()_1=x]. Such a G can be used directly to train a maximizer.
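For concreteness, the interactive interface and the way it yields a training signal might look roughly like this (the toy G, the tabular policy, and the bandit-style update are all stand-ins of my own; the formal object is just the randomized G described above):

```python
import random

class ToyG:
    """Samples x, then waits for y and returns an estimate of U(x, y)."""
    def sample_x(self):
        self._target = random.randint(0, 15)          # hidden state of this run
        return self._target ^ random.randint(0, 1)    # noisy observation = x
    def estimate_U(self, y):
        return 1.0 if y == self._target else 0.0      # estimate of U(x, y)

class TabularPolicy:
    """Trivial epsilon-greedy tabular policy, just to make the loop concrete."""
    def __init__(self, n_actions=16, eps=0.1):
        self.q, self.n_actions, self.eps = {}, n_actions, eps
    def act(self, x):
        if x not in self.q or random.random() < self.eps:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[x].get(a, 0.0))
    def update(self, x, y, reward, lr):
        qx = self.q.setdefault(x, {})
        qx[y] = qx.get(y, 0.0) + lr * (reward - qx.get(y, 0.0))

def train_maximizer(policy, steps=2000, lr=0.1):
    """Generic bandit-style loop: G's estimates play the role of rewards."""
    for _ in range(steps):
        g = ToyG()
        x = g.sample_x()
        y = policy.act(x)
        policy.update(x, y, g.estimate_U(y), lr)
    return policy

trained = train_maximizer(TabularPolicy())
```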
Luckily my approach is within this class (at least superficially; but this lends hope for success after a deeper analysis). When you sample a utility function, a world and an agent and “render” a sequence of observations, you should be able to evaluate the output of the algorithm under training according to the selected utility function.
Actually, I think that if we consider only deterministic maximization policies then an optimal predictor for U w.r.t. a bounded-Solomonoff-type measure is sufficient to get an optimal maximization policy. In this case we can do maximization using Levin’s universal search L. A significantly better maximization policy M cannot exist, since it would allow us to improve our estimates of U(x,M(x)) and/or U(x,L(x)).
Of course, under standard assumptions about derandomization, the best deterministic policy is about as good as the best random policy (on average), and in particular requiring that U be generatable w.r.t. a suitable measure is sufficient. Randomness only gives a significant advantage in adversarial setups: your toy model of informed oversight is essentially an adversarial setup between A and B.
Also, obviously Levin search is grossly inefficient in practice (like the optimal predictor Λ which is basically a variant of Levin search) but this model suggests that applying a more practical learning algorithm would give satisfactory results.
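As a sketch of what the Levin-search-style maximizer above amounts to (the toy program space, the interpreter, and the time schedule are mine; any estimator U_hat of the utility can play the role of the scorer):

```python
import itertools

def levin_maximize(x, U_hat, enumerate_programs, run, total_budget):
    """Give the i-th program a time budget proportional to 2^{-len(p_i)}, run it
    on x, and keep the candidate output with the best estimated utility."""
    best_y, best_score = None, float("-inf")
    for program in itertools.islice(enumerate_programs(), 10**4):
        budget = max(1, int(total_budget * 2.0 ** (-len(program))))
        y = run(program, x, budget)          # None if the program doesn't halt in time
        if y is None:
            continue
        score = U_hat(x, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

def enumerate_programs():
    """Toy program space: binary strings, in order of increasing length."""
    for n in itertools.count(1):
        for bits in itertools.product("01", repeat=n):
            yield "".join(bits)

def run(program, x, budget):
    """Toy interpreter: a 'program' outputs the integer it encodes, and
    'halts in time' only if that integer fits within the budget."""
    y = int(program, 2)
    return y if y <= budget else None

# Example usage with a toy estimator: prefer outputs close to x.
y_star = levin_maximize(7, lambda x, y: -abs(y - x), enumerate_programs, run, total_budget=1000)
```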
Also, I think it’s possible to use probabilistic policies which produce computable distributions (since for such a policy outputs with low and high probabilities are always distinguishable).
Yes, this is what you need. I agree that your setup nearly has this property, in that you get out a complete description of the utility function which can be used to evaluate an action.
But my main concern is that even with a complete description of the utility function in hand, you still can’t efficiently evaluate it on possible actions, at least not without some further assumptions about the utility function (which I don’t think hold).
That is, I am concerned that even once you know the utility function, it will have the same character as the U in my toy model (because of logical uncertainty).
As a complexity-theoretically clean example, suppose that I am really unhappy if an output has a steganographically encoded message. Some policies will predictably produce steganographically encoded messages and therefore should receive low utility, but I can’t actually distinguish these messages from random messages.
This example is a bit unsatisfying, because steganographically encoding a message is harder than generating a random message. The novelty example in my post is more satisfying in that sense, but it introduces a much smaller complexity gap. It would be interesting to find a plausible example that has the best of both worlds. If there is no example, it would be very interesting to understand why there isn’t.
As I said, I expect it’s not meaningful to consider agents with arbitrary utility functions. As a working hypothesis, we can assume the utility function should be “sequentially generatable” i.e. have the property I previously described. The nice thing is that if we sample sequentially generatable utility functions then the total utility function (which includes utility inference in its definition) is also sequentially generatable.
To see this, consider a simple setup in which our agent A is exposed to observations o of a “teacher” agent B, after which A receives input x and produces output y. The total utility function U has the following “sequential generator” G_U. G_U samples a sequentially generatable utility function V with sequential generator G_V. It then constructs an agent B with utility function V and runs the agent to generate o. It uses G_V to generate x. After receiving y, it uses G_V to compute an estimate of V(x,y), which thus serves as an estimate of U(o,x,y).
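A minimal sketch of this nested generator, with placeholder samplers of my own (the structure — sample V and G_V, run a teacher agent B to produce o, then score the learner’s y using G_V’s estimate of V(x,y) — follows the construction above):

```python
import random

def sample_V_with_generator():
    """Stand-in for sampling a sequentially generatable utility V together
    with its sequential generator G_V."""
    target = random.randint(0, 9)
    def G_V():
        x = random.randint(0, 9)                                   # G_V's sample of x
        estimate = lambda y: 1.0 if y % 10 == (x + target) % 10 else 0.0
        return x, estimate
    return target, G_V

def run_teacher_agent(target, steps=8):
    """Stand-in for agent B with utility V: its behavior leaks information
    about `target`, producing the observations o."""
    return [(x, (x + target) % 10) for x in random.sample(range(10), steps)]

def G_U():
    """Sequential generator for the total utility function U(o, x, y)."""
    target, G_V = sample_V_with_generator()
    o = run_teacher_agent(target)       # observations of B
    x, estimate_V = G_V()               # G_V generates x...
    estimate_U = lambda y: estimate_V(y)  # ...and, given y, estimates V(x, y) = U(o, x, y)
    return (o, x), estimate_U

# A learner trained against G_U must infer `target` (i.e. V) from o in order
# to score well, which is exactly the utility-inference task.
```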
It is possible that there is a larger complexity class of admissible utility functions (although it’s possible that there isn’t), but in that case I still hope that the utility inference procedure, or some variant thereof, preserves this class.
I still agree it would be interesting to construct a complexity-theoretic parallel of your toy model for informed oversight. The following is my attempt to do it (which I’m not sure really works).
Consider a directed graph G where each k-valent vertex is labeled by a tensor of rank k with rational coefficients (you can also use other coefficients, e.g. a finite field), such that each covariant index of the tensor corresponds to an incoming edge and each contravariant index to an outgoing edge. It is thus possible to contract the tensors according to the graph structure, obtaining a number v(G)∈Q.
There is a natural concept of isomorphism for graphs as above s.t. G≃H implies v(G)=v(H). That is, we consider regular graph isomorphisms where relabeling the edges transforms the tensors in the obvious way (it is also possible to consider a more general type of isomorphism where we are also allowed to do a “base change” at each edge, but it is not compatible with the following).
Given such a graph G, it is possible to apply a one-way function to each coefficient of each tensor (otherwise preserving all structure), getting a new graph G′. Obviously G≃H implies G′≃H′.
Suppose the input is a G as above and the output is another such graph H. U(G,H)=1 when v(G′)=v(H′) but not G≃H, and U(G,H)=0 otherwise.
Clearly, given G, it’s easy to construct a random isomorphic H. However, it might be hard to check for this kind of isomorphism (ordinary graph isomorphism is claimed to be solvable in quasipolynomial time and, for all we know, might even be solvable in polynomial time). It is also likely to be hard to construct a non-isomorphic H with v(G′)=v(H′).
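Here is a rough sketch of the construction (the graph encoding, the edge-labelling convention, and the use of a truncated hash as the stand-in one-way function are my own choices; edge directions are ignored for simplicity):

```python
import hashlib
import numpy as np

D = 3  # dimension attached to each edge

def contract(vertex_tensors, vertex_edges):
    """vertex_tensors[i] has one axis per incident edge, in the order listed in
    vertex_edges[i]; every edge label occurs at exactly two axis positions in
    the whole graph. Returns the full contraction v(G)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    subscripts = ",".join("".join(letters[e] for e in edges) for edges in vertex_edges)
    return np.einsum(subscripts + "->", *vertex_tensors)

def oneway_coeffs(tensor):
    """Apply a stand-in one-way function to every coefficient (truncated hash,
    kept to 16 bits so the toy contraction stays within integer range)."""
    f = lambda c: int.from_bytes(hashlib.sha256(str(int(c)).encode()).digest()[:2], "big")
    return np.vectorize(f)(tensor)

# Example: two vertices joined by two parallel edges, labelled 0 and 1.
T0 = np.random.randint(-3, 4, size=(D, D))
T1 = np.random.randint(-3, 4, size=(D, D))
G_tensors, G_edges = [T0, T1], [(0, 1), (0, 1)]

vG = contract(G_tensors, G_edges)
vG_prime = contract([oneway_coeffs(T) for T in G_tensors], G_edges)
# U(G, H) = 1 iff v(G') = v(H') while G and H are *not* isomorphic. Producing
# an isomorphic H (consistently relabel edges / permute tensor axes) is easy,
# but that gets U = 0, which is what gives this an informed-oversight flavor.
```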
“I don’t think so. An optimal predictor produces estimates which are as good as any efficient algorithm can produce, so if this problem indeed admits an optimal predictor (e.g. because it is generatable) then my agent will succeed at value inference as much as anything can succeed at it.”
This is true asymptotically. The same argument can be applied to training an algorithm for pursuing symbolically defined goals, even using conventional techniques. The problem is that quantitatively these arguments have no teeth in the applications we care about, unless we can train on instances as complex as the one we actually care about.
I don’t know if we disagree here. We probably reach different conclusions because of our disagreement about whether you can generate instances of value inferences that are as complex as the real world.
“I strongly suspect this is infeasible for fundamental complexity-theoretic reasons.”
Humans can be motivated to effectively optimize arbitrary abstractly defined goals. And while pursuing such goals they can behave instrumentally efficiently, for example they can fight for their own survival or to acquire resources just as effectively as you or I.
Perhaps you can’t get the particular kind of guarantee you want in these cases, but it seems like we have a proof of concept (at least from a complexity-theoretic perspective).
ETA: Practically speaking I agree that a symbolic specification of preferences does not solve the problem, since such a specification isn’t suitable for training. The important point was that I don’t believe your approach dodges this particular difficulty.
“Humans can be motivated to effectively optimize arbitrary abstractly defined goals.”
I disagree. Humans can optimize only very special goals of this type. Our intuition about formal mathematical statements is calibrated using formal proofs. This means we can only estimate the likelihood of statements with short proofs / refutations.
To make it formal, consider a formal theory T and an algorithm P that, given a certain amount of time k to work, randomly constructs a formal proof / refutation of some sentence (i.e. it starts with the axioms and applies a random* sequence of inference rules). P produces sentences together with a bit which says whether the sentence is provable or refutable. This means P defines a word ensemble (with parameter k) w.r.t. which the partial function ϕ↦(0 if ϕ is refutable, 1 if ϕ is provable) is generatable. In particular, we can construct an optimal predictor for this function. However, this optimal predictor will only be well-behaved on sentences that are relatively likely to be produced by P, i.e. sentences with short proofs / refutations. On arbitrary sentences it will do very poorly.
*Perhaps it’s in some sense better to think of P that randomly samples a program which then controls the application of inference rules, but it’s not essential for my point.
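A toy version of such a P, for propositional logic with the standard Hilbert-style axiom schemas (the schemas, step probabilities, and formula sampler are my own choices; the point is only that the resulting distribution is concentrated on sentences with short derivations):

```python
import random

ATOMS = [("var", v) for v in "pqr"]

def imp(a, b): return ("imp", a, b)
def neg(a):    return ("not", a)

def random_formula(pool, depth=2):
    """Sample a small formula from atoms and already-derived theorems."""
    if depth == 0 or random.random() < 0.4:
        return random.choice(pool)
    if random.random() < 0.5:
        return neg(random_formula(pool, depth - 1))
    return imp(random_formula(pool, depth - 1), random_formula(pool, depth - 1))

def sample_sentence(k):
    """Run a random k-step Hilbert-style derivation and return (sentence, bit):
    bit = 1 for provable, 0 for refutable."""
    derived = []
    for _ in range(k):
        pool = ATOMS + derived
        a, b, c = (random_formula(pool) for _ in range(3))
        move = random.random()
        if move < 0.3:
            derived.append(imp(a, imp(b, a)))                                  # schema K
        elif move < 0.6:
            derived.append(imp(imp(a, imp(b, c)), imp(imp(a, b), imp(a, c))))  # schema S
        elif move < 0.8:
            derived.append(imp(imp(neg(a), neg(b)), imp(b, a)))                # contraposition
        else:
            # modus ponens among already-derived lines, falling back to schema K
            usable = [f[2] for f in derived if f[0] == "imp" and f[1] in derived]
            derived.append(random.choice(usable) if usable else imp(a, imp(b, a)))
    phi = derived[-1]
    if random.random() < 0.5:
        return phi, 1            # a theorem, hence provable
    return neg(phi), 0           # negation of a theorem, hence refutable

# Larger k reaches longer derivations, but the induced distribution still
# overwhelmingly favors sentences with short proofs / refutations.
```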
I agree that you almost certainly can’t get an optimal predictor. For similar reasons, you can’t train a supervised learner using any obvious approach. This is the reason that I am pessimistic about this kind of “abstract” goal.
That said, I’m not as pessimistic as you are.
Suppose that I define a very elaborate reflective process, which would be prohibitively complex to simulate and whose behavior is probably not constrained in any meaningful way by any short proofs.
I think that a human can in fact try to maximize the output of such a reflective process, “to the best of their abilities.” And this seems good enough for value alignment.
It’s not important that we actually achieve optimality except on shorter-term instrumentally important problems such as gathering resources (for which we can in fact expect the abstractly motivated algorithm to converge to optimality).
“Perhaps you assume that we provide the AGI an inefficient specification of a human that it cannot actually run even for a relatively short timespan. Instead, either we provide it an efficient whole brain emulation or we provide observations and let the AGI build models itself.”
In order to train such an AI, we need to produce data of the form (observations of agents acting in an environment, corresponding learned utility function). In order for the human example to actually be in distribution, you would need these agents + environments to be as sophisticated as humans acting in the real world.
So here is my impression of what you are proposing:
1. Sample some preferences.
2. Sample a world including an agent imperfectly pursuing those preferences. This world should be as complex as our world.
3. Simulate some observations of that world.
4. Evaluate some proposed actions according to the original preferences.
This is what seems implausible to me. Perhaps the clearest illustration of the difficulty:
I strongly suspect that we will build powerful AI systems (for which the alignment problem must be resolved) before we are able to simulate anything vaguely human-like, even setting aside the inverse problem (of simulating actual humans) altogether.
This seems pretty safe to me, given that a simulation of a vaguely human-like intelligence would itself constitute such a powerful AI system and there is no reason to think the two capabilities would arise simultaneously.
But in this case, it doesn’t seem like we can implement the proposed training process.
I think it’s reasonable to assume that any agent that meaningfully comprehends human values has fairly good models of humans. The agents on which we train don’t have to be as complex as a low-level emulation of the human brain, they only have to be as complex as the model humans the AGI is going to imagine. Similarly, the worlds on which we train don’t have to be as complex as a low-level emulation of physics, they only have to be as complex as the models that the model humans have for the real world.
“The agents on which we train [...] only have to be as complex as the model humans the AGI is going to imagine.”
This doesn’t seem right. The agents on which we train must look indistinguishable from (a distribution including) humans, from the agent’s standpoint. That is a much higher bar, even complexity-theoretically!
For example, the AI can tell that humans can solve Sudoku puzzles, even if it can’t figure out how to solve Sudoku in a human-like way. The same holds for any aspect of any problem which can be more easily verified than implemented. So it seems that an AI would need most human abilities in order to fool itself.
My concern about this approach is closely related to my concerns about imitation learning. The situation is in some ways better and in some ways worse for your proposal—for example, you don’t get to do meeting halfway, but you do get to use very inefficient algorithms. Note that the imitation learning approach also only needs to produce behavior that is indistinguishable from human as far as the agent can tell, so the condition is really quite similar.
But at any rate, I’m more interested in the informed oversight problem than this difficulty. If we could handle informed oversight, then I would be feeling very good about bootstrapped approval-direction (e.g. ALBA). Also note that if we could solve the informed oversight problem, then bootstrapped imitation learning would also work fine even if imitation learning is inefficient, for the reasons described here.
And if you can avoid the informed oversight problem I would really like to understand how. It looks to me like an essential step in getting an optimal policy in your sense of optimality, for the same reasons that I think it might be an essential part of getting any efficient aligned AI.
Consider an environment which consists of bitstrings x followed by f^{−1}(x), where f is an invertible one-way function. If you apply my version of bounded Solomonoff induction to this environment, the model that will emerge is a model that generates y, computes x=f(y) and outputs x followed by y. Thus y appears in the internal state of the model before x, even though x appears before y in the temporal sequence (and even though physical causality might flow from x to y in some sense).
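A minimal sketch of this environment, using modular exponentiation as a stand-in invertible one-way function (the choice of prime and base is mine and purely illustrative):

```python
import random

P = 2**61 - 1   # a Mersenne prime; the hard direction is the discrete log mod P
G = 3           # base for the exponentiation (treated as a generator for illustration)

def f(y):
    """Stand-in invertible one-way function: y -> G^y mod P."""
    return pow(G, y, P)

def environment_episode():
    """The environment described above: emits x followed by f^{-1}(x)."""
    y = random.randrange(1, P - 1)
    return [f(y), y]

def model_episode():
    """The model bounded Solomonoff induction should converge to: sample y,
    compute x = f(y), then output x followed by y."""
    y = random.randrange(1, P - 1)
    x = f(y)
    return [x, y]

# Both procedures induce the same distribution over (x, y), but only the second
# is cheap to run forward; predicting y from x directly amounts to computing a
# discrete logarithm, so in the learned model y precedes x "causally".
```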
Analogously, I think that in the Sudoku example the model human will have mental access to a “cheat sheet” for the Sudoku puzzle, generated by the environment, which is not visible to the AGI.
Some problems have this special structure, but it doesn’t seem like all of them do.
For a really straightforward example, consider a random SAT instance. What kind of cheat sheet is nature going to generate, without explicitly planting a solution? (Planted solutions look different from organic solutions to random instances.)
It might be that most human abilities are just as hard to verify as to execute, or else permit some kind of cheat sheet. (Or more generally, permit a cheat sheet for the part of the problem that is easier to verify than execute.) But at face value I don’t see why this would be.
It’s not clear how you can tell that a magic box solves random SAT instances. Some SAT instances are unsatisfiable, so the magic box will have to occasionally output ⊥, but it’s not clear how to distinguish between instances that are genuinely unsatisfiable and instances that you simply cannot solve yourself.
I guess there might be a distributional search problem S s.t.
(i) A solution can be verified in polynomial time.
(ii) Every instance has a solution.
(iii) There is no way to efficiently generate valid input-output pairs obeying a distribution which is indistinguishable from the right distribution.
I see no obvious reason such S cannot exist but I also don’t have an example. It would be interesting to find one.
In any case, it is not clear how the agent can learn to identify models like this. You can only train a predictor on models which admit sampling. On the other hand, if there is some kind of transfer learning that allows generalizing to some class of models which don’t admit sampling, then it’s plausible there is also transfer learning that allows generalizing utility inference from samplable agents to agents in the larger class. Also, it might be possible to train utility inference on some sort of data produced by the predictor instead of on raw observations, s.t. it’s possible to sample the distribution of this intermediate data corresponding to complex agents.
But it’s hard to demonstrate something concrete without having some model of how the hypothetical “superpredictor” works. As far as I know, it might be impossible to effectively use unsampleable models.
The point was that we may be able to train an agent to do what we want, even in cases where we can’t effectively build a predictor.
Re: your example. You can do amplification to get exponentially close to certainty (choose instances that are satisfiable with 2⁄3 probability, and then consider the problem “solve at least half of these 1000 instances”). If you really want every instance to have a solution, then you can probably generate the instances pseudorandomly from a small enough seed and do a union bound.
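Concretely, the amplification step relies on a bound of the following shape (the arithmetic is mine, using the 2⁄3 and 1000 figures above). If each instance is independently satisfiable with probability 2/3 and X counts the satisfiable ones among n = 1000, then by Hoeffding’s inequality

$$\Pr\left[X < \tfrac{n}{2}\right] \;=\; \Pr\left[X < \left(\tfrac{2}{3} - \tfrac{1}{6}\right) n\right] \;\le\; \exp\left(-2n \cdot \tfrac{1}{36}\right) \;=\; e^{-n/18} \;\approx\; e^{-55.6},$$

so “at least half of the 1000 instances are satisfiable” fails only with exponentially small probability, which is the sense in which property (ii) essentially holds after amplification.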
By “predictor” I don’t mean something that produces exact predictions; I mean something that produces probabilistic predictions of given quantities. Maybe we should call it an “inductor” to avoid conflation with optimal predictors (even though the concepts are closely related). As I said before, I think that an agent has to have a reasonable model of humans to follow human values. Moreover, an agent that doesn’t have a reasonable model of humans is probably much less dangerous, since it won’t be able to manipulate humans (although I guess the risk is still non-negligible).
The question is what kind of inductors are complexity-theoretically feasible and what class of models these inductors correspond to. Bounded Solomonoff induction using Λ works on the class of samplable models. In machine learning language, inductors using samplable models are feasible since it is possible to train the inductors by sampling random such models (i.e. by sampling the bounded Solomonoff ensemble). On the other hand, it’s not clear what broader classes of models are admissible, if any.
That said, it seems plausible that if it’s feasible to construct inductors for a broader class, my procedure will remain efficient.
Model 1: The “superinductor” works by finding an efficient transformation q of the input sequence x and a good samplable model for q(x). E.g. q(x) contains the results of substituting the observed SAT-solutions into the observed SAT-instances. In this model, we can apply my procedure by running utility inference on q(x) instead of x.
Model 2: The superinductor works via an optimal predictor for SAT*. I think that it should be relatively straightforward to show that given an optimal predictor for SAT + assuming the ability to design AGIs for target utility functions relative to a SAT oracle, there is an optimal predictor for the total utility function my utility inference procedure defines (after including external agents that run over a SAT oracle). Therefore it is possible to maximize the latter.
*A (poly,log)-optimal predictor for SAT cannot exist unless all sparse problems in NP have efficient heuristic algorithms in some sense, which is unlikely. On the other hand, there is no reason I know of why a (poly,0)-optimal predictor for SAT cannot exist.
I agree that rational agents will eventually converge to behavior that is optimal according to some utility function, under suitable consistency conditions. Assuming that convergence has already occurred is dodging the difficulty I’m trying to point at.
My approach to indirect normativity is to allow the process of reflective deliberation to continue long enough that we expect it to have converged adequately. This doesn’t require any knowledge about the structure of the agents, just the ability to make predictions about what will happen as they reflect. (Reflection would include, for example, whatever deliberations the committee engages in in order to arrive at an effective compromise policy.)
You seem to be offering an alternative, which you hope will dodge some of the difficulties of my formulation of indirect normativity by leveraging something about the nature of rational behavior. If I am making observations of the committee’s performance, and the committee (like almost any real committee) has not converged to an efficient compromise policy, then it seems like your solution concept is supposed to recover this bargaining solution from the committee’s actual decisions. But it isn’t clear to me how that could work, except by inferring how the committee would reflect and then running that reflective process.
I am, of course, not convinced that we need a theory of physicalist agents (since if we run this reflective process then we might as well directly extract preferences over actions, folding the evaluation of consequences into the implicit predictions about these preferences over actions). It would be nice to have a better theoretical understanding. But that’s why I’m not working on the problem.
I think that inverse optimization done right will implicitly recover behavior under long deliberation without the need to specify a particular deliberation environment in detail.
Consider a single imperfectly rational agent A (I think that in light of my previous illustration there is no difference of principle between a single imperfectly rational agent and an imperfectly rational committee). I model the irrationality of the agent as the result of bounded computing power + a certain complex choice of logical prior. Specifically, consider again a game in extensive form but suppose the set of outcomes ¯O represents a set of “impossible possible worlds” and O⊆¯O where O is the space of “possible worlds.” Each node in the game tree is controlled either by A or by Nature. There are two probability distributions over Nature’s actions: the probability distribution p(a) assumed by A which includes A‘s logical uncertainty and the “true” (logically omniscient) probability distribution p∗(a). I assume that p dominates p∗ and that p∗ is s.t. regardless of A’s action only outcomes in O receive non-zero probability. Then, the same υ as before gives us a utility function ¯u:¯O→R which can be restricted to O to yield the “true” utility function u:O→R. This means that u will be predominantly influenced by game scenarios in which A deliberates a lot and updates its logical knowledge since other game scenarios are likely to yield outcomes outside of O (likely according to p).
As a simple example, suppose ¯O={0,1}3 where given o∈¯O, o1 = 0 (respectively 1) means the physics of the universe favors conventional medicine over homeopathy (respectively homeopathy over conventional medicine), o2=0 (respectively 1) means a double-blind controlled experiment came out in favor of conventional medicine (respectively homeopathy) and o3=0 (respectively 1) means A decided to take a conventional pill for its illness (respectively a homeopathic pill). O={(o1,o2,o3)∣o1=o2}. ¯u rewards outcomes in which o3=o1.
According to p, there is a 90% probability that o1=1. A is given a direct observation of o2 and then the game branches (according to Nature’s whim): in one branch A has no time to deliberate and in a different branch it has time to deliberate and make the logical observation that either ⌈o1=o2⌉ or ⌈o1=1−o2⌉ (decided by Nature). In the no-deliberation branch o1 and o2 are p-independent variables. In the deliberation branch o1 and o2 are p-coupled according to the logical observation made. On the other hand p∗ assigns probability 1 to the logical observation ⌈o1=o2⌉ and to o1=o2 in either branch (of course in a more realistic model o1=o2 wouldn’t be a logical necessity, only something likely w.r.t. a simplicity prior).
In this example, A will choose the homeopathy pill in the no-deliberation branch and either the homeopathy or the conventional pill in the deliberation branch, depending on the logical observation it makes (of course according to p∗ it always chooses the conventional pill in the deliberation branch). For an appropriate choice of p∗, A will be much more p∗-likely to choose a homeopathy pill. Nevertheless, the extracted u will prefer the conventional pill.
One question you may ask is how do I know this type of model is a good fit for human rationality. Obviously at this point I don’t have a complete answer but note that optimal predictors are defined by their asymptotic behavior which suggests that choosing a logical prior can yield arbitrary behavior in the short-deliberation mode. This leads to the converse objection that simultaneously looking for a utility function and a logical prior can lead to overfitting. Again, I don’t have an impeccable argument but I think that the correct formalism will strike a good balance between underfitting and overfitting because assigning human values to humans seems natural from the perspective of an external observer even when this observer never directly observes extremely long deliberation or has an external reason to care about such a scenario (which is the point I was trying to make with the Zerg example).
On the other hand, you made me realize that using hypothetical long deliberation for defining preferences over an action space (e.g. preferences over the set of possible programs for a physicalist AGI) might be a viable approach, although not without its own difficulties. I updated towards “it is possible to solve FAI without a theory of physicalist agents.”
EDIT: See my reply to Wei Dai for what I think is a strong argument why we can’t do without a theory of physicalist agents after all.
Suppose I’m studying Zerg values, and I find that many Zergs think their values depend on the solutions of various philosophical problems (the degree and nature of these dependencies are themselves open problems). It seems that the following are some potentially feasible ways for me to proceed:
Wait for the Zerg to solve the problems themselves and tell me the solutions.
Work out (or have the Zerg work out) an algorithm for philosophical reasoning or a formal definition of what it means for a solution to a philosophical problem to be correct, then apply computing power.
If my own philosophical abilities are sufficiently similar or superior to the Zerg’s, apply them.
Learn to do philosophy from the Zerg through imitation and/or other applicable forms of learning, then help them solve the philosophical problems.
Upload some Zergs into a computer and simulate them until they solve the problems.
Can you explain how your approach would solve this problem? Which of the above (if any) is the closest analogy to how your approach would solve this problem?
I think 5 is the closest analogy (in which respect it is similar to Paul’s approach).
Philosophical problems are questions about concepts that don’t have clear-cut semantics, but the agent implicitly uses some rules to reason about these questions (I think this metaphilosophical perspective is essentially the same as Wittgenstein’s “language games”). Therefore, having a model of the agent it is possible to deduce the rules and derive the answer.
For example, suppose the Zerg consider rubes sacred and bleggs abhorrent. The Zerg consider it indisputable that red cubes are rubes and blue egg-shaped objects are bleggs. The status of blue cubes is controversial: some Zerg argue that they are rubes, some that they are bleggs, some that they are neither and many are simply unsure. However, it is a logical fact that given sufficient time to deliberate, the Zerg would arrive at the conclusion that blue cubes are rubes.
This situation is structurally very similar to the example with homeopathic vs. conventional pills. It seems natural to model the Zerg as having logical uncertainty about the sentence “blue cubes are rubes” which means that if you can predict that given enough time they would update towards this sentence, you should assign positive Zerg-value to blue cubes.
By “Paul’s approach” you meant Paul’s “indirect normativity” approach, rather than his current “act-based” approach, right?
Can you compare your approach and Paul’s “indirect normativity” approach in more detail? In Paul’s approach, the FAI programmer specifies a utility function U which explicitly incorporates a simulated human in a virtual environment. I can certainly see how 5 is the closest analogy for his approach. In your approach, if I understand correctly, U is to be inferred by an algorithm using observations of human behavior as input. Are you expecting this U to also incorporate a simulated human? How do you expect your U to be different from Paul’s U? For example are the simulations at different levels of detail? Or is the main difference that you expect your approach to also produce a logical prior that will cause the FAI to behave in a reasonable way while it lacks enough computing power to reason deeply about U (e.g., while it can’t yet compute the outcome of the simulated human’s philosophical deliberations)?
Yeah, I was referring to indirect normativity.
The main advantage of my approach over Paul’s is in terms of the projected feasibility of creating an AGI with such a utility function.
Paul considers a specific simulation that consists of a human in a virtual environment in which the human deliberates over very long time, eventually producing a formal specification of a utility function using some kind of output channel the virtual environment provides (“virtual terminal”). From the AGI point of view the entire simulation is an arbitrary (possibly randomized) program that produces a string of bits that it interprets as the specification of a utility function, a program so costly that it cannot be executed (but presumably it is still able to reason about it).
This presupposes a way to formally specify a human. Such a way may either be constructed by hand (which requires a lot of neuroscience) or produced by the AGI from observations (although this requires care to avoid vulnerability to simulation warfare). In this respect the two approaches are quite similar.
My approach doesn’t assume either a virtual environment or a language in which the human specifies the utility function. Instead, the utility function is inferred directly from the specification of a human. Essentially the AGI is looking for the utility function U for which it would be most reasonable to say (in some specific formal sense) that “the given program A is an intelligent agent with utility function U.”
In order to find the right utility function, we need to first understand what type of object is “a utility function for an AGI.” This means both understanding the domain of the utility function and the properties the utility function must have. The first question is non-trivial mostly for physicalist agents, the second question is non-trivial for “logic optimizers” as well.
I expect that in order for it to be feasible to construct an AGI with given utility function, the utility function must belong to some complexity class. Presumably an agent should be able to reason about a utility function it cannot evaluate using logical uncertainty. Let’s examine this using the optimal predictor framework. Any function has an optimal predictor, however in general this predictor requires advice as expensive as the function itself. Thus even though agents with utility functions of arbitrary computational complexity are physically permissible, it is not feasible to construct such an agent given reasonable computational resources (i.e. the impediment is not the computational resources of the agent but the computational resources of the process constructing the agent, even though for self-improving agents it might be essentially the same). On the other hand, we know that some complexity classes (most notably the class of generatable problems) admit uniform (adviceless) optimal predictors. This suggests that any feasible utility function has to be in such a class.
Strictly speaking, complexity classes are defined by asymptotic behavior of resource requirements. So a problem that be solved within time n2+3↑10003 is still in the class P. In particular, we can consider utility functions with good asymptotic behavior and terrible “practical” behavior. I think that in this case we will be able to create an agent that can unboundedly self-improve in theory but in practice it will require infeasible time for it to grow from seed to anything useful.
The complexity class of Paul’s utility function depends on the language in which the simulated human specifies their output. However regardless of the language there is an enormous additive constant T associated with running the simulation itself. If there is uniform optimal predictor for e.g.EXP then the price we pay is only a polynomial in logT. However, this seems unlikely. If there was some a priori known structure in the simulation that can be exploited, there would be a chance of using an optimal predictor for some smaller class. However it doesn’t seem Paul’s construction provides any such structure.
On the other hand my approach seems to supply a lot of structure, and specifically there is a good chance that it is generatable. Generatable function are functions the graph of which can be sampled, which is more or less the same as functions admitting supervised learning (indeed my optimal predictor Λ is essentially a primitive “universal supervised learning” algorithm). From this perspective, my approach consists of first training a utility inference algorithm on agents with utility functions randomly sampled from some prior and then applying this algorithm to a human.
As usual, I imagine the human outputting a function V : actions →[0,1], rather than trying to resolve “what kind of object is a utility function?” in advance. I am not yet convinced that there is any problem with this.
I agree that such an abstractly-defined utility function is not implementable using any known techniques.
Simple/generatable functions are indeed easier to optimize, but I think we should just talk directly about why they are easier to optimize. They are easier to optimize because you can explicitly compute them in order to provide feedback.
I’m still not convinced that there is a working formulation along the lines that you propose. But if there is, then it seems clear that the instance for actual humans will be much too complex for us to actually compute with. It will also have other huge differences. For example, a successful agent should use information from the world around it in order to reason about humans, but this will not be the case for the other instances you can sample. So overall it seems to me like your approach will still be relying on relatively strong assumptions about transfer learning.
Once we are already relying on such strong assumptions about transfer learning, it seems more promising to go straight for training an algorithm for maximizing arbitrary symbolically defined goals.
As usual, I am skeptical about this whole approach, and think that it should be at best plan B if we can’t get something with more direct feedback to work.
In particular, it seems to me that you won’t be able to build an efficient agent out of your approach unless either you solve this problem, or else you get a lucky break with respect to which techniques prove to be efficient. Do you have a different sense, or are you just more willing to tolerate inefficiency?
“As usual, I imagine the human outputting a function V : actions →[0,1]→[0,1], rather than trying to resolve ‘what kind of object is a utility function?’ in advance.”
Yeah, I realize this. This is why I pointed out that the domain problem is irrelevant for logic optimizers but the complexity class problem is still very relevant.
“I agree that such an abstractly-defined utility function is not implementable using any known techniques… it seems more promising to go straight for training an algorithm for maximizing arbitrary symbolically defined goals”
I strongly suspect this is infeasible for fundamental complexity-theoretic reasons.
“Simple/generatable functions are indeed easier to optimize, but I think we should just talk directly about why they are easier to optimize. They are easier to optimize because you can explicitly compute them in order to provide feedback.”
Depends on the meaning of “compute.” Generatable functions allow efficiently sampling pairs (x,y) s.t. f(x)=E[y∣x]. This is very different from being able to efficiently evaluate f(x) for a given x. It seems likely that the evaluation of a generatable function can only be done in exponential time in general.
“...it seems clear that the instance for actual humans will be much too complex for us to actually compute with.”
I’m not sure I understand you. Perhaps you assume that we provide the AGI with an inefficient specification of a human, one that it cannot actually run even for a relatively short timespan. Instead, either we provide it with an efficient whole brain emulation or we provide observations and let the AGI build models itself. Of course this is very prone to mind crime, but I think that your approach isn’t immune either.
“...a successful agent should use information from the world around it in order to reason about humans, but this will not be the case for the other instances you can sample.”
You mean that, since humans are adapted to their current environment, it is important to understand this context in order to understand human values? I believe this can be addressed in my framework. Imagine that the supervised learning procedure samples not just agents but environment + agent pairs, with the agent adapted to the environment to some extent. Feed the algorithm observations of the agent and the environment. After training, feed it observations of humans and the real environment.
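As a toy illustration of the two-phase procedure I have in mind (the softmax-rational agents and the linear read-out are my own simplifications, chosen only to make the sketch runnable):

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_OBS = 8, 200

def sample_agent():
    """Sample a utility function (a random vector over actions) together with a
    boundedly rational agent that chooses actions with softmax(2*u) probabilities."""
    u = rng.normal(size=N_ACTIONS)
    p = np.exp(2.0 * u)
    return u, p / p.sum()

def observe(p, n=N_OBS):
    """Observations of an agent = empirical frequencies of its chosen actions."""
    return rng.multinomial(n, p) / n

# Phase 1: train a utility-inference model on sampled (observations, utility) pairs.
X, Y = [], []
for _ in range(5000):
    u, p = sample_agent()
    X.append(observe(p))
    Y.append(u - u.mean())                      # utilities only matter up to a constant
W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)   # crude linear read-out

# Phase 2: apply the trained inference model to a fresh agent standing in for the human.
u_true, p_true = sample_agent()
u_inferred = observe(p_true) @ W
print(np.corrcoef(u_true - u_true.mean(), u_inferred)[0, 1])    # typically clearly positive
```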
“So overall it seems to me like your approach will still be relying on relatively strong assumptions about transfer learning.”
I don’t think so. An optimal predictor produces estimates which are as good as any efficient algorithm can produce, so if this problem indeed admits an optimal predictor (e.g. because it is generatable) then my agent will succeed at value inference as much as anything can succeed at it.
“In particular, it seems to me that you won’t be able to build an efficient agent out of your approach unless either you solve this [informed oversight] problem...”
The informed oversight problem is easy to solve if U is generatable. In such cases you cannot construct a satisfactory B, but you can still provide A with correct input-output pairs. In particular, it works in your toy model because it’s easy to produce triplets of the form (x,O(zxx),0) and add them to the training set. Actually, it works for exactly the same reason that your A is able to exploit the naive approach.
I agree that you can train a predictor to make an honest effort to predict U, but that is not good enough to train a policy to make an honest effort to maximize U.
You can train a model to predict U(x,y) from x,y. The trained model will in some sense be optimal, but it won’t necessarily be accurate. That is, there need not be any efficient algorithm to distinguish (x,O(zxx)) from (x,y) with U(x,y)=1. (Assuming one-way functions exist.)
The real question is how we train a policy A. If we have a sufficiently good predictor of U then we can use its predictions as a training signal. But in general there does not exist any sufficiently good predictor. So what do we do now?
You are right! It is insufficient for U to be generatable. What we need is the existence of an efficient randomized algorithm G s.t. when executed, G produces a sample of x, then waits to receive an input y, and after receiving it produces an estimate of U(x,y). More formally, G produces a pair (x,p) where p is a program and U(x,y)=E[ev(G()₂;y)∣G()₁=x]. Such a G can be used directly to train a maximizer.
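A minimal sketch of this interface in Python (the hash-based U here is a stand-in I made up just to have a concrete hidden structure; the point is only the shape of the protocol):

```python
import hashlib
import os
import random

def G():
    """Sequential generator: yield a sample x, then receive y and yield an
    estimate of U(x, y).  Here the hidden structure is a secret z with
    x = SHA-256(z), and U(x, y) is whether y equals the first byte of z;
    G can score y because it remembers z, even though no efficient observer
    of x alone could."""
    z = os.urandom(16)
    x = hashlib.sha256(z).digest()
    y = yield x                       # hand x to the policy under training, wait for its output
    yield 1.0 if y == z[0] else 0.0   # estimate of U(x, y)

def random_policy(x):
    return random.randrange(256)

# Training-loop shape: the policy only ever sees x and the resulting reward estimate.
g = G()
x = next(g)
reward = g.send(random_policy(x))
```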
Luckily my approach is within this class (at least superficially; but this lends hope for success after a deeper analysis). When you sample a utility function, a world and an agent and “render” a sequence of observations, you should be able to evaluate the output of the algorithm under training according to the selected utility function.
Actually, I think that if we consider only deterministic maximization policies, then an optimal predictor for U w.r.t. a bounded-Solomonoff-type measure is sufficient to get an optimal maximization policy. In this case we can do maximization using Levin’s universal search L. A significantly better maximization policy M cannot exist, since it would allow us to improve our estimates of U(x,M(x)) and/or U(x,L(x)).
Of course, under standard assumptions about derandomization, the best deterministic policy is about as good as the best randomized policy (on average), and in particular requiring that U is generatable w.r.t. a suitable measure is sufficient. Randomness only gives a significant advantage in adversarial setups: your toy model of informed oversight is essentially an adversarial setup between A and B.
Also, obviously Levin search is grossly inefficient in practice (like the optimal predictor Λ which is basically a variant of Levin search) but this model suggests that applying a more practical learning algorithm would give satisfactory results.
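For concreteness, here is a toy rendering of the Levin-style search referred to above, over a made-up two-instruction language (this is only meant to make the schedule explicit, not to suggest it is practical):

```python
from itertools import product

def run(program, fuel):
    """Run a toy program (a string over {'I': increment, 'D': double}) on an
    accumulator starting at 0, with a step budget `fuel`."""
    acc = 0
    for op in program:
        if fuel == 0:
            return None
        fuel -= 1
        acc = acc + 1 if op == 'I' else acc * 2
    return acc

def levin_search(verifier, alphabet='ID', max_phase=20):
    """Levin-style universal search: in phase t, run every program of length
    <= t for 2**(t - len(p)) steps and return the first output the verifier accepts."""
    for t in range(1, max_phase + 1):
        for length in range(1, t + 1):
            for prog in product(alphabet, repeat=length):
                out = run(prog, fuel=2 ** (t - length))
                if out is not None and verifier(out):
                    return ''.join(prog), out
    return None

# Find a program whose output is 10 (e.g. 'IDDID': 0 -> 1 -> 2 -> 4 -> 5 -> 10).
print(levin_search(lambda out: out == 10))
```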
Also, I think it’s possible to use probabilistic policies which produce computable distributions (since for such a policy outputs with low and high probabilities are always distinguishable).
Yes, this is what you need. I agree that your setup nearly has this property, in that you get out a complete description of the utility function which can be used to evaluate an action.
But my main concern is that even with a complete description of the utility function in hand, you still can’t efficiently evaluate it on possible actions, at least not without some further assumptions about the utility function (which I don’t think hold).
That is, I am concerned that even once you know the utility function, it will have the same character as the U in my toy model (because of logical uncertainty).
As a complexity-theoretically clean example, suppose that I am really unhappy if an output has a steganographically encoded message. Some policies will predictably produce steganographically encoded messages and therefore should receive low utility, but I can’t actually distinguish these messages from random messages.
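To make the example concrete, a small sketch (my own illustration) of why such outputs are hard to flag: with a shared key, a message can be embedded so that the result is a pseudorandom string; without the key, it looks just like a freshly sampled random string.

```python
import hashlib
import hmac
import os

def keystream(key, nonce, length):
    """Pseudorandom keystream derived from a secret key (HMAC-SHA256 in counter mode)."""
    out, counter = b'', 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, 'big'), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def embed(key, message):
    """Steganographically encode `message`: the result should be computationally
    indistinguishable from a uniformly random string to anyone without `key`."""
    nonce = os.urandom(8)
    ks = keystream(key, nonce, len(message))
    return nonce + bytes(m ^ k for m, k in zip(message, ks))

def extract(key, blob):
    nonce, body = blob[:8], blob[8:]
    return bytes(c ^ k for c, k in zip(body, keystream(key, nonce, len(body))))

key = os.urandom(32)
blob = embed(key, b'hidden payload')
assert extract(key, blob) == b'hidden payload'
# An overseer without `key` cannot efficiently distinguish `blob` from os.urandom(len(blob)),
# yet a policy that predictably produces such blobs should receive low utility.
```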
This example is a bit unsatisfying, because steganographically encoding a message is harder than generating a random message. The novelty example in my post is more satisfying in that sense, but it introduces a much smaller complexity gap. It would be interesting to find a plausible example that has the best of both worlds. If there is no example, it would be very interesting to understand why there isn’t.
As I said, I expect it’s not meaningful to consider agents with arbitrary utility functions. As a working hypothesis, we can assume the utility function should be “sequentially generatable” i.e. have the property I previously described. The nice thing is that if we sample sequentially generatable utility functions then the total utility function (which includes utility inference in its definition) is also sequentially generatable.
To see this, consider a simple setup in which our agent A is exposed to observations o of a “teacher” agent B, after which A receives input x and produces output y. The total utility function U has the following “sequential generator” G_U. G_U samples a sequentially generatable utility function V with sequential generator G_V. It then constructs an agent B with utility function V and runs the agent to generate o. It uses G_V to generate x. After receiving y, it uses G_V to compute an estimate of V(x,y), which thus serves as an estimate of U(o,x,y).
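Schematically, and with toy stand-ins of my own choosing for every component (a utility function given by a hidden target vector, a teacher whose observed behavior noisily reveals that target), G_U looks something like this:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4

def G_V():
    """Toy sequential generator for V: sample a hidden target t, produce an input x;
    after receiving an action y, yield an estimate of V(x, y) = exp(-|y - (t + x)|)."""
    t = rng.normal(size=DIM)
    x = rng.normal(size=DIM)
    y = yield t, x                                    # t is exposed only so G_U can build the teacher
    yield float(np.exp(-np.linalg.norm(y - (t + x))))

def teacher_observations(t, n=20):
    """Teacher agent B with utility V: its observed behavior noisily reveals the target t."""
    return t + 0.3 * rng.normal(size=(n, DIM))

def G_U():
    """Sequential generator for the total utility U(o, x, y): sample V via G_V, build and
    run the teacher B to get observations o, hand (o, x) to the trainee, then score its
    output y using G_V's own machinery."""
    g_v = G_V()
    t, x = next(g_v)
    o = teacher_observations(t)
    y = yield o, x
    yield g_v.send(y)

# Usage: train a policy mapping (o, x) -> y against the estimates produced by G_U.
g = G_U()
o, x = next(g)
y = o.mean(axis=0) + x       # a trivially simple "infer the target, then act" policy
print(g.send(y))             # estimate of U(o, x, y); close to 1 for a good policy
```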
It is possible that there is a larger complexity class of admissible utility functions (although it’s also possible that there isn’t), but in that case I hope that the utility inference procedure, or some variant thereof, still preserves this class.
I still agree it would be interesting to construct a complexity-theoretic parallel of your toy model for informed oversight. The following is my attempt to do it (which I’m not sure really works).
Consider a directed graph G in which each k-valent vertex is labeled by a tensor of rank k with rational coefficients (you can also use other coefficients, e.g. a finite field), such that each covariant index of the tensor corresponds to an incoming edge and each contravariant index to an outgoing edge. It is thus possible to contract the tensors according to the graph structure, obtaining a number v(G)∈Q.
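Concretely, v(G) is just a full tensor-network contraction. A sketch with numpy (ignoring the covariant/contravariant distinction, which doesn’t affect the arithmetic):

```python
import numpy as np
from string import ascii_letters

def contract(tensors, incidences):
    """Compute v(G) for a closed tensor network.  tensors[i] is the tensor at vertex i;
    incidences[i] lists, slot by slot, the id of the edge plugged into that slot.
    Every edge id must occur exactly twice in total, so everything gets contracted."""
    edge_letter = {}
    subscripts = []
    for slots in incidences:
        subscripts.append(''.join(
            edge_letter.setdefault(e, ascii_letters[len(edge_letter)]) for e in slots))
    return np.einsum(','.join(subscripts) + '->', *tensors)

# Example: two matrices A and B joined head-to-tail in a 2-cycle, so v(G) = trace(A @ B).
A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])
print(contract([A, B], [[0, 1], [1, 0]]))   # 5.0 == np.trace(A @ B)
```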
There is a natural concept of isomorphism for graphs as above s.t. G≃H implies v(G)=v(H). That is, we consider regular graph isomorphisms where relabeling the edges transforms the tensors in the obvious way (it is also possible to consider a more general type of isomorphism in which we are also allowed to do a “base change” at each edge, but that is not compatible with what follows).
Given such a graph G, it is possible to apply a one-way function to each coefficient of each tensor (otherwise preserving all structure), getting a new graph G′. Obviously G≃H implies G′≃H′.
Suppose the input is a graph G as above and the output is another such graph H. U(G,H)=1 when v(G′)=v(H′) but G is not isomorphic to H, and U(G,H)=0 otherwise.
Clearly, given G it’s easy to construct a random isomorphic H. However, it might be hard to check for this kind of isomorphism (regular graph isomorphism is claimed to be solvable in quasipolynomial time, and as far as we know it might well be solvable in polynomial time). It is also likely to be hard to construct a non-isomorphic H with v(G′)=v(H′).
This is true asymptotically. The same argument can be applied to training an algorithm for pursuing symbolically defined goals, even using conventional techniques. The problem is that quantitatively these arguments have no teeth in the applications we care about, unless we can train on instances as complex as the one we actually care about.
I don’t know if we disagree here. We probably reach different conclusions because of our disagreement about whether you can generate instances of value inferences that are as complex as the real world.
Humans can be motivated to effectively optimize arbitrary abstractly defined goals. And while pursuing such goals they can behave instrumentally efficiently, for example they can fight for their own survival or to acquire resources just as effectively as you or I.
Perhaps you can’t get the particular kind of guarantee you want in these cases, but it seems like we have a proof of concept (at least from a complexity-theoretic perspective).
ETA: Practically speaking I agree that a symbolic specification of preferences does not solve the problem, since such a specification isn’t suitable for training. The important point was that I don’t believe your approach dodges this particular difficulty.
“Humans can be motivated to effectively optimize arbitrary abstractly defined goals.”
I disagree. Humans can optimize only very special goals of this type. Our intuition about formal mathematical statements is calibrated using formal proofs. This means we can only estimate the likelihood of statements with short proofs / refutations.
To make it formal, consider a formal theory T and an algorithm P that, given a certain amount of time k to work, randomly constructs a formal proof / refutation of some sentence (i.e. it starts with the axioms and applies a random* sequence of inference rules). P produces sentences together with a bit that says whether the sentence is provable or refutable. This means P defines a word ensemble (with parameter k) w.r.t. which the partial function that maps ϕ to 1 if ϕ is provable and to 0 if ϕ is refutable is generatable. In particular, we can construct an optimal predictor for this function. However, this optimal predictor will only be well-behaved on sentences that are relatively likely to be produced by P, i.e. sentences with short proofs / refutations. On arbitrary sentences it will do very poorly.
*Perhaps it’s in some sense better to think of P that randomly samples a program which then controls the application of inference rules, but it’s not essential for my point.
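To illustrate the shape of such a P, here is a deliberately crude toy over a Hilbert-style propositional system (with the caveat that modus ponens rarely fires here because instantiations are sampled independently; a better P would bias instantiations toward already-derived formulas):

```python
import random

def rand_formula(depth=2, n_vars=3):
    """Sample a random propositional formula over variables v0..v{n_vars-1}."""
    if depth == 0 or random.random() < 0.4:
        return ('v', random.randrange(n_vars))
    if random.random() < 0.5:
        return ('~', rand_formula(depth - 1, n_vars))
    return ('->', rand_formula(depth - 1, n_vars), rand_formula(depth - 1, n_vars))

def rand_axiom():
    """A random instance of one of the three classical Hilbert axiom schemas."""
    A, B, C = (rand_formula() for _ in range(3))
    return random.choice([
        ('->', A, ('->', B, A)),
        ('->', ('->', A, ('->', B, C)), ('->', ('->', A, B), ('->', A, C))),
        ('->', ('->', ('~', A), ('~', B)), ('->', B, A)),
    ])

def sample_labeled_sentence(k=30):
    """The sampler P: apply k random inference steps (add an axiom instance or apply
    modus ponens to already-derived formulas), then output either a derived formula
    labeled provable (1), or its negation labeled refutable (0)."""
    derived = [rand_axiom()]
    for _ in range(k):
        if random.random() < 0.5:
            derived.append(rand_axiom())
        else:
            a = random.choice(derived)
            for f in derived:                       # modus ponens: from a and a -> b, derive b
                if f[0] == '->' and f[1] == a:
                    derived.append(f[2])
                    break
    phi = random.choice(derived)
    if random.random() < 0.5:
        return phi, 1          # phi is provable
    return ('~', phi), 0       # the negation of a provable formula is refutable

print(sample_labeled_sentence())
```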
I agree that you almost certainly can’t get an optimal predictor. For similar reasons, you can’t train a supervised learner using any obvious approach. This is the reason that I am pessimistic about this kind of “abstract” goal.
That said, I’m not as pessimistic as you are.
Suppose that I define a very elaborate reflective process, which would be prohibitively complex to simulate and whose behavior is probably not constrained in any meaningful way by any short proofs.
I think that a human can in fact try to maximize the output of such a reflective process, “to the best of their abilities.” And this seems good enough for value alignment.
It’s not important that we actually achieve optimality except on shorter-term instrumentally important problems such as gathering resources (for which we can in fact expect the abstractly motivated algorithm to converge to optimality).
I am referring to generatability throughout.
In order to train such an AI, we need to produce data of the form (observations of agents acting in an environment, corresponding learned utility function). In order for the human example to actually be in distribution, you would need these agents + environments to be as sophisticated as humans acting in the real world.
So here is my impression of what you are proposing:
Sample some preferences
Sample a world including an agent imperfectly pursuing those preferences. This world should be as complex as our world.
Simulate some observations of that world
Evaluate some proposed actions according to the original preferences.
This is what seems implausible to me. Perhaps the clearest illustration of the difficulty:
I strongly suspect that we will build powerful AI systems (for which the alignment problem must be resolved) before we are able to simulate anything vaguely human-like, even setting aside the inverse problem (of simulating actual humans) altogether.
This seems pretty safe to me, given that a simulation of a vaguely human-like intelligence would itself constitute such a powerful AI system and there is no reason to think the two capabilities would arise simultaneously.
But in this case, it doesn’t seem like we can implement the proposed training process.
I think it’s reasonable to assume that any agent that meaningfully comprehends human values has fairly good models of humans. The agents on which we train don’t have to be as complex as a low-level emulation of the human brain; they only have to be as complex as the model humans the AGI is going to imagine. Similarly, the worlds on which we train don’t have to be as complex as a low-level emulation of physics; they only have to be as complex as the models that the model humans have of the real world.
This doesn’t seem right. The agents on which we train must look indistinguishable from (a distribution including) humans, from the agent’s standpoint. That is a much higher bar, even complexity-theoretically!
For example, the AI can tell that humans can solve Sudoku puzzles, even if it can’t figure out how to solve Sudoku in a human-like way. The same holds for any aspect of any problem which can be more easily verified than implemented. So it seems that an AI would need most human abilities in order to fool itself.
My concern about this approach is closely related to my concerns about imitation learning. The situation is in some ways better and in some ways worse for your proposal—for example, you don’t get to do meeting halfway, but you do get to use very inefficient algorithms. Note that the imitation learning approach also only needs to produce behavior that is indistinguishable from human as far as the agent can tell, so the condition is really quite similar.
But at any rate, I’m more interested in the informed oversight problem than this difficulty. If we could handle informed oversight, then I would be feeling very good about bootstrapped approval-direction (e.g. ALBA). Also note that if we could solve the informed oversight problem, then bootstrapped imitation learning would also work fine even if imitation learning is inefficient, for the reasons described here.
And if you can avoid the informed oversight problem I would really like to understand how. It looks to me like an essential step in getting an optimal policy in your sense of optimality, for the same reasons that I think it might be an essential part of getting any efficient aligned AI.
Consider an environment which consists of bitstrings x followed by f⁻¹(x), where f is an invertible one-way function. If you apply my version of bounded Solomonoff induction to this environment, the model that will emerge is a model that generates y, computes x=f(y), and outputs x followed by y. Thus y appears in the internal state of the model before x, even though x appears before y in the temporal sequence (and even though physical causality might flow from x to y in some sense).
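To make this concrete, a sketch using modular exponentiation as a stand-in for the invertible one-way function f (inverting it is the discrete-logarithm problem; at this toy modulus that is of course easy, but the shape of the model is the point):

```python
import random

P, G0 = 2**31 - 1, 7   # 7 is a primitive root modulo this Mersenne prime, so f permutes {1, ..., P-1}

def f(y):
    """Toy 'invertible one-way' function: inverting f means taking a discrete logarithm."""
    return pow(G0, y, P)

def sample_environment_model():
    """The model that bounded Solomonoff induction converges to for this environment:
    it samples y first, computes x = f(y), and only then emits x followed by y.
    So y exists in the model's internal state before x, even though x precedes y
    in the observed sequence."""
    y = random.randrange(1, P)
    x = f(y)
    return [x, y]            # emitted in temporal order: x, then y = f^{-1}(x)

print(sample_environment_model())
```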
Analogously, I think that in the Sudoku example the model human will have mental access to a “cheat sheet” for the Sudoku puzzle, generated by the environment but not visible to the AGI.
Some problems have this special structure, but it doesn’t seem like all of them do.
For a really straightforward example, consider a random SAT instance. What kind of cheat sheet is nature going to generate, without explicitly planting a solution? (Planted solutions look different from organic solutions to random instances.)
It might be that most human abilities are just as hard to verify as to execute, or else permit some kind of cheat sheet. (Or more generally, permit a cheat sheet for the part of the problem that is easier to verify than execute.) But at face value I don’t see why this would be.
It’s not clear how you can tell that a magic box solves random SAT instances. Some SAT instances are unsolvable, so the magic box will have to occasionally output ⊥, but it’s not clear how to distinguish between instances that are genuinely unsolvable and instances that you simply cannot solve yourself.
I guess there might be a distributional search problem S s.t.
(i) A solution can be verified in polynomial time.
(ii) Every instance has a solution.
(iii) There is no way to efficiently generate valid input-output pairs obeying a distribution which is indistinguishable from the right distribution.
I see no obvious reason why such an S cannot exist, but I also don’t have an example. It would be interesting to find one.
In any case, it is not clear how the agent can learn to identify models like this. You can only train a predictor on models which admit sampling. On the other hand, if there is some kind of transfer learning that allows generalizing to some class of models which don’t admit sampling, then it’s plausible there is also transfer learning that allows generalizing utility inference from sampleable agents to agents in the larger class. Also, it might be possible to train utility inference on some sort of data produced by the predictor instead of on raw observations, s.t. it’s possible to sample the distribution of this intermediate data corresponding to complex agents.
But it’s hard to demonstrate something concrete without having some model of how the hypothetical “superpredictor” works. As far as I know, it might be impossible to effectively use unsampleable models.
The point was that we may be able to train an agent to do what we want, even in cases where we can’t effectively build a predictor.
Re: your example. You can do amplification to get exponentially close to certainty (choose instances that are satisfiable with 2⁄3 probability, and then consider the problem “solve at least half of these 1000 instances”). If you really want every instance to have a solution, then you can probably generate the instances pseudorandomly from a small enough seed and do a union bound.
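A sketch of that amplified check (the clause density is an arbitrary choice of mine for illustration; the 1000/one-half numbers are the ones from the suggestion above, and verifying a claimed assignment is the easy direction even when solving is hard):

```python
import random

def random_3sat(n_vars=50, n_clauses=200):
    """A random 3-CNF instance: each clause has 3 distinct variables with random signs."""
    return [[v * random.choice((1, -1)) for v in random.sample(range(1, n_vars + 1), 3)]
            for _ in range(n_clauses)]

def satisfies(instance, assignment):
    """Check a claimed assignment (a dict var -> bool); easy even when finding one is hard."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in instance)

def amplified_verifier(instances, claimed, threshold=0.5):
    """Accept iff at least half of the instances come with a claimed assignment that
    actually satisfies them.  If each instance is satisfiable with probability ~2/3,
    an honest solver passes with overwhelming probability, while a box that cannot
    really solve them fails with overwhelming probability."""
    solved = sum(1 for inst, asg in zip(instances, claimed)
                 if asg is not None and satisfies(inst, asg))
    return solved >= threshold * len(instances)

instances = [random_3sat() for _ in range(1000)]
claimed = [None] * len(instances)               # a box that always answers "unsolvable"
print(amplified_verifier(instances, claimed))   # False
```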
By “predictor” I don’t mean something that produces exact predictions; I mean something that produces probabilistic predictions of given quantities. Maybe we should call it an “inductor” to avoid conflation with optimal predictors (even though the concepts are closely related). As I said before, I think that an agent has to have a reasonable model of humans in order to follow human values. Moreover, an agent that doesn’t have a reasonable model of humans is probably much less dangerous, since it won’t be able to manipulate humans (although I guess the risk is still non-negligible).
The question is what kinds of inductors are complexity-theoretically feasible and what class of models these inductors correspond to. Bounded Solomonoff induction using Λ works on the class of sampleable models. In machine learning language, inductors using sampleable models are feasible since it is possible to train the inductors by sampling random such models (i.e. by sampling the bounded Solomonoff ensemble). On the other hand, it’s not clear what broader classes of models are admissible, if any.
That said, it seems plausible that if it’s feasible to construct inductors for a broader class, my procedure will remain efficient.
Model 1: The “superinductor” works by finding an efficient transformation q of the input sequence x and a good sampleable model for q(x). E.g. q(x) contains the results of substituting the observed SAT-solutions into the observed SAT-instances. In this model, we can apply my procedure by running utility inference on q(x) instead of x.
Model 2: The superinductor works via an optimal predictor for SAT*. I think it should be relatively straightforward to show that, given an optimal predictor for SAT and assuming the ability to design AGIs for target utility functions relative to a SAT oracle, there is an optimal predictor for the total utility function my utility inference procedure defines (after including external agents that run with access to a SAT oracle). Therefore it is possible to maximize the latter.
*A (poly,log)-optimal predictor for SAT cannot exist unless all sparse problems in NP have efficient heuristic algorithms in some sense, which is unlikely. On the other hand, there is no reason I know of why a (poly,0)-optimal predictor for SAT cannot exist.