I strongly suspect this is infeasible for fundamental complexity-theoretic reasons.
Humans can be motivated to effectively optimize arbitrary abstractly defined goals. And while pursuing such goals they can behave instrumentally efficiently; for example, they can fight for their own survival or acquire resources just as effectively as you or I.
Perhaps you can’t get the particular kind of guarantee you want in these cases, but it seems like we have a proof of concept (at least from a complexity-theoretic perspective).
ETA: Practically speaking I agree that a symbolic specification of preferences does not solve the problem, since such a specification isn’t suitable for training. The important point was that I don’t believe your approach dodges this particular difficulty.
“Humans can be motivated to effectively optimize arbitrary abstractly defined goals.”
I disagree. Humans can optimize only very special goals of this type. Our intuition about formal mathematical statements is calibrated using formal proofs. This means we can only estimate the likelihood of statements with short proofs / refutations.
To make this formal, consider a formal theory T and an algorithm P that, given a certain amount of time k to work, randomly constructs a formal proof or refutation of some sentence (i.e., it starts with the axioms and applies a random* sequence of inference rules). P outputs the resulting sentence together with a bit saying whether the sentence is provable or refutable. This means P defines a word ensemble (with parameter k) with respect to which the partial function sending ϕ to 0 if ϕ is refutable and to 1 if ϕ is provable is generatable. In particular, we can construct an optimal predictor for this function. However, this optimal predictor will only be well-behaved on sentences that are relatively likely to be produced by P, i.e., sentences with short proofs / refutations. On arbitrary sentences it will do very poorly.
*Perhaps it’s in some sense better to think of P as randomly sampling a program which then controls the application of inference rules, but this isn’t essential to my point.
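To make the construction above more concrete, here is a minimal toy sketch in Python of such a sampler P, for a Hilbert-style classical propositional system (axiom schemas K and S plus double-negation elimination, with modus ponens as the only inference rule). All names here (sample_formula, random_axiom, P) are illustrative and not taken from any existing library; the only point is that P’s output distribution is supported on sentences with derivations of length at most k.

```python
import random

VARS = ("p", "q", "r")
BOT = ("bot",)                    # falsum; "not A" is encoded as A -> bot

def imp(a, b):
    return ("->", a, b)

def neg(a):
    return imp(a, BOT)

def sample_formula(depth=2):
    """Sample a small random formula used to instantiate axiom schemas."""
    if depth == 0 or random.random() < 0.4:
        return random.choice(VARS)
    return imp(sample_formula(depth - 1), sample_formula(depth - 1))

def random_axiom():
    """Instantiate one of the axiom schemas with random small formulas."""
    a, b, c = (sample_formula() for _ in range(3))
    schema = random.randrange(3)
    if schema == 0:                   # K:  A -> (B -> A)
        return imp(a, imp(b, a))
    if schema == 1:                   # S:  (A->(B->C)) -> ((A->B)->(A->C))
        return imp(imp(a, imp(b, c)), imp(imp(a, b), imp(a, c)))
    return imp(neg(neg(a)), a)        # DNE: ~~A -> A

def P(k):
    """Run a random k-step derivation and return (sentence, label),
    where label 1 means 'provable' and label 0 means 'refutable'."""
    derived = []
    for _ in range(k):
        if derived and random.random() < 0.5:
            # Try modus ponens on two previously derived theorems.
            # With this naive random matching it rarely fires, but when it
            # does, the conclusion is again a theorem.
            x, y = random.choice(derived), random.choice(derived)
            if isinstance(y, tuple) and y[0] == "->" and y[1] == x:
                derived.append(y[2])
                continue
        derived.append(random_axiom())
    phi = random.choice(derived)      # every derived formula is a theorem
    if random.random() < 0.5:
        return phi, 1                 # provable sentence
    return neg(phi), 0                # negation of a theorem is refutable

if __name__ == "__main__":
    random.seed(0)
    sentence, label = P(50)
    print(label, sentence)
```

Note that a sentence can only appear in the support of P with parameter k if it (or its negation) has a derivation of at most k steps, which is exactly why the corresponding optimal predictor is only guaranteed to behave well on sentences with short proofs / refutations.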
I agree that you almost certainly can’t get an optimal predictor. For similar reasons, you can’t train a supervised learner using any obvious approach. This is the reason that I am pessimistic about this kind of “abstract” goal.
That said, I’m not as pessimistic as you are.
Suppose that I define a very elaborate reflective process, which would be prohibitively complex to simulate and whose behavior is probably not constrained in any meaningful way by any short proofs.
I think that a human can in fact try to maximize the output of such a reflective process, “to the best of their abilities.” And this seems good enough for value alignment.
It’s not important that we actually achieve optimality, except on shorter-term, instrumentally important problems such as gathering resources (for which we can in fact expect the abstractly motivated algorithm to converge to optimality).