Certainly, if you just have access to a weaker policy, this doesn’t make the problem any easier. If you could take a weak policy and amplify it into a stronger policy efficiently, then you could just repeatedly apply this policy-improvement operator to some very weak base policy (say, a neural net with random weights) to solve the full problem. (If you have a much stronger aligned base policy, e.g. the human policy with short inputs and over a short time horizon, then this assumption is more powerful.) The more interesting assumption is that you have lots of time and compute, which does seem to have a lot of potential. I feel pretty optimistic that a human thinking for a long time could reach “superhuman performance” by our current standards, though of course capability amplification asks for a stronger guarantee: can we do this in a particular structured way?
We say that A ⪰ B if we are at least as happy with policy A as with policy B (in any situation that we think might arise in practice).
This sounds like a partial order to me. But then:
C is reachable from A if there is a chain of policies in 𝒜 which starts at A and ends at C, and where each policy in the chain is no better than the amplification of the previous policy.
I interpret this as saying: If (A,B) form part of the chain, then B≯A+. But I believe that the property we want is A+≥B, which is a different condition if ≥ defines a partial ordering. Does that seem right to you?
I might rephrase it as “where the amplification of each policy in the chain is at least as good as the subsequent policy”.
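To make the gap between the two readings concrete, here is a toy sketch. It assumes (purely for illustration) that a policy is a tuple of skill scores and that "at least as good" means componentwise ≥; neither assumption comes from the post. With incomparable policies, "B is not strictly better than A+" holds while "A+ is at least as good as B" fails.

```python
# Toy model (an assumption for illustration): a policy is a tuple of skill
# scores, and "at least as good" is the componentwise partial order.

def at_least_as_good(p, q):
    """True if p scores at least as high as q on every skill."""
    return all(x >= y for x, y in zip(p, q))

def strictly_better(p, q):
    """True if p is at least as good as q, but not the other way round."""
    return at_least_as_good(p, q) and not at_least_as_good(q, p)

a_plus = (2, 0)  # hypothetical amplification of some policy A
b = (0, 3)       # hypothetical next policy in the chain

# Reading 1: "B is no better than A+", i.e. B is not strictly better than A+.
b_no_better_than_a_plus = not strictly_better(b, a_plus)

# Reading 2: "A+ is at least as good as B".
a_plus_dominates_b = at_least_as_good(a_plus, b)

print(b_no_better_than_a_plus, a_plus_dominates_b)
```

The two policies are incomparable, so the first reading admits the pair (A, B) into a chain while the second rejects it.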
We say that C is reachable from A if:
A⁺ ⪰ C, where A⁺ is the amplification as described in the last section; or
There is an intermediate B ∈ 𝓐 which is reachable from A and which can reach C.
It took me a while to realize why you went with this definition. I thought you were going for a simple recursive definition, in which case you could define C to be reachable from A if A≥C, or if C is reachable from A+. Equivalently, there is a chain of amplifications of A such that the resulting policy dominates C. The problem with this definition is that there isn’t a corresponding notion of obstructions for my definition, because it isn’t transitive. It is possible to have B reachable from A, and C reachable from B, but not C reachable from A.
On the other hand, I believe your definition is the transitive closure of the relation R, where (A,C)∈R iff A+≥C, and so a notion of obstructions comes out naturally.
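Here is a small sketch of the difference, under toy assumptions that are mine rather than the post's: policies are tuples of skill scores, ≥ is componentwise, and amplification is a hand-written lookup table (`AMPLIFY`). It computes the transitive closure of R and compares it with the simple recursive definition, which fails to be transitive on this example.

```python
from itertools import product

# Hypothetical toy policies, each a tuple of skill scores.
A, A_PLUS, B, C = (1, 0), (2, 0), (0, 0), (0, 3)
POLICIES = [A, A_PLUS, B, C]

# Hypothetical amplification operator as a lookup table (not from the post).
AMPLIFY = {A: A_PLUS, A_PLUS: A_PLUS, B: C, C: C}

def at_least_as_good(p, q):
    """Componentwise partial order standing in for the post's ordering."""
    return all(x >= y for x, y in zip(p, q))

def one_step(p, q):
    """The relation R: (p, q) is in R iff the amplification of p dominates q."""
    return at_least_as_good(AMPLIFY[p], q)

def transitive_closure(policies, rel):
    """Naive fixed-point computation of the transitive closure of rel."""
    reach = {(p, q) for p, q in product(policies, repeat=2) if rel(p, q)}
    changed = True
    while changed:
        changed = False
        for p, q, r in product(policies, repeat=3):
            if (p, q) in reach and (q, r) in reach and (p, r) not in reach:
                reach.add((p, r))
                changed = True
    return reach

def chain_reachable(p, q):
    """Simple recursive definition: q is reachable from p if p dominates q,
    or if q is reachable from the amplification of p."""
    seen = set()
    while p not in seen:
        if at_least_as_good(p, q):
            return True
        seen.add(p)
        p = AMPLIFY[p]
    return False

REACH = transitive_closure(POLICIES, one_step)

print(chain_reachable(A, B), chain_reachable(B, C), chain_reachable(A, C))
print((A, C) in REACH)
```

In this toy, B is reachable from A and C from B under the simple definition, but C is not reachable from A, because repeatedly amplifying A never improves the second skill. Under the closure definition, C is reachable from A via B.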
Analogously, we say that a function L: 𝓐 → ℝ is an obstruction if our amplification procedure cannot always increase L.
… to its maximal value in 𝓐. (Obvious, but worth saying.)
An easy way to deal with this difficulty is to replace ‘at least as happy with policy A as with policy B (in any situation that we think might arise in practice)’ with ‘at least as happy with policy A as with policy B (when averaged over the distribution of situations that we expect to arise)’, though this is clearly much weaker.
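A rough sketch of how the two orderings can disagree, assuming (hypothetically) that each policy is summarized by one score per situation and that we know a probability for each situation; the numbers below are made up for illustration.

```python
# Hypothetical distribution over three situations (an assumption).
SITUATION_PROBS = [0.7, 0.2, 0.1]

def pointwise_at_least(a, b):
    """Strong ordering: a is at least as good as b in every situation."""
    return all(x >= y for x, y in zip(a, b))

def average_at_least(a, b, probs=SITUATION_PROBS):
    """Relaxed ordering: a is at least as good as b in expectation."""
    def mean(scores):
        return sum(p * x for p, x in zip(probs, scores))
    return mean(a) >= mean(b)

# Made-up per-situation scores for two policies.
policy_a = [5.0, 5.0, 0.0]  # strong in common situations, fails in the rare one
policy_b = [4.0, 4.0, 4.0]  # uniformly decent

print(pointwise_at_least(policy_a, policy_b))  # strong ordering
print(average_at_least(policy_a, policy_b))    # averaged ordering
```

Here policy A wins on average but is not at least as good in every situation, which is the sense in which the averaged ordering is weaker: it can prefer a policy that fails badly in a rare case.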
To me it seems that this stronger sense of ordering is used because we expect the amplification procedure to produce an A+ that is strictly better than A. But even if this weren’t the case, the concept of an obstruction would still be a useful one. Perhaps it would be reasonable to take the more relaxed definition but expect that amplification would produce results that are strictly better.
I also agree with Chris below that defining an obstruction in terms of this ‘better than’ relation brings in serious difficulty. There are exponentially many policies B that are no better than A+, and there may well be a subset of these that can be amplified beyond A+, but as far as I can tell there’s no clear way to identify them. We thus have an exponential obstacle to progress even within a partition, necessitating a stronger definition.