These cases can be simulated by a brute force search for solutions, perhaps using samples generated by the human proposal distribution. But then we are introducing a powerful optimization, which may, e.g., turn up an attack on the solution-evaluating process. The solution-evaluating process can be much “larger” than the brute force search, so the question is really whether with amplification we can construct a sufficiently secure solution-evaluator.
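(For concreteness, here is a minimal sketch of the scheme described above; `sample_proposal` and `amplified_evaluate` are hypothetical stand-ins for the human proposal distribution and the amplification-built solution-evaluator, and nothing in this sketch says how that evaluator is made secure.)

```python
def brute_force_search(sample_proposal, amplified_evaluate, n_samples=10_000):
    """Draw candidate solutions from the (hypothetical) human proposal
    distribution and keep the best one according to the (hypothetical)
    amplification-built evaluator.  The evaluator's security against
    adversarial candidates is exactly the open question in the text."""
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        candidate = sample_proposal()
        score = amplified_evaluate(candidate)   # the step whose security is at issue
        if score > best_score:
            best, best_score = candidate, score
    return best
```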
I’m actually not sure the brute force search gives you what you’re looking for here. There needs to be an ordering on solutions-to-evaluate such that you can ensure the evaluators are pointed at different solutions and cover the whole solution space (this is often possible, but not always; consider solutions with real variables, where a simple discretization is not obviously valid). Even if this is the case, it seems like you’re giving up on being competitive on speed by saying “well, we could just use brute force search.” (It also seems to me like you’re giving up on safety, as you point out later; one of the reasons why heuristic search methods for optimization seem promising to me is that you can also be doing safety-evaluation work there, such that more dangerous solutions are less likely to be considered in the first place.)
My intuition is that many numerical optimization search processes have “wide” state, in that you are simultaneously tracking the place where you are right now, the places you’ve been before, and the previous judgments you’ve made about where to go next. Sometimes this state is not actually wide because it can be compressed very nicely; for example, in the simplex algorithm, my state is entirely captured by the tableau, and I can spin up different agents to take the tableau, move it forward one step, and then pass the problem along to another agent. But my intuition is that those will be the cases where we’re not really concerned about daemons or misalignment of the optimization process itself, because the whole procedure is simple enough that we understand how everything fits together.
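(To illustrate the compressible case: a rough sketch, not production LP code, of the simplex relay, where each short-lived worker receives the standard tableau, does one pivot under the textbook Dantzig rule, and hands the result to the next worker; unboundedness handling is reduced to a bail-out.)

```python
import numpy as np

def one_pivot(tableau):
    """One short-lived worker's entire job: take the current simplex tableau
    (the whole state of the search), perform a single pivot, and return the
    updated tableau plus a flag saying whether we're finished."""
    obj = tableau[-1, :-1]
    if np.all(obj >= -1e-12):          # no improving column: optimal
        return tableau, True
    col = int(np.argmin(obj))          # entering variable (Dantzig rule)
    positive = tableau[:-1, col] > 1e-12
    if not positive.any():             # unbounded; just stop in this sketch
        return tableau, True
    ratios = np.full(tableau.shape[0] - 1, np.inf)
    ratios[positive] = tableau[:-1, -1][positive] / tableau[:-1, col][positive]
    row = int(np.argmin(ratios))       # leaving variable (ratio test)
    tableau = tableau.copy()
    tableau[row] /= tableau[row, col]
    for r in range(tableau.shape[0]):
        if r != row:
            tableau[r] -= tableau[r, col] * tableau[row]
    return tableau, False

def relay(tableau, max_workers=1000):
    """Hand the tableau from one fresh 'agent' to the next until done."""
    for _ in range(max_workers):
        tableau, done = one_pivot(tableau)
        if done:
            break
    return tableau
```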
But if the state is wide or the computation is deep, then it seems like this strategy is probably going to run into obstacles. Either we attempt to implement something deep as the equivalent of recursive function calls, or we discover that we have too much state to successfully pass around, and thus there’s not really a meaningful sense in which we can have separate short-lived agents (or not really a meaningful sense in which we can be competitive with agents that do maintain all that state).
For example, think about implementing tree search for games in this way. No one agent sees the whole tree; each agent only determines which children to pass messages to and what message to return to its parent. If we think that the different branches are totally distinct from each other, then we only need vertical message-passing and we can have separate short-lived agents (although it’s sort of hard to see the difference between an agent that’s implementing tree search in one thread versus many threads, because of how single agents can implement recursive functions). But if we think that the different branches are mutually informative, then we want to have a linkage between those branches, which means horizontal links in this tree. (To be clear, AlphaGo has every node call an intuition network that is only trained between games, and thus could be implemented in a ‘vertical’ fashion if you have the intuition network as part of the state of each short-lived agent; but you could imagine an improvement on AlphaGo that’s refining its intuition as it considers branches in the game that it’s playing, and that couldn’t be implemented without this horizontal linkage.)
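(A minimal sketch of the “vertical only” version: each recursive call plays the role of one short-lived agent that sees only its own node, asks fresh sub-agents about its children, and reports a single value to its parent; `expand` and `evaluate` are placeholders for a move generator and a leaf/intuition evaluator, not anything from AlphaGo itself.)

```python
def vertical_search(state, depth, expand, evaluate):
    """Tree search with only vertical message-passing.  No information flows
    between sibling branches; each call could be a separate short-lived agent,
    which is exactly why it looks like an ordinary recursive function."""
    children = expand(state)
    if depth == 0 or not children:
        return evaluate(state)          # leaf: report the 'intuition' value
    # negamax convention: a child's value is from the opponent's perspective
    return max(-vertical_search(child, depth - 1, expand, evaluate)
               for child in children)
```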
My sense is that the sorts of creative scientific or engineering problems that we’re most interested in are ones where this sort of wide state is relevant and not easily compressible, such that I could easily imagine a world where it takes the scientist a week to digest everything that’s happened so far, leaving no time to actually move things forward before vanishing and being replaced by a new scientist who spends a week digesting everything, and so on.
As a side note, I claim the ‘recursive function’ interpretation implies that the alignment of the individual agents is irrelevant (so long as they faithfully perform their duties) and that the question of whether tree search was the right approach (and whether the leaf evaluation function is good) becomes central to evaluating alignment. This might be something like one of my core complaints: it seems like we’re just passing the alignment buck to the strategy for integrating many small bits of computation into a big bit of computation, and that problem seems just as hard as the regular alignment problem.
Even if this is the case, it seems like you’re giving up on being competitive on speed by saying “well, we could just use brute force search.”
The efficiency of the hypothetical amplification process doesn’t have much direct effect on the efficiency of the training process. It affects the number of “rounds” of amplification you need to do, but the rate is probably limited mostly by the ability of the underlying ML to learn new stuff.
There needs to be an ordering on solutions-to-evaluate such that you can ensure the evaluators are pointed at different solutions and cover the whole solution space
You can pick randomly.
(It also seems to me like you’re giving up on safety, as you point out later; one of the reasons why heuristic search methods for optimization seem promising to me is that you can also be doing safety-evaluation work there, such that more dangerous solutions are less likely to be considered in the first place.)
I agree that this merely reduces the problem of “find a good solution” to “securely evaluate whether a solution is good” (that’s what I was saying in the grandparent).
or we discover that we have too much state to successfully pass around, and thus there’s not really a meaningful sense in which we can have separate short-lived agents
The idea is to pass around state by distributing it across a large number of agents. Of course it’s an open question whether that works; that’s what we want to figure out.
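(A toy sketch of what “distributing state across many agents” could look like, with each agent holding only one shard of a store too big for any single agent; this is purely illustrative and says nothing about whether real scientific understanding shards this way.)

```python
class ShardAgent:
    """One short-lived agent that knows only a small piece of the total state."""
    def __init__(self, shard):
        self.shard = shard              # e.g. a dict of facts this agent holds

    def answer(self, key):
        return self.shard.get(key)      # None means "not my piece of the state"

def distributed_lookup(agents, key):
    """No single agent holds the whole state; a coordinator just asks around.
    Whether interesting state can be sharded like this is the open question."""
    for agent in agents:
        result = agent.answer(key)
        if result is not None:
            return result
    return None
```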
(or not really a meaningful sense in which we can be competitive with agents that do maintain all that state)
Again, the hypothetical amplification process is not intended to be competitive, that’s the whole point of iterated amplification.
But if we think that the different branches are mutually informative, then we want to have a linkage between those branches, which means horizontal links in this tree
Only if we want to be competitive. Otherwise you can simulate horizontal links by just running the entire other subtree in a subcomputation. In the case of iterated amplification, that couldn’t possibly change the speed of the training process, since only O(1) nodes are actually instantiated at a time anyway and the rest are distilled into the neural network. What would a horizontal link mean?
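(A rough sketch of the “simulate horizontal links vertically” move: before an agent descends into one child, it re-runs the sibling subtrees as a subcomputation and hands their results down as extra context, rather than letting branches talk to each other directly. The recomputation blows up the cost enormously, which is fine given that the hypothetical amplification process isn’t meant to be competitive. `expand` and `evaluate(state, context)` are hypothetical placeholders.)

```python
def search_with_simulated_links(state, depth, expand, evaluate, context=None):
    """Vertical-only tree search in which 'horizontal' information is
    recovered by recomputing sibling subtrees inside a subcomputation and
    passing the summary downward.  Each sibling is re-searched from scratch
    at every node that wants to know about it, so this is maximally
    uncompetitive; the point is only that no direct sibling link is needed."""
    children = expand(state)
    if depth == 0 or not children:
        return evaluate(state, context)
    values = []
    for i, child in enumerate(children):
        # subcomputation: re-derive what the other branches would conclude
        sibling_values = [
            search_with_simulated_links(s, depth - 1, expand, evaluate, context)
            for j, s in enumerate(children) if j != i
        ]
        values.append(-search_with_simulated_links(
            child, depth - 1, expand, evaluate, context=sibling_values))
    return max(values)
```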
the intuition network as part of the state of each short-lived agent
The intuition network is a distillation of the vertical tree; it’s not part of the amplification process at all.
and that couldn’t be implemented without this horizontal linkage
I don’t think that’s right; also, I don’t see what a ‘horizontal’ linkage would add compared to a normal vertical linkage: just unroll the computation.
are ones where this sort of wide state is relevant and not easily compressible
The main thing I’m looking for is examples of particular kinds of state that you think are incompressible. For example, do you think modern science has developed kinds of understanding that couldn’t be distributed across many short-lived individuals (in a way that would let you e.g. use that knowledge to answer questions that a long-lived human could answer using that knowledge)?
Last time this came up Eliezer used the example of calculus. But I claim that anything you can formalize can’t possibly have this character, since you can distribute those formal representations quite easily, with the role of intuition being to quickly reach conclusions that would take a long time using the formal machinery. That’s exactly the case where amplification works well. (This then led to the same problem with “if you just manipulate things formally, how can you tell that the hypothesis is just making predictions rather than doing something evil, e.g. can you tell that the theory isn’t itself an optimizer?”, which is what I mentioned in the grandparent.)
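(As a toy illustration of why formal content distributes easily: in the sketch below, each short-lived “agent” checks a single inference step in isolation, given only that step’s premises, rule, and conclusion, and never needs to hold the whole derivation. The proof system here is an invented toy, used only for its shape.)

```python
def modus_ponens(premises):
    """Toy rule: from ('->', p, q) and p, conclude q."""
    (op, p, q), x = premises
    return q if op == "->" and x == p else None

RULES = {"mp": modus_ponens}

def check_step(step):
    """One short-lived agent verifies one step without seeing the rest."""
    rule, premises, claimed_conclusion = step
    return RULES[rule](premises) == claimed_conclusion

def check_proof(steps):
    """The derivation is valid iff every independently checked step is valid."""
    return all(check_step(step) for step in steps)

# e.g. check_proof([("mp", [("->", "p", "q"), "p"], "q")]) returns True
```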