There’s an old SlateStarCodex post that’s a reasonable intuition pump for my perspective. It seems to me that the HCH-scientists’ epistemic process is fundamentally similar to that of the alchemists. And the alchemists’ thoughts were constrained by their lifespans, which they partially overcame by distilling past insights to future generations of alchemists. But there still remained massive constraints on their thoughts, and I imagine qualitatively similar constraints being present for HCHs.
I also imagine them to be far more constraining if “thought-lifespans” shrank from ~30 years to ~30 minutes. But “thought-lifespans” on the order of ~1 week might be long enough that the overhead from learning distilled knowledge (knowledge = intellectual progress from other parts of the HCH, representing maybe decades or centuries of human reasoning) is small enough (on the order of a day or two?) that individual scientists can hold in their heads all the intellectual progress made thus far and make useful progress on top of that, without any knowledge having to be distributed across human transistors.
In order for this to work, you need to be able to break apart the representation of the knowledge as well as the actual work they are doing. For example, you need to be able to pass around objects like “The theory that reality is the unique object satisfying both constraints {A} and {B}”, with one person responsible for representing {A} and another responsible for representing {B}.
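To make that picture a bit more concrete, here is a toy sketch in Python (the names ConstraintHolder and DistributedTheory are entirely hypothetical, just for illustration): the coordinator only holds opaque handles to the people responsible for {A} and {B}, and only narrow queries and answers get passed around, never the constraints themselves.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy sketch only; the class and function names are hypothetical.  The point is
# that the coordinator holds opaque handles to people, each of whom privately
# represents one constraint, and only small messages ever get passed around.

@dataclass
class ConstraintHolder:
    name: str
    check: Callable[[Dict], bool]    # this person's private representation of {A} or {B}

    def satisfied_by(self, observation: Dict) -> bool:
        # Answers a narrow query; the constraint itself never leaves this holder.
        return self.check(observation)

@dataclass
class DistributedTheory:
    """'Reality is the unique object satisfying all of these constraints',
    with each constraint held by a different person."""
    holders: List[ConstraintHolder]

    def consistent_with(self, observation: Dict) -> bool:
        return all(h.satisfied_by(observation) for h in self.holders)

# One person responsible for {A}, another for {B}.
A = ConstraintHolder("A", lambda obs: obs["energy"] >= 0)
B = ConstraintHolder("B", lambda obs: obs["charge"] in (-1, 0, 1))
theory = DistributedTheory([A, B])
print(theory.consistent_with({"energy": 1.2, "charge": 0}))   # True
```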
My impression of your concern is that, if knowledge is represented this way instead of in a particular scientist’s head, then they can’t manipulate it well without being transistors.
Do you have some particular kinds of manipulation in mind, that humans are able to do with knowledge in their head, but you don’t think a group of humans can do if the knowledge is distributed across all of them?
One family of concerns people have raised is about the optimization done within amplification:
Sometimes humans solve problems with a stroke of creative insight. These cases can be simulated by a brute force search for solutions, perhaps using samples generated by the human proposal distribution. But then we are introducing a powerful optimization, which may e.g. turn up an attack on the solution-evaluating process. The proposal-evaluating process can be much “larger” than the brute force search, so the question is really whether with amplification we can construct a sufficiently secure solution-evaluator. I think the most interesting question for security there is whether the “evaluate a solution” process is itself decomposable with low bandwidth oversight (though there are other ways that security could be unachievable).
If they need to represent a hypothesis about reality by doing purely mechanical calculations and observing that it predicts well, then maybe that theory will be an optimization daemon. I think there are cases of “opaque” hypotheses where humans can’t break up the internal structure. But an optimization daemon has to actually think thoughts, including thoughts about how to e.g. subvert the system. So it seems to me that as long as understanding those thoughts is a task that is decomposable, we can defend against optimization daemons by looking over a hypothesis and evaluating whether it’s doing anything bad.
In these cases, it seems to me like the putatively indecomposable task is OK, as long as you can solve some other tasks by amplification (doing secure evaluation of proposed solutions, evaluating a hypothesis to test if it is doing problematic optimization). In these cases, it seems to me like the constituent tasks are easier in a qualitative sense (e.g. if I do some search and want to evaluate whether a hypothesis is a daemon, I’m only going to have to do easier searches within that evaluation—namely, the kinds of searches that are done internally by the daemon in order to make sense of the world), such that we aren’t going to get a loop and can carry out an induction.
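To make the first case concrete, here is a purely illustrative sketch (propose, evaluate, and the toy objective are stand-ins I’m introducing, not part of the proposal): the creative step is replaced by sampling from a proposal distribution, and all of the trust is pushed into the evaluator.

```python
import random

def propose(rng: random.Random) -> int:
    """Stand-in for the human proposal distribution (here just a random integer)."""
    return rng.randint(0, 10**6)

def evaluate(candidate: int, target: int = 314_159) -> float:
    """Stand-in for the solution evaluator.  In the actual scheme this would
    itself be an amplified process, and securing it is the hard part."""
    return -abs(candidate - target)

def brute_force_search(n_samples: int = 100_000, seed: int = 0) -> int:
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        c = propose(rng)
        s = evaluate(c)              # all of the trust lives here, not in the search
        if s > best_score:
            best, best_score = c, s
    return best

print(brute_force_search())
```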
Another family of concerns is that humans have indecomposable abilities:
Perhaps a human has learned to do task X, and a good algorithm for X is now encoded in the weights of their brain, and can only be used by running their brain on the same inputs they encountered while learning to do task X. (Thanks to Wei Dai for pointing out this tight impossibility argument, and I discussed it a bit under “An Example Obstruction” in the original post.) In particular, there is no way to get access to this knowledge with low bandwidth oversight. In the case of scientific inquiry, accessing the scientist’s training may require having the human actually hold an entire scientific hypothesis in their head.
In this case we can’t recover “ability at task X” by amplification except by redoing it from scratch. If the human’s knowledge about task X depended on facts about the external world, then we can’t recover that knowledge except by interacting with the external world.
But we already knew that amplification wasn’t going to encode empirical knowledge about the world without interacting with the world; the point was to converge to a good policy for handling empirical data as it comes in. The real question is whether HCH converges to arbitrarily sophisticated behavior in the limit. To answer that question we’d want to ask: if the human had never trained to do task X, would they still be “universal” in some appropriate sense?
To answer that question, our example of something indecomposable can’t just be a task where empirical information about the world (or logical information too expensive to be learned via the amplification process) is encoded in the human’s brain, because we are happy to drop empirical information about the world and instead learn a policy that maps {data} --> {behavior}, and give that policy access to all the empirical information it needs.
Does your concern fit in one of those two categories, or in some different category?
These cases can be simulated by a brute force search for solutions, perhaps using samples generated by the human proposal distribution. But then we are introducing a powerful optimization, which may e.g. turn up an attack on the solution-evaluating process. The proposal-evaluating process can be much “larger” than the brute force search, so the question is really whether with amplification we can construct a sufficiently secure solution-evaluator.
I’m actually not sure the brute force search gives you what you’re looking for here. There needs to be an ordering on solutions-to-evaluate such that you can ensure the evaluators are pointed at different solutions and cover the whole solution space (this is often, but not always, possible; consider solutions with real variables where a simple discretization is not obviously valid). Even if this is the case, it seems like you’re giving up on being competitive on speed by saying “well, we could just use brute force search.” (It also seems to me like you’re giving up on safety, as you point out later; one of the reasons why heuristic search methods for optimization seem promising to me is that you can also be doing safety-evaluation effort there, such that more dangerous solutions are less likely to be considered in the first place.)
My intuition is that many numerical optimization search processes have “wide” state, in that you are thinking about the place where you are right now, the places you’ve been before, and previous judgments you’ve made about places to go. Sometimes this state is not actually wide because it can be compressed very nicely; for example, in the simplex algorithm, my state is entirely captured by the tableau, and I can spin up different agents to take the tableau, move it forward one step, and then pass the problem along to another agent. But my intuition is that such times will be times when we’re not really concerned about daemons or misalignment of the optimization process itself, because the whole procedure is simple enough that we understand how everything works together.
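Here is a minimal sketch of the simplex example (assuming a standard-form LP, maximize c.x subject to A x <= b with x >= 0 and b >= 0, and ignoring degeneracy): each call to pivot_step plays the role of a fresh short-lived agent that receives the entire tableau, performs one pivot, and hands the updated tableau to the next agent.

```python
import numpy as np

def build_tableau(c, A, b):
    """Tableau for: maximize c.x  subject to  A x <= b, x >= 0 (with b >= 0)."""
    m, n = A.shape
    T = np.zeros((m + 1, n + m + 1))
    T[:m, :n] = A
    T[:m, n:n + m] = np.eye(m)       # slack variables
    T[:m, -1] = b
    T[-1, :n] = -c                   # objective row
    return T

def pivot_step(T):
    """One short-lived 'agent': sees only the tableau, does one pivot,
    and passes the whole state forward.  Returns (new_tableau, done)."""
    obj = T[-1, :-1]
    if np.all(obj >= -1e-9):
        return T, True               # optimal: no improving column left
    col = int(np.argmin(obj))        # entering variable
    ratios = np.where(T[:-1, col] > 1e-9, T[:-1, -1] / T[:-1, col], np.inf)
    if np.all(np.isinf(ratios)):
        raise ValueError("LP is unbounded")
    row = int(np.argmin(ratios))     # leaving variable
    T = T.copy()
    T[row] /= T[row, col]
    for r in range(T.shape[0]):
        if r != row:
            T[r] -= T[r, col] * T[row]
    return T, False

# maximize 3x + 2y  subject to  x + y <= 4, x + 3y <= 6, x, y >= 0
T = build_tableau(np.array([3.0, 2.0]),
                  np.array([[1.0, 1.0], [1.0, 3.0]]),
                  np.array([4.0, 6.0]))
done = False
while not done:
    T, done = pivot_step(T)          # a fresh agent takes over at every step
print("optimal value:", T[-1, -1])   # 12.0
```

Nothing outside the tableau needs to survive between agents, which is exactly the sense in which this state compresses nicely.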
But if it is wide or deep, then it seems like this strategy is probably going to run into obstacles. We either attempt to implement something deep as the equivalent of recursive function calls, or we discover that we have too much state to successfully pass around, and thus there’s not really a meaningful sense in which we can have separate short-lived agents (or not really a meaningful sense in which we can be competitive with agents that do maintain all that state).
For example, think about implementing tree search for games in this way. No one agent sees the whole tree; each one only determines which children to pass messages to and what message to return to its parent. If we think that the different branches are totally distinct from each other, then we only need vertical message-passing and we can have separate short-lived agents (although it’s sort of hard to see the difference between an agent that’s implementing tree search in one thread and in many threads, because of how single agents can implement recursive functions). But if we think that the different branches are mutually informative, then we want to have a linkage between those branches, which means horizontal links in this tree. (To be clear, AlphaGo has every node call an intuition network which is only trained between games, and thus could be implemented in a ‘vertical’ fashion if you have the intuition network as part of the state of each short-lived agent, but you could imagine an improvement on AlphaGo that’s refining its intuition as it considers branches in the game that it’s playing, and that couldn’t be implemented without this horizontal linkage.)
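Here is a minimal sketch of the purely ‘vertical’ version for a toy game (a subtraction game where players alternately remove 1 or 2 stones and taking the last stone wins; the game and the name node_agent are just for illustration). Each call sees only its own position, sends positions down to its children, and returns a single value to its parent; there are no messages between sibling subtrees.

```python
def moves(n: int):
    """Legal moves: remove 1 or 2 stones (toy stand-in for a real game)."""
    return [m for m in (1, 2) if m <= n]

def node_agent(n: int) -> int:
    """One short-lived 'agent' per node.  It only sees its own position,
    passes positions down to its children (vertical messages only), and
    returns +1/-1 to its parent: the value for the player to move."""
    if n == 0:
        return -1                    # the previous player took the last stone; we lose
    return max(-node_agent(n - m) for m in moves(n))

for pile in range(1, 7):
    print(pile, "win" if node_agent(pile) == 1 else "loss")
# Positions that are multiples of 3 are losses for the player to move.
```

The hypothetical improvement described above, where intuition is refined across sibling branches during a single search, would require those siblings to share state, which this vertical pattern doesn’t express.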
My sense is that the sorts of creative scientific or engineering problems that we’re most interested in are ones where this sort of wide state is relevant and not easily compressible, such that I could easily imagine a world where it takes the scientist a week to digest everything that’s happened so far, and then they don’t have any time to actually move things forward before vanishing and being replaced by another scientist who spends a week digesting everything, and so on.
As a side note, I claim the ‘recursive function’ interpretation implies that the alignment of the individual agents is irrelevant (so long as they faithfully perform their duties) and the question of whether tree search was the right approach (and whether the leaf evaluation function is good) becomes central to evaluating alignment. This might be something like one of my core complaints, that it seems like we’re just passing the alignment buck to the strategy of how to integrate many small bits of computation into a big bit of computation, and that problem seems just as hard as the regular alignment problem.
Even if this is the case, it seems like you’re giving up on being competitive on speed by saying “well, we could just use brute force search.”
The efficiency of the hypothetical amplification process doesn’t directly have much effect on the efficiency of the training process. It affects the number of “rounds” of amplification you need to do, but the rate is probably limited mostly by the ability of the underlying ML to learn new stuff.
There needs to be an ordering on solutions-to-evaluate such that you can ensure the evaluators are pointed at different solutions and cover the whole solution space
You can pick randomly.
(It also seems to me like you’re giving up on safety, as you point out later; one of the reasons why heuristic search methods for optimization seem promising to me is that you can also be doing safety-evaluation effort there, such that more dangerous solutions are less likely to be considered in the first place.)
I agree that this merely reduces the problem of “find a good solution” to “securely evaluate whether a solution is good” (that’s what I was saying in the grandparent).
or we discover that we have too much state to successfully pass around, and thus there’s not really a meaningful sense in which we can have separate short-lived agents
The idea is to pass around state by distributing it across a large number of agents. Of course it’s an open question whether that works, that’s what we want to figure out.
(or not really a meaningful sense in which we can be competitive with agents that do maintain all that state)
Again, the hypothetical amplification process is not intended to be competitive, that’s the whole point of iterated amplification.
But if we think that the different branches are mutually informative, then we want to have a linkage between those branches, which means horizontal links in this tree
Only if we want to be competitive. Otherwise you can simulate horizontal links by just running the entire other subtree in a subcomputation. In the case of iterated amplification, that couldn’t possibly change the speed of the training process, since only O(1) nodes are actually instantiated at a time anyway and the rest are distilled into the neural network. What would a horizontal link mean?
the intuition network as part of the state of each short-lived agent
The intuition network is a distillation of the vertical tree, it’s not part of the amplification process at all.
and that couldn’t be implemented without this horizontal linkage
I don’t think that’s right. Also, I don’t see how a ‘horizontal’ linkage would compare with a normal vertical linkage: just unroll the computation.
are ones where this sort of wide state is relevant and not easily compressible
The main thing I’m looking for is examples of particular kinds of state that you think are incompressible. For example, do you think modern science has developed kinds of understanding that couldn’t be distributed across many short-lived individuals (in a way that would let you e.g. use that knowledge to answer questions that a long-lived human could answer using that knowledge)?
Last time this came up, Eliezer used the example of calculus. But I claim that anything you can formalize can’t possibly have this character, since you can distribute those formal representations quite easily, with the role of intuition being to quickly reach conclusions that would take a long time using the formal machinery. That’s exactly the case where amplification works well. (This then led to the same problem of “if you just manipulate things formally, how can you tell that the hypothesis is just making predictions rather than doing something evil, e.g. can you tell that the theory isn’t itself an optimizer?”, which is what I mentioned in the grandparent.)
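As a toy illustration of distributing a formal representation (the expression encoding and the function name diff are mine, purely for illustration): symbolic differentiation can be carried out with each call looking only at the top-level operator of its sub-expression and delegating the rest, so no single step needs to hold the whole derivation.

```python
def diff(expr, var="x"):
    """Differentiate expr with respect to var.
    expr is a nested tuple like ("+", a, b) or ("*", a, b), the string "x", or a number."""
    if expr == var:
        return 1
    if isinstance(expr, (int, float)):
        return 0
    op, a, b = expr
    if op == "+":
        return ("+", diff(a, var), diff(b, var))           # each sub-call plays the role of a fresh agent
    if op == "*":
        return ("+", ("*", diff(a, var), b), ("*", a, diff(b, var)))   # product rule
    raise ValueError(f"unknown operator {op!r}")

# d/dx of x*x + 3x  ->  ((1*x + x*1) + (0*x + 3*1))
print(diff(("+", ("*", "x", "x"), ("*", 3, "x"))))
```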