I don’t understand how we could safely decompose the task of emulating 1 year’s worth of von Neumann-caliber general reasoning on some scientific problem. (I’m assuming something like this is necessary for a pivotal act; maybe it’s possible to build nanotech or whole-brain emulations without such reasoning being automated, in which case my picture for the world becomes rosier.)
This reads like a type error: you don’t decompose the task “emulate someone spending 1 year solving a scientific problem”; you decompose the problem.
You’re right—I edited my comment accordingly. But my confusion still stands. Say the problem is “figure out how to upload a human and run him at 10,000x”. On my current view:
(1) However you decompose this problem, you’d need something equivalent to at least 1 year’s worth of a competent scientist doing general reasoning to solve this problem.
(2) In particular, this general reasoning would require the ability to accumulate new knowledge and synthesize it to make novel inferences.
(3) This sort of reasoning would end up happening on a “virtual machine AGI” built out of “human transistors”.
(4) Unless we know how to ensure cognition is safe (e.g. daemon-free), we wouldn’t know how to make safe “virtual machine AGIs”.
(5) So either we aren’t able to perform this reasoning (because it’s unsafe and recognized as such), or we perform it anyway unsafely, which may lead to catastrophic outcomes.
Which of these points would you say you agree with? (Alternatively, if my picture of the situation seems totally off, could you help show me where?)
(1) However you decompose this problem, you’d need something equivalent to at least 1 year’s worth of a competent scientist doing general reasoning to solve this problem.
To clarify: your position is that 100,000 scientists thinking for a week each, one after another, could not replicate the performance of one scientist thinking for 1 year?
I could imagine believing something like that for certain problems requiring unusual creativity or complex concepts that need to be manipulated intuitively. And I could separately imagine having that view for low-bandwidth oversight, where we are talking about humans each of whom gets only <100 bits of input.
I don’t understand at all how that could be true for brain uploading at the scale of a week vs. year.
Solving this problem requires considering multiple possible approaches. Those can’t be decomposed with 100% efficiency, but it sure seems like they can be split up across people.
Evaluating an approach requires considering a bunch of different possible constraints, considering a bunch of separate steps, building models of relevant phenomena, etc.
Building models requires considering several hypotheses and modeling strategies. Evaluating how well a hypothesis fits the data involves considering lots of different observations. And so on.
To clarify: your position is that 100,000 scientists thinking for a week each, one after another, could not replicate the performance of one scientist thinking for 1 year?
Actually I would be surprised if that’s the case, and I think it’s plausible that large teams of scientists thinking for one week each could safely replicate arbitrary human intellectual progress.
But if you replaced 100,000 scientists thinking for a week each with 1,000,000,000,000 scientists thinking for 10 minutes each, I’d feel more skeptical. In particular I think 10,000,000 10-minute scientists can’t replicate the performance of one 1-week scientist, unless the 10-minute scientists become human transistors. In my mind there isn’t a qualitative difference between this scenario and the low-bandwidth oversight scenario. It’s specifically dealing with human transistors that I worry about.
I also haven’t thought too carefully about the 10-minute-thought threshold in particular and wouldn’t be too surprised if I revised my view here. But if we replaced “10,000,000 10-minute scientists” with “arbitrarily many 2-minute scientists” I would be even more inclined to think we couldn’t assemble the scientists safely.
I’m assuming in all of this that the scientists have the same starting knowledge.
There’s an old SlateStarCodex post that’s a reasonable intuition pump for my perspective. It seems to me that the HCH-scientists’ epistemic process is fundamentally similar to that of the alchemists. And the alchemists’ thoughts were constrained by their lifespans, which they partially overcame by distilling past insights to future generations of alchemists. But there still remained massive constraints on their thoughts, and I imagine qualitatively similar constraints present for HCHs.
I also imagine them to be far more constraining if “thought-lifespans” shrank from ~30 years to ~30 minutes. But “thought-lifespans” on the order of ~1 week might be long enough that the overhead from learning distilled knowledge (knowledge = intellectual progress from other parts of the HCH, representing maybe decades or centuries of human reasoning) is small enough (on the order of a day or two?) that individual scientists can hold in their heads all the intellectual progress made thus far and make useful progress on top of that, without any knowledge having to be distributed across human transistors.
I don’t understand at all how that could be true for brain uploading at the scale of a week vs. year.
Solving this problem requires considering multiple possible approaches. Those can’t be decomposed with 100% efficiency, but it sure seems like they can be split up across people.
Evaluating an approach requires considering a bunch of different possible constraints, considering a bunch of separate steps, building models of relevant phenomena, etc.
Building models requires considering several hypotheses and modeling strategies. Evaluating how well a hypothesis fits the data involves considering lots of different observations. And so on.
I agree with all this.
EDIT: In summary, my view is that:
if all the necessary intellectual progress can be distilled into individual scientists’ heads, I feel good about HCH making a lot of intellectual progress.
if the agents are thinking long enough (1 week seems long enough to me, 30 minutes doesn’t), this distillation can happen.
if this distillation doesn’t happen, we’d have to end up doing a lot of cognition on “virtual machines”, and cognition on virtual machines is unsafe.
There’s an old SlateStarCodex post that’s a reasonable intuition pump for my perspective. It seems to me that the HCH-scientists’ epistemic process is fundamentally similar to that of the alchemists. And the alchemists’ thoughts were constrained by their lifespans, which they partially overcame by distilling past insights to future generations of alchemists. But there still remained massive constraints on their thoughts, and I imagine qualitatively similar constraints present for HCHs.
I also imagine them to be far more constraining if “thought-lifespans” shrank from ~30 years to ~30 minutes. But “thought-lifespans” on the order of ~1 week might be long enough that the overhead from learning distilled knowledge (knowledge = intellectual progress from other parts of the HCH, representing maybe decades or centuries of human reasoning) is small enough (on the order of a day or two?) that individual scientists can hold in their heads all the intellectual progress made thus far and make useful progress on top of that, without any knowledge having to be distributed across human transistors.
In order for this to work, you need to be able to break apart the representation of the knowledge as well as the actual work they are doing. For example, you need to be able to pass around objects like “The theory that reality is the unique object satisfying both constraints {A} and {B}”, with one person responsible for representing {A} and another responsible for representing {B}.
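Here’s a toy sketch of the kind of representation I have in mind (the class names and the two example constraints are made up purely for illustration, not anything from the actual scheme): the parent holds only pointers to sub-agents, and answering a question about the composite theory just means routing sub-queries to whoever is responsible for the relevant piece.

```python
class SubAgent:
    """Responsible for representing a single constraint; it only answers
    narrow queries about its own piece of the theory."""
    def __init__(self, name, satisfies):
        self.name = name
        self.satisfies = satisfies   # predicate over candidate models

    def check(self, candidate):
        return self.satisfies(candidate)


class CompositeTheory:
    """What the parent passes around: a bundle of pointers, with no
    constraint content held locally."""
    def __init__(self, parts):
        self.parts = parts           # references to SubAgents

    def admits(self, candidate):
        # The parent never inspects a constraint; it only routes queries.
        return all(part.check(candidate) for part in self.parts)


# Toy stand-ins for constraints {A} and {B}.
A = SubAgent("A", lambda model: model["mass"] > 0)
B = SubAgent("B", lambda model: model["charge"] in (-1, 0, 1))
theory = CompositeTheory([A, B])

print(theory.admits({"mass": 1.0, "charge": 0}))    # True
print(theory.admits({"mass": -1.0, "charge": 0}))   # False
```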
My impression of your concern is that, if knowledge is represented this way instead of in a particular scientist’s head, then they can’t manipulate it well without being transistors.
Do you have some particular kinds of manipulation in mind, that humans are able to do with knowledge in their head, but you don’t think a group of humans can do if the knowledge is distributed across all of them?
One family of concerns people have raised is about the optimization done within amplification:
Sometimes humans solve problems with a stroke of creative insight. These cases can be simulated by a brute force search for solutions, perhaps using samples generated by the human proposal distribution. But then we are introducing a powerful optimization, which may e.g. turn up an attack on the solution-evaluating process. The proposal-evaluating process can be much “larger” than the brute force search, so the question is really whether with amplification we can construct a sufficiently secure solution-evaluator. I think the most interesting question for security there is whether the “evaluate a solution” process is itself decomposable with low bandwidth oversight (though there are other ways that security could be unachievable).
If they need to represent a hypothesis about reality by doing purely mechanical calculations and observing that it predicts well, then maybe that hypothesis will be an optimization daemon. I think there are cases of “opaque” hypotheses where humans can’t break up the internal structure. But an optimization daemon has to actually think thoughts, including thoughts about how to e.g. subvert the system. So it seems to me that as long as understanding those thoughts is a task that is decomposable, we can defend against optimization daemons by looking over a hypothesis and evaluating whether it’s doing anything bad.
In these cases, it seems to me like the putatively indecomposable task is OK, as long as you can solve some other tasks by amplification (doing secure evaluation of proposed solutions, evaluating a hypothesis to test if it is doing problematic optimization). In these cases, it seems to me like the constituent tasks are easier in a qualitative sense (e.g. if I do some search and want to evaluate whether a hypothesis is a daemon, I’m only going to have to do easier searches within that evaluation—namely, the kinds of searches that are done internally by the daemon in order to make sense of the world), such that we aren’t going to get a loop and can carry out an induction.
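To make the first of these cases concrete, here is a minimal sketch of the “brute force search plus an evaluator” shape (the proposal distribution and the scoring function below are placeholder stand-ins I made up): all of the security burden lands on the evaluator, which is why the interesting question is whether secure evaluation is itself decomposable.

```python
import random

def propose(rng):
    """Stand-in for a sample from the human proposal distribution."""
    return [rng.randint(0, 9) for _ in range(4)]

def secure_evaluate(solution):
    """Stand-in for the amplified solution-evaluating process. All of the
    security burden lands here: a bad proposal only wins if this evaluator
    can be fooled into scoring it highly."""
    target = [3, 1, 4, 1]
    return -sum(abs(a - b) for a, b in zip(solution, target))

def brute_force_insight(num_proposals=50_000, seed=0):
    """Replace a 'stroke of insight' with search over many cheap proposals."""
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(num_proposals)]
    return max(candidates, key=secure_evaluate)

print(brute_force_insight())  # the best-scoring sampled proposal
```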
Another family of concerns is that humans have indecomposable abilities:
Perhaps a human has learned to do task X, and a good algorithm for X is now encoded in the weights of their brain, and can only be used by running their brain on the same inputs they encountered while learning to do task X. (Thanks to Wei Dai for pointing out this tight impossibility argument, and I discussed it a bit under “An Example Obstruction” in the original post.) In particular, there is no way to get access to this knowledge with low bandwidth oversight. In the case of scientific inquiry, accessing the scientist’s training may require having the human actually hold an entire scientific hypothesis in their head.
In this case we can’t recover “ability at task X” by amplification except by redoing it from scratch. If the human’s knowledge about task X depended on facts about the external world, then we can’t recover that knowledge except by interacting with the external world.
But we already knew that amplification wasn’t going to encode empirical knowledge about the world without interacting with the world; the point was to converge to a good policy for handling empirical data as it comes in. The real question is whether HCH converges to arbitrarily sophisticated behavior in the limit. To answer that question we’d want to ask: if the human had never trained to do task X, would they still be “universal” in some appropriate sense?
To answer that question, our example of something indecomposable can’t just be a task where empirical information about the world (or logical information too expensive to be learned via the amplification process) is encoded in the human’s brain, because we are happy to drop empirical information about the world and instead learn a policy that maps {data} --> {behavior}, and give that policy access to all the empirical information it needs.
Does your concern fit in one of those two categories, or in some different category?
These cases can be simulated by a brute force search for solutions, perhaps using samples generated by the human proposal distribution. But then we are introducing a powerful optimization, which may e.g. turn up an attack on the solution-evaluating process. The proposal-evaluating process can be much “larger” than the brute force search, so the question is really whether with amplification we can construct a sufficiently secure solution-evaluator.
I’m actually not sure the brute force search gives you what you’re looking for here. There needs to be an ordering on solutions-to-evaluate such that you can ensure the evaluators are pointed at different solutions and cover the whole solution space (this is often possible, but not always possible; consider solutions with real variables where a simple discretization is not obviously valid). Even if this is the case, it seems like you’re giving up on being competitive on speed by saying “well, we could just use brute force search.” (It also seems to me like you’re giving up on safety, as you point out later; one of the reasons why heuristic search methods for optimization seem promising to me is because you can also be doing safety evaluation there, such that more dangerous solutions are less likely to be considered in the first place.)
My intuition is that many numerical optimization search processes have “wide” state, in that you are thinking about the place where you are right now, the places you’ve been before, and previous judgments you’ve made about places to go. Sometimes this state is not actually wide because it can be compressed very nicely; for example, in the simplex algorithm, my state is entirely captured by the tableau, and I can spin up different agents to take the tableau and move it forward one step and then pass the problem along to another agent. But my intuition is that those will be cases where we’re not really concerned about daemons or misalignment of the optimization process itself, because the whole procedure is simple enough that we understand how everything works together well.
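As a toy version of the “nicely compressible” case (the particular LP and tableau layout are mine, assuming all-≤ constraints with nonnegative right-hand sides): each short-lived agent sees only the tableau handed to it, performs one pivot, and disappears.

```python
import numpy as np

def one_pivot_step(tableau):
    """A 'short-lived agent': it receives the whole compressed state (the
    tableau), performs a single simplex pivot, and hands the result back.
    It keeps no memory between calls."""
    obj = tableau[-1, :-1]
    col = int(np.argmin(obj))              # entering column: most negative reduced cost
    if obj[col] >= 0:
        return tableau, True               # no negative reduced costs: optimal
    ratios = np.full(tableau.shape[0] - 1, np.inf)
    for i in range(tableau.shape[0] - 1):  # ratio test picks the leaving row
        if tableau[i, col] > 1e-12:
            ratios[i] = tableau[i, -1] / tableau[i, col]
    row = int(np.argmin(ratios))
    tableau = tableau.copy()
    tableau[row] /= tableau[row, col]      # pivot
    for i in range(tableau.shape[0]):
        if i != row:
            tableau[i] -= tableau[i, col] * tableau[row]
    return tableau, False

def solve_with_short_lived_agents(tableau, max_steps=100):
    """Driver: spin up a fresh agent per step, passing along only the tableau."""
    for _ in range(max_steps):
        tableau, done = one_pivot_step(tableau)
        if done:
            break
    return tableau

# maximize 3x + 2y  subject to  x + y <= 4,  x + 3y <= 6,  x, y >= 0
# columns: [x, y, slack1, slack2, rhs]; the last row holds the negated objective.
T = np.array([[1.0, 1, 1, 0, 4],
              [1.0, 3, 0, 1, 6],
              [-3.0, -2, 0, 0, 0]])
print(solve_with_short_lived_agents(T)[-1, -1])   # optimal value: 12.0
```

The handoff works precisely because the tableau really is the whole state.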
But if the state is wide or deep, then it seems like this strategy is probably going to run into obstacles. We either attempt to implement something deep as the equivalent of recursive function calls, or we discover that we have too much state to successfully pass around, and thus there’s not really a meaningful sense in which we can have separate short-lived agents (or not really a meaningful sense in which we can be competitive with agents that do maintain all that state).
For example, think about implementing tree search for games in this way. No single agent sees the whole tree; each one only determines which children to pass messages to and what message to return to its parent. If we think that the different branches are totally distinct from each other, then we only need vertical message-passing and we can have separate short-lived agents (although it’s sort of hard to see the difference between an agent that’s implementing tree-search in one thread and in many threads, because of how single agents can implement recursive functions). But if we think that the different branches are mutually informative, then we want to have a linkage between those branches, which means horizontal links in this tree. (To be clear, AlphaGo has everything call an intuition network which is only trained between games, and thus could be implemented in a ‘vertical’ fashion if you have the intuition network as part of the state of each short-lived agent, but you could imagine an improvement on AlphaGo that’s refining its intuition as it considers branches in the game that it’s playing, and that couldn’t be implemented without this horizontal linkage.)
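Roughly, here is the purely ‘vertical’ version written as a recursive function on a toy subtraction game (the game and the scoring are placeholders of my own): each node-agent sees only the position its parent passed down and the values its children pass back up, and sibling subtrees never communicate.

```python
def legal_moves(n):
    """Moves available in the toy game: subtract 1 or 2 stones."""
    return [m for m in (1, 2) if m <= n]

def node_agent(n, maximize):
    """One fresh 'agent' per node: it sees only the position passed down by
    its parent and the scores returned by its children; there is no
    horizontal channel between sibling subtrees."""
    if n == 0:
        # The player who just moved took the last stone and won,
        # so score from the maximizer's perspective:
        return -1 if maximize else 1
    scores = [node_agent(n - m, not maximize) for m in legal_moves(n)]
    return max(scores) if maximize else min(scores)

# Toy game: players alternately subtract 1 or 2; taking the last stone wins.
print(node_agent(4, True))   # 1: the player to move can win from 4
```

Any game-specific intuition that both subtrees were supposed to be updating as the search proceeds has nowhere to live in this structure; that’s the horizontal linkage I’m pointing at.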
My sense is that the sorts of creative scientific or engineering problems that we’re most interested in are ones where this sort of wide state is relevant and not easily compressible, such that I could easily imagine a world where it takes the scientist a week to digest everything that’s happened so far, and then they don’t have any time to actually move things forward before vanishing and being replaced by a scientist who spends a week digesting everything, and so on.
As a side note, I claim the ‘recursive function’ interpretation implies that the alignment of the individual agents is irrelevant (so long as they faithfully perform their duties), and the question of whether tree search was the right approach (and whether the leaf evaluation function is good) becomes central to evaluating alignment. This might be something like one of my core complaints: it seems like we’re just passing the alignment buck to the strategy of how to integrate many small bits of computation into a big bit of computation, and that problem seems just as hard as the regular alignment problem.
Even if this is the case, it seems like you’re giving up on being competitive on speed by saying “well, we could just use brute force search.”
The efficiency of the hypothetical amplification process doesn’t directly have much effect on the efficiency of the training process. It affects the number of “rounds” of amplification you need to do, but the rate is probably limited mostly by the ability of the underlying ML to learn new stuff.
There needs to be an ordering on solutions-to-evaluate such that you can ensure the evaluators are pointed at different solutions and cover the whole solution space
You can pick randomly.
(It also seems to me like you’re giving up on safety, as you point out later; one of the reasons why heuristic search methods for optimization seem promising to me is because you can also be doing safety evaluation there, such that more dangerous solutions are less likely to be considered in the first place.)
I agree that this merely reduces the problem of “find a good solution” to “securely evaluate whether a solution is good” (that’s what I was saying in the grandparent).
or we discover that we have too much state to successfully pass around, and thus there’s not really a meaningful sense in which we can have separate short-lived agents
The idea is to pass around state by distributing it across a large number of agents. Of course it’s an open question whether that works; that’s what we want to figure out.
(or not really a meaningful sense in which we can be competitive with agents that do maintain all that state)
Again, the hypothetical amplification process is not intended to be competitive; that’s the whole point of iterated amplification.
But if we think that the different branches are mutually informative, then we want to have a linkage between those branches, which means horizontal links in this tree
Only if we want to be competitive. Otherwise you can simulate horizontal links by just running the entire other subtree in a subcomputation. In the case of iterated amplification, that couldn’t possibly change the speed of the training process, since only O(1) nodes are actually instantiated at a time anyway and the rest are distilled into the neural network. What would a horizontal link mean?
the intuition network as part of the state of each short-lived agent
The intuition network is a distillation of the vertical tree; it’s not part of the amplification process at all.
and that couldn’t be implemented without this horizontal linkage
I don’t think that’s right. Also, I don’t see how a ‘horizontal’ linkage would compare with a normal vertical linkage: just unroll the computation.
are ones where this sort of wide state is relevant and not easily compressible
The main thing I’m looking for is examples of particular kinds of state that you think are incompressible. For example, do you think modern science has developed kinds of understanding that couldn’t be distributed across many short-lived individuals (in a way that would let you e.g. use that knowledge to answer questions that a long-lived human could answer using that knowledge)?
Last time this came up Eliezer used the example of calculus. But I claim that anything you can formalize can’t possibly have this character, since you can distribute those formal representations quite easily, with the role of intuition being to quickly reach conclusions that would take a long time using the formal machinery. That’s exactly the case where amplification works well. (This then led to the same problem with “if you just manipulate things formally, how can you tell that the hypothesis is just making predictions rather than doing something evil, e.g. can you tell that the theory isn’t itself an optimizer?”, which is what I mentioned in the grandparent.)