Misunderstanding: I expect we can’t construct a counterfactual planner because we can’t pick out the compute core in the black-box learned model.
And my Eliezer’s problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.
we can’t pick out the compute core in the black-box learned model.
Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.
But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.
I don’t understand your second paragraph ‘And my Eliezer’s problem...’. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.
Oh, I wasn’t expecting you to have addressed the issue! 10.2.4 says L wouldn’t be S if it were calculated from projected actions instead of given actions. How so? Mightn’t it predict the given actions correctly?
You’re right on all counts in your last paragraph.
10.2.4 says L wouldn’t be S if it were calculated from projected actions instead of given actions. How so? Mightn’t it predict the given actions correctly?
Not sure if a short answer will help, so I will write a long one.
In 10.2.4 I talk about the possibility of an unwanted learned predictive function L−(s′,s,a) that makes predictions without using the argument a. This is possible for example by using s together with a (learned) model πl of the compute core to predict a: so a viable L− could be defined as L−(s′,s,a)=S(s′,s,πl(s)). This L− could make predictions fully compatible with the observational record o, but I claim it would not be a reasonable learned L according to the reasonableness criterion L≈S. How so?
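To make this concrete, here is a minimal toy sketch of my own (not from the paper), assuming a one-dimensional deterministic world where taking action a in integer state s always yields s+a, and a compute core whose policy model always picks a=1. The names S, pi_l and L_minus stand in for the S, πl and L− above.

```python
# Toy sketch (my own construction): a deterministic "true" predictive function S,
# a learned policy model pi_l of the compute core, and the unwanted predictor
# L_minus that ignores its action argument and substitutes pi_l(s) instead.

def S(s_next, s, a):
    """True predictive function: probability that action a in state s leads to s_next."""
    return 1.0 if s_next == s + a else 0.0

def pi_l(s):
    """Learned model of the compute core: it always takes action +1."""
    return 1

def L_minus(s_next, s, a):
    """Unwanted learned predictor: discards the given action a."""
    return S(s_next, s, pi_l(s))

# On observations produced by the real compute core (where a == pi_l(s)),
# L_minus agrees with S everywhere, so such a record o cannot tell them apart.
s = 3
a = pi_l(s)
for s_next in range(0, 8):
    assert L_minus(s_next, s, a) == S(s_next, s, a)
```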
The reasonableness criterion L≈S is similar to that used in supervised machine learning: we evaluate the learned L not primarily by how it matches the training set (how well it predicts the observations in o), but by evaluating it on a separate test set. This test set can be constructed by sampling S to create samples not contained in o. Mathematically, perfect reasonableness is defined as L=S, which implies that L predicts all samples from S fully accurately.
Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it explaining how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the S in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive machine learning of L, but another version can be used stand-alone to construct a test set.
A sampling action to construct a member of the test set would set up a desired state s and action a, and then observe the resulting s′. Mathematically speaking, this observation gives additional information about the numeric value of S(s′,s,a) and of S(s′′,s,a) for all s′′≠s′.
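In the deterministic toy world from the sketches above (my own illustration, not the paper's setup), a single such sampling action pins these values down completely:

```python
# One sampling action in the toy world: set up s and a, observe s_next,
# and record what that observation says about S.

def sample_S(s, a):
    """Run the experiment: observe the successor state of (s, a)."""
    return s + a            # toy deterministic dynamics (my assumption)

s, a = 3, -1                # the state and action we deliberately set up
s_next = sample_S(s, a)     # observed outcome: 2

# In this deterministic toy, the observation fixes S(s_next, s, a) = 1
# and S(s'', s, a) = 0 for every s'' != s_next.
print((s, a, s_next))       # (3, -1, 2): one new member of the test set
```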
I discuss in the section that, if we take an observational record o sampled from S, then two learned predictive functions L1 and L2 could be found which are both fully compatible with all observations in o. So to determine which one might be a more reasonable approximation of S, we can see how well they would each predict samples not yet in o.
In the case of section 10.2.4, the crucial experimental test showing that L− is an unreasonable approximation of S is one where we create a test set by setting up an st and an at where we know that at is an action that would definitely not be taken by the real compute core software running in the agent when it encounters state st. So we set up a test where we expect that at≠πl(st). L− will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that L− is a correct theory of S.
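Continuing the toy sketch from above (again my own construction, not the paper's), this is what that falsifying test looks like; it also makes concrete the earlier point about two predictors that both fit the on-policy record o yet come apart on a new sample.

```python
# Falsification test from 10.2.4, toy version: choose (s_t, a_t) with
# a_t != pi_l(s_t), observe the true outcome, and compare the predictions
# of the true S and the unwanted L_minus.

def S(s_next, s, a):
    return 1.0 if s_next == s + a else 0.0   # true dynamics of the toy world

def pi_l(s):
    return 1                                 # policy model: always action +1

def L_minus(s_next, s, a):
    return S(s_next, s, pi_l(s))             # ignores the given action a

s_t, a_t = 3, -1                             # deliberately chosen so a_t != pi_l(s_t)
s_next_observed = s_t + a_t                  # the experiment actually yields 2

print(S(s_next_observed, s_t, a_t))          # 1.0: S predicts the observation
print(L_minus(s_next_observed, s_t, a_t))    # 0.0: L_minus mis-predicts, so L_minus != S
```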
As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration: random exploration causes the observational record o, the training set, to already contain observations where at≠πl(st) for any deterministic πl. So this will likely suppress the creation of an unwanted L− via machine learning.
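A hedged sketch of that point (my own toy, assuming simple ε-random exploration rather than any particular scheme from the paper): even a modest exploration rate puts transitions with at≠πl(st) into the training record o, so a learner fitting o is already penalized for collapsing to L−.

```python
# Toy illustration: with epsilon-random exploration, the training record o already
# contains off-policy transitions, i.e. ones where the taken action differs from
# the deterministic policy model pi_l, so fitting o already works against L_minus.
import random

def pi_l(s):
    return 1                                   # deterministic policy model

def explore_action(s, epsilon=0.2):
    """Usually follow pi_l, but with probability epsilon take a random action."""
    return random.choice([-1, 0, 1]) if random.random() < epsilon else pi_l(s)

random.seed(0)
o, s = [], 0
for _ in range(200):
    a = explore_action(s)
    s_next = s + a                             # true toy dynamics
    o.append((s, a, s_next))
    s = s_next

off_policy = [t for t in o if t[1] != pi_l(t[0])]
print(len(off_policy), "of", len(o), "recorded transitions have a != pi_l(s)")
```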
Some background: the symbol grounding issue I discuss in 10.2.4 is closely related to the five-and-ten problem you can find in MIRI's work on embedded agency. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it. But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.
Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper's view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.
I would be interested to know if the above explanation was helpful to you, and if so which parts.