I don’t expect your AGI development model to be the first to market: The world-model alone could be used to write a pivotal AGI. An idealized interactive theorem prover consists of a brute-force proof searcher and a human intuiter of what lemmata would be useful as stepping stones. Use the world-model to predict the human. Use the same mechanism without a theorem given to produce conjectures to prove next. Run it in the general direction of AI safety research until you can build a proper FAI. Only the inner alignment problem remains: The intuiter might, instead of the most mathematically useful conjectures, produce ones that make the resulting AGI’s decision theory reward those who brought it about, such as whatever mesa-optimizer nudged the intuiter. Therefore I posit that interpretability, rather than the likes of embedded agency or value learning, should be the focus of our research.
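A minimal sketch of the pipeline I have in mind, assuming only a queryable world-model and a budgeted brute-force prover; the `WorldModel` and `Prover` interfaces below are hypothetical stand-ins, not existing libraries:

```python
# Hypothetical sketch of the intuiter-plus-prover loop described above.
# WorldModel stands in for "predict what a human mathematician would write";
# Prover stands in for a budgeted brute-force proof search.

class WorldModel:
    def conjecture(self, corpus: list[str]) -> str:
        """Predict the next conjecture a human researcher would pose."""
        raise NotImplementedError

    def suggest_lemmas(self, goal: str, corpus: list[str]) -> list[str]:
        """Predict which stepping-stone lemmata a human would try next."""
        raise NotImplementedError


class Prover:
    def attempt(self, statement: str, known: list[str], budget: int) -> bool:
        """Brute-force proof search, cut off after a fixed budget."""
        raise NotImplementedError


def research_loop(model: WorldModel, prover: Prover,
                  corpus: list[str], steps: int) -> list[str]:
    proved: list[str] = []
    for _ in range(steps):
        goal = model.conjecture(corpus + proved)            # what to aim at next
        for lemma in model.suggest_lemmas(goal, corpus + proved):
            if prover.attempt(lemma, known=proved, budget=10_000):
                proved.append(lemma)                        # keep stepping stones
        if prover.attempt(goal, known=proved, budget=100_000):
            proved.append(goal)
    return proved
```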
Therefore I posit that interpretability, rather than the likes of embedded agency or value learning, should be the focus of our research.
FWIW I’m very strongly in favor of interpretability research.
I’m slightly pessimistic that insight into embedded agency will be helpful for solving the alignment problem—at least based on my personal experience of (A) thinking about how human brains reason about themselves, (B) thinking about the alignment problem, (C) noting that the former activity has not been helping me with the latter activity. (Maybe there are other reasons to work on embedded agency, I dunno.)
(Sorry in advance if I’m misunderstanding.)
Consider how a human might look at a differential equation and say to themselves, “maybe I can differentiate both sides”, or “maybe this is a linear algebra problem in disguise”. You need all three components for that, not just the world-model. “Deciding to think a certain thought” is the same as “deciding to take a certain action”, in that you need a planner / actor that can generate any of a large set of possible thoughts (or thought sequences) and you need a value function that learns which meta-cognitive strategies have been successful in the past (e.g. “in this type of situation, it’s helpful to think that type of thought”).
When we want a system to come up with new ideas—beyond the ideas in the training data—we face the problem that there’s a combinatorial explosion of things to try. I’m pessimistic about making great headway on this problem without using RL to learn meta-cognitive strategies. (Just as supervised learning on human chess moves yields play about as good as a human’s, but not much better, whereas RL can do much better.)
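To make “RL over meta-cognitive strategies” concrete, here’s a toy sketch; the state descriptions and thought-type labels are placeholders I’m making up, not a proposal for the real thing:

```python
import random
from collections import defaultdict

# Toy value function over meta-cognitive strategies: in a coarsely-described
# situation, how promising is each type of thought ("differentiate both
# sides", "look for a linear-algebra framing", ...)? Learned from whether
# past thoughts of that type paid off.

class ThoughtChooser:
    def __init__(self, epsilon: float = 0.1, lr: float = 0.1):
        self.value = defaultdict(float)  # (state, thought_type) -> estimated payoff
        self.epsilon = epsilon
        self.lr = lr

    def choose(self, state: str, thought_types: list[str]) -> str:
        if random.random() < self.epsilon:                  # occasionally explore
            return random.choice(thought_types)
        return max(thought_types, key=lambda t: self.value[(state, t)])

    def update(self, state: str, thought_type: str, reward: float) -> None:
        key = (state, thought_type)
        self.value[key] += self.lr * (reward - self.value[key])
```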
If I’m wrong, and RL isn’t necessary after all, why not use GPT-3? You can already try the experiment of having GPT-3 throw out theorem / lemma ideas and having an automated theorem prover try to prove them. Sounds fun. I mean, I don’t expect it to work, except maybe on toy examples, but who knows. Even if it did work, I don’t see how it would help with AI safety research. I don’t see the connection between automatically proving a bunch of mathematical theorems and making progress on AI safety research. I suspect that the hard part is reducing the AI safety problem to solvable math problems, not actually solving those math problems. I’m curious what you have in mind there.
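If anyone wants to actually run that experiment, I imagine it looking roughly like the sketch below. `propose_lemmas` is a stand-in for a GPT-3 call, and the prover command line and success check are assumptions about whatever TPTP prover happens to be installed, not a tested pipeline:

```python
import subprocess
import tempfile

def propose_lemmas(context: str, n: int = 5) -> list[str]:
    """Stand-in for a GPT-3 completion call that returns n candidate lemma
    statements in TPTP first-order syntax. Hypothetical, not a real API."""
    raise NotImplementedError

def try_to_prove(tptp_axioms: str, tptp_conjecture: str,
                 prover_cmd: tuple[str, ...] = ("eprover", "--auto")) -> bool:
    """Write a TPTP problem file and hand it to an external prover.
    The command and the success check depend on which prover you use."""
    with tempfile.NamedTemporaryFile("w", suffix=".p", delete=False) as f:
        f.write(tptp_axioms + "\n" + tptp_conjecture + "\n")
        path = f.name
    out = subprocess.run(list(prover_cmd) + [path], capture_output=True, text=True)
    return "Theorem" in out.stdout          # e.g. an "SZS status Theorem" line

def experiment(tptp_axioms: str, context: str) -> list[str]:
    proved = []
    for lemma in propose_lemmas(context):
        conjecture = f"fof(candidate, conjecture, {lemma})."
        if try_to_prove(tptp_axioms, conjecture):
            proved.append(lemma)
    return proved
```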
Wow, I’m gonna have to get used to ‘Steven’.
I agree that my theory predicts that GPT can be used for this. You don’t need an extra value function if the world you’re modelling already contains optimizers. You will be better at predicting what a good idea guy will say next if you can tell good ideas from bad. That GPT-3 hasn’t proved pivotal is evidence against this, but if they keep scaling up GPT, my timelines are very short.
That supervised learners can play chess as well as humans means they haven’t memorized human games, but have instead learned human chess-playing patterns to apply. Pattern-matching previously seen math onto the current problem is enough to automate human math research—it’s how humans do it. If you have a really good corpus of AI safety math, I expect it would be easy to map our problems onto it.
Compare to category theory, a mathematical language in which most short sentences mean something. You can make progress on many problems merely by translating them into a short sentence in the language of category theory.
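For illustration, here’s the kind of short categorical sentence I mean; this is just the standard universal property of a product, nothing specific to this discussion:

```latex
% "A \times B is the product of A and B", as one short sentence:
\forall X \;\; \forall f\colon X \to A \;\; \forall g\colon X \to B \;\;
\exists!\, h\colon X \to A \times B \quad \text{such that} \quad
\pi_A \circ h = f \ \text{ and } \ \pi_B \circ h = g.
% In Set this gives the cartesian product; in the natural numbers ordered by
% divisibility it gives gcd; in a powerset ordered by inclusion, intersection.
```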
If your model can predict which lemmata a human intuiter would suggest, your model can predict what a human mathematician would think to himself, and what he would decide to write into math chatrooms, blog posts and/or papers. (I wouldn’t have said that before seeing GPT-2.) Putting it in terms of an intuiter merely points out that the deep, difficult work that distinguishes a good mathematician from a mediocre one is exactly the sort of pattern-matching that neural networks are good at.
If you have a really good corpus of AI safety math, I expect it would be easy to map our problems onto it.
Can you give an example of what you have in mind? Start with “I want my AGI, after ML training, to be trying to help me”. Then we break this grand challenge down and end up with math theorems … how? What do the theorems look like? What do they say? How do they relate back to the fuzzy concept of “trying to help”?
Pattern-matching previously seen math onto the current problem is enough to automate human math research—it’s how humans do it.
FWIW I disagree with this. Humans started knowing no math whatsoever. They searched for patterns, found them, and solidified those patterns into structured models with objects and relations and ways to manipulate these things and so on. Repeat, repeat. It’s not straightforward because there is a combinatorial explosion of different possible patterns that might match, different ways to structure the knowledge, different analogies to apply. Just like there is a combinatorial explosion of possible chess strategies. You need to search through the space, using sophisticated RL that hunts for hints of success and learns the kinds of patterns and strategies that work based on an endless array of subtle contextual clues. Like, ask a mathematician if the very first way they thought about some concept, the first time they came across it, was the right way. Ask a mathematician if they learned a profoundly new idea in the past week. I don’t think it’s just applying the same patterns over and over...
I suspect supervised learners would play chess much worse than humans, because humans can get into a new configuration and figure out what to do by mentally playing through different possibilities, whereas supervised learners are just trying to apply the things that humans already know, unable to figure out anything new. (This is an empirical question, I didn’t look it up, I’m just guessing.)
You point your math research generator at AI safety. It starts analyzing the graph of what programs will self-modify into what programs; which subgraphs are closed under successors; how you might define structure-preserving morphisms on graphs like this one; what existing category theory applies to the resulting kind of morphism. It finds the more general question of which agents achieve their goals in an environment containing which programs. (Self-modification is equivalent to instantiating such a program.) It finds a bunch of properties that such agents and programs can have—for example, tool AIs that help large classes of agents so long as they don’t ask questions that turn the tool AIs into agent AIs. And then you find a theorem that specializes to “You win if you run a program from this set”, and you do so.
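The first of those steps, as a toy data structure I’m making up for illustration: a directed “self-modifies-into” graph over program labels, plus a check for whether a set of programs is closed under successors:

```python
# Toy sketch: programs as labels, "p may self-modify into q" as edges.
SelfModGraph = dict[str, set[str]]  # program -> programs it may become

def closed_under_successors(graph: SelfModGraph, subset: set[str]) -> bool:
    """True iff no program in `subset` can modify into one outside it."""
    return all(graph.get(p, set()) <= subset for p in subset)

def successor_closure(graph: SelfModGraph, start: set[str]) -> set[str]:
    """Smallest successor-closed set of programs containing `start`."""
    result, frontier = set(start), list(start)
    while frontier:
        p = frontier.pop()
        for q in graph.get(p, set()):
            if q not in result:
                result.add(q)
                frontier.append(q)
    return result
```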
ask a mathematician
Sure, I don’t see the solution to every question immediately; I have to turn the question over in my head, decompose it, find other questions that amount to the same thing, until I can solve one of them immediately. And the described generator could do the same, because it would generate the decompositions and rephrasings as lemmata, or as extra conjectures. We would of course need to keep relevant theorems it has already proved in scope so it can apply them, by (very cheap) fine-tuning or tetris-like context window packing. And yes, this gives whatever mesa-optimizer is present an opportunity to entrench itself in the model.
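By “tetris-like context window packing” I mean something like the following greedy sketch; the relevance scorer and token counter are placeholders (say, embedding similarity and a tokenizer):

```python
from typing import Callable

def pack_context(goal: str, theorems: list[str], budget: int,
                 relevance: Callable[[str, str], float],
                 tokens: Callable[[str], int]) -> list[str]:
    """Greedily pack the most relevant already-proved theorems into a fixed
    token budget, so the generator can apply them to `goal`."""
    ranked = sorted(theorems, key=lambda t: relevance(goal, t), reverse=True)
    packed, used = [], 0
    for thm in ranked:
        cost = tokens(thm)
        if used + cost <= budget:
            packed.append(thm)
            used += cost
    return packed
```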