If you have a really good corpus of AI safety math, I expect it would be easy to map our problems onto it.
Can you give an example of what you have in mind? Start with “I want my AGI, after ML training, to be trying to help me”. Then we break this grand challenge down and end up with math theorems … how? What do the theorems look like? What do they say? How do they relate back to the fuzzy concept of “trying to help”?
Pattern-matching previously seen math onto the current problem is enough to automate human math research—it’s how humans do it.
FWIW I disagree with this. Humans started knowing no math whatsoever. They searched for patterns, found them, and solidified those patterns into structured models with objects and relations and ways to manipulate these things and so on. Repeat, repeat. It’s not straightforward, because there is a combinatorial explosion of different possible patterns that might match, different ways to structure the knowledge, different analogies to apply. Just like there is a combinatorial explosion of possible chess strategies. You need to search through the space, using sophisticated RL that hunts for hints of success and learns the kinds of patterns and strategies that work based on an endless array of subtle contextual clues. Like, ask a mathematician if the very first way they thought about some concept, the first time they came across it, was the right way. Ask a mathematician if they learned a profoundly new idea in the past week. I don’t think it’s just applying the same patterns over and over...
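To caricature the kind of search I’m describing (a loose toy sketch; every function and name below is a placeholder invented for illustration, not a real math-research system): you generate candidate patterns for a problem, score them by hints of partial success, and let those scores bias which kinds of patterns get tried next.

```python
# Loose toy sketch of pattern search with learned preferences. All callables
# are placeholders; nothing here is a real math-research system.
import random
from collections import defaultdict
from typing import Callable, Dict, List

def search_for_patterns(
    propose: Callable[[str], List[str]],            # candidate patterns for a problem
    partial_reward: Callable[[str, str], float],    # "hints of success" signal
    problems: List[str],
    steps: int = 1000,
) -> Dict[str, float]:
    weights: Dict[str, float] = defaultdict(float)  # learned preference per pattern
    for _ in range(steps):
        problem = random.choice(problems)
        candidates = propose(problem)
        # Noisy-greedy pick: explore, but favour patterns that have paid off before.
        pick = max(candidates, key=lambda c: weights[c] + random.random())
        weights[pick] += partial_reward(problem, pick)
    return weights
```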
I suspect supervised learners would play chess much worse than humans, because humans can get into a new configuration and figure out what to do by mentally playing through different possibilities, whereas supervised learners are just trying to apply the things that humans already know, unable to figure out anything new. (This is an empirical question, I didn’t look it up, I’m just guessing.)
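To make the contrast concrete, here is a hedged sketch (the game-engine hooks `legal_moves`, `apply_move`, and `evaluate` are placeholders, not a real chess library): a pure imitation policy just reproduces familiar play, while even one ply of lookahead “mentally plays through” positions the training data never covered.

```python
# Hedged sketch of imitation vs. lookahead; the engine hooks are placeholders.
from typing import Callable, List, TypeVar

State = TypeVar("State")
Move = TypeVar("Move")

def imitation_move(policy: Callable[[State], Move], state: State) -> Move:
    # Supervised learner: reproduce whatever known human play the policy encodes.
    return policy(state)

def lookahead_move(
    state: State,
    legal_moves: Callable[[State], List[Move]],
    apply_move: Callable[[State, Move], State],
    evaluate: Callable[[State], float],
) -> Move:
    # Mentally play each possibility one ply ahead and keep the best outcome,
    # which can cope with positions that never appeared in the training data.
    return max(legal_moves(state), key=lambda m: evaluate(apply_move(state, m)))
```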
You point your math research generator at AI safety. It starts analyzing the graph of which programs will self-modify into which programs; which subgraphs are closed under successors; how you might define structure-preserving morphisms on graphs like this one; what existing category theory applies to the resulting kind of morphism. It finds the more general question of which agents achieve their goals in an environment containing which programs. (Self-modification is equivalent to instantiating such a program.) It finds a bunch of properties that such agents and programs can have: for example, tool AIs that help large classes of agents so long as those agents don’t ask questions that turn the tool AIs into agent AIs. And then you find that a theorem specializes to “You win if you run a program from this set,” and you run one.
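As a toy illustration of the kind of object this would be working with (the four-node graph and its names are invented for this example): programs are nodes, an edge means “may self-modify into”, and a set that is closed under successors is exactly the shape of a “you win if you run a program from this set” result.

```python
# Toy model (illustrative only): programs are nodes, an edge p -> q means
# p may self-modify into q. A set S is "closed under successors" if every
# successor of a member of S is also in S.
from typing import Dict, Set

def successors_closed(graph: Dict[str, Set[str]], candidate: Set[str]) -> bool:
    """Check whether `candidate` is closed under the successor relation."""
    return all(graph.get(p, set()) <= candidate for p in candidate)

def closure(graph: Dict[str, Set[str]], seeds: Set[str]) -> Set[str]:
    """Smallest successor-closed set containing `seeds` (everything reachable)."""
    result, frontier = set(seeds), list(seeds)
    while frontier:
        p = frontier.pop()
        for q in graph.get(p, set()):
            if q not in result:
                result.add(q)
                frontier.append(q)
    return result

# Invented example: a 'tool' stays a tool unless the wrong question turns it
# into an agent ('agentify').
graph = {
    "tool": {"tool", "agentify"},
    "safe_tool": {"safe_tool"},
    "agentify": {"agent"},
    "agent": {"agent"},
}
assert successors_closed(graph, {"safe_tool"})           # a winning set
assert not successors_closed(graph, {"tool"})            # can leak into agents
assert closure(graph, {"tool"}) == {"tool", "agentify", "agent"}
```

A real theorem would of course range over program semantics rather than a hand-written dict, but the closure check is the skeleton of the claim.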
ask a mathematician
Sure, I don’t see the solution to every question immediately; I have to turn the question over in my head, decompose it, and find equivalent questions until I hit one I can solve immediately. And the described generator could do the same, because it would generate the decompositions and rephrasings as lemmata, or as extra conjectures. We would of course need to keep the relevant theorems it has already proved in scope so it can apply them, via (very cheap) fine-tuning or Tetris-like context window packing. And yes, this gives whatever mesa-optimizer might be present an opportunity to entrench itself in the model.
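For the packing step, here is a purely illustrative greedy sketch (the theorem library, token counts, and relevance scores are all made up): pack the most relevant already-proved theorems into a fixed token budget, best relevance-per-token first.

```python
# Hypothetical sketch of "Tetris-like context window packing": greedily pack the
# most relevant previously proved theorems into a fixed token budget so the
# generator can apply them. Names and scores are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Theorem:
    statement: str
    tokens: int        # cost to include in the context window
    relevance: float   # e.g. similarity to the current conjecture

def pack_context(theorems: List[Theorem], budget: int) -> List[Theorem]:
    """Greedy packing by relevance per token; a knapsack heuristic, not optimal."""
    chosen, used = [], 0
    for thm in sorted(theorems, key=lambda t: t.relevance / t.tokens, reverse=True):
        if used + thm.tokens <= budget:
            chosen.append(thm)
            used += thm.tokens
    return chosen

library = [
    Theorem("closure lemma for self-modification graphs", tokens=120, relevance=0.9),
    Theorem("fixed-point theorem for reflective agents", tokens=300, relevance=0.7),
    Theorem("unrelated number-theory result", tokens=80, relevance=0.05),
]
print([t.statement for t in pack_context(library, budget=450)])
# -> ['closure lemma for self-modification graphs',
#     'fixed-point theorem for reflective agents']
```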