A dramatic advance in the theory of predicting the regret of RL agents. So given a bunch of assumptions about the properties of an environment, we could upper-bound the regret with high probability, and maybe have a way to tighten the bound as the agent learns about the environment. The theory would need to be flexible enough that it seems like it should keep giving reasonable bounds even when the agent is doing things like building a successor. I think most agent foundations research can be framed as trying to solve a sub-problem of this problem, a variant of this problem, or to understand the various edge cases.
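To be concrete about the shape of statement I have in mind, something like the generic form of a high-probability regret bound, with all the interesting content hidden inside the bound function $B$:

$$\mathrm{Regret}(T) \;=\; \max_{\pi}\,\mathbb{E}\Big[\textstyle\sum_{t=1}^{T} r_t^{\pi}\Big] \;-\; \sum_{t=1}^{T} r_t, \qquad \Pr\big[\,\mathrm{Regret}(T) \le B(T,\delta,\text{assumptions})\,\big] \ge 1-\delta,$$

where $B$ gets tighter as the assumptions about the environment get stronger, and ideally tightens further as the agent gathers information.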
If we can empirically test this theory in lots of different toy environments with current RL agents, and the bounds are usually pretty tight, then that'd be a big update for me. Especially if we can deliberately create edge cases that violate some of the assumptions, and predict from the violated assumptions when things will break.
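As a toy illustration of what "empirically test the bounds" could look like, here is a minimal sketch, using a stock multi-armed bandit and the classic UCB1 regret bound from Auer et al. (2002); the environment, agent, and bound are stand-ins for whatever the real theory would cover, not the theory itself.

```python
# Minimal sketch: compare the empirical regret of UCB1 on a Bernoulli bandit
# against the Auer et al. (2002) logarithmic regret bound. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.9, 0.8, 0.5])      # arm means (toy environment, assumed)
gaps = means.max() - means             # suboptimality gaps Delta_i
T = 20_000

counts = np.zeros(len(means))
sums = np.zeros(len(means))
regret = 0.0

for t in range(1, T + 1):
    if t <= len(means):                # play each arm once to initialise
        arm = t - 1
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < means[arm])
    counts[arm] += 1
    sums[arm] += reward
    regret += means.max() - means[arm]  # pseudo-regret of the chosen arm

# Auer et al. (2002), Theorem 1: expected regret after T plays is at most
#   sum_{i: Delta_i > 0} 8 ln(T) / Delta_i  +  (1 + pi^2 / 3) * sum_j Delta_j
bound = (8 * np.log(T) / gaps[gaps > 0]).sum() + (1 + np.pi**2 / 3) * gaps.sum()

print(f"empirical regret: {regret:.1f}, theoretical bound: {bound:.1f}")
```

In a toy setting like this the bound is loose by a constant factor but the logarithmic growth is visible; the hope above is for a theory whose bounds stay meaningful, and reasonably tight, in much richer environments.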
(although this might not bring doom below 25% for me; it also depends on race dynamics and the sanity of the various decision-makers).
Seems you’re left with outer alignment after solving this. What do you imagine doing to solve that?
We might have developed techniques to specify simple, bounded, object-level goals: goals that can be fully specified using very simple facts about reality, with no indirection or meta-level complications. If so, we can probably use inner-aligned agents to assist with some relatively well-specified engineering or scientific problems. Specification mistakes at that point could easily result in irreversible loss of control, so it's not the kind of capability I'd want lots of people to have access to.
To move past this point, we would need to make some engineering or scientific advances that would be helpful for solving the problem more permanently. Human intelligence enhancement would be a good thing to try. Maybe some kind of AI defence system to shut down any rogue AI that shows up. Maybe some monitoring tech that helps governments co-ordinate. These are basically the same as the examples given on the pivotal act page.
Is this even possible? Flexibility/generality seems quite difficult to get if you also want to predict the long-range effects of the agent's actions, since at some point you're just solving the halting problem. Imagine that the agent and environment together form some arbitrary Turing machine, and that halting gives low reward. Then we cannot tell in general whether it eventually halts. It also seems like we can't tell in practice whether complicated machines halt within a billion steps without simulation or complicated static analysis.
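(To make the reduction concrete, consider, say, the family of environments

$$\text{env}_M:\qquad r_t = \begin{cases} 1 & \text{if } M \text{ has not halted within } t \text{ steps,}\\ 0 & \text{otherwise,}\end{cases}$$

for an arbitrary Turing machine $M$; any procedure that could say, for every such environment, whether the achievable return over long horizons grows like $T$ or stays bounded would be deciding whether $M$ halts.)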
Yes, if you have a very high bar for assumptions or the strength of the bound, it is impossible.
Fortunately, we don’t need a guarantee this strong. One research pathway is to weaken the requirements until they no longer cause a contradiction like this, while maintaining most of the properties that you wanted from the guarantee. For example, one way to weaken the requirements is to require that the agent provably does well relative to what is possible for agents of similar runtime. This still gives us a reasonable guarantee (“it will do as well as it possibly could have done”) without requiring that it solve the halting problem.
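One illustrative way to write that weakened target (the notation here is just a sketch, not a standard definition) is to measure regret against a bounded class of competitors rather than the global optimum:

$$\mathrm{Regret}_{\mathcal{C}}(T) \;=\; \max_{\pi\in\mathcal{C}}\,\mathbb{E}\Big[\textstyle\sum_{t=1}^{T} r_t^{\pi}\Big] \;-\; \mathbb{E}\Big[\textstyle\sum_{t=1}^{T} r_t\Big],$$

where $\mathcal{C}$ is something like the set of policies computable within the agent's own time and space budget. Bounding $\mathrm{Regret}_{\mathcal{C}}$ never requires reasoning about an uncomputable optimum, so the halting-problem objection doesn't bite, while the guarantee still says the agent does about as well as anything with comparable resources could have.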