I think iterated amplification (IDA) is a plausible algorithm to use for training superhuman ML systems. The algorithm is still not really fleshed out: there are various instantiations, each unsatisfactory in one way or another, which is why this post describes it as a research direction rather than an algorithm.
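For concreteness, here is a minimal Python sketch of the amplify-and-distill loop at the core of IDA. Everything here is an illustrative placeholder (the decomposition procedure, the imitation-training step, and so on), not a reference to any existing implementation, and the unsettled design choices are exactly what's hidden inside those placeholders.

```python
# Minimal illustrative sketch of iterated amplification; every name here
# (human_decompose_and_answer, train_by_imitation, ...) is a placeholder,
# not an existing API.

def amplify(model, question, human_decompose_and_answer):
    """A human answers `question` by splitting it into subquestions and
    delegating each subquestion to the current model."""
    return human_decompose_and_answer(question, ask=model)

def iterated_amplification(model, questions, human_decompose_and_answer,
                           train_by_imitation, num_rounds=10):
    for _ in range(num_rounds):
        # Collect demonstrations from the amplified (human + model) system.
        demos = [(q, amplify(model, q, human_decompose_and_answer))
                 for q in questions]
        # Distill: train the model to imitate the amplified system's answers,
        # so the next round's subquestions get (hopefully) better help.
        model = train_by_imitation(model, demos)
    return model
```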
I think there are capability limits on models trained with IDA, which I tried to describe in more detail in the post Inaccessible Information. There are also limits on the size of the implicit tree you can really use, basically mirroring the limits on implicit debate trees explored in Beth’s post on Obfuscated Arguments (roughly speaking, we still think such trees can be arbitrarily big relative to the overseer, but it now seems like their size is bounded by the capability of your models). I had discussed some of these issues in pre-2018 writing, but it was not clear how much they’d force the algorithm to change fundamentally versus get tweaked around the edges.
These issues motivated Imitative Generalization, which is another algorithm for training superhuman ML systems. I see this as pretty continuous with IDA, and it rests on very similar assumptions. Its capabilities are also bounded by HCH in basically the same way.
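As a rough illustration of the idea, here is a hypothetical sketch of imitative generalization's objective: search for a human-understandable hypothesis z that a human finds plausible and that lets a human reproduce the known labels, then have humans (or a model imitating them) apply z to new inputs. The names are made up, and in practice z would be optimized with ML machinery rather than a literal argmax over candidates.

```python
# Hypothetical sketch of imitative generalization's objective. In practice z
# would be optimized with ML tooling, not enumerated; all names are placeholders.

def imitative_generalization(candidate_hypotheses, labeled_data, unlabeled_inputs,
                             human_prior, human_predict):
    def score(z):
        # How plausible does a human find z, plus how well does a human
        # using z reproduce the labels we already have?
        likelihood = sum(human_predict(z, x) == y for x, y in labeled_data)
        return human_prior(z) + likelihood

    best_z = max(candidate_hypotheses, key=score)
    # Generalize by having humans (or a model imitating them) apply best_z
    # to inputs that the labeled set never covered.
    return [(x, human_predict(best_z, x)) for x in unlabeled_inputs]
```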
That said, it’s also pretty clear that imitative generalization doesn’t handle every possible case (at least not without doing a lot of additional challenging work), and we’re now trying to zoom in on the hardest cases for methods like imitative generalization. This is something we’ll be writing about soon.
I don’t think I would call any of these things “paradigms.” They seem more like “training strategies,” each designed to align AI systems that we previously didn’t know how to align. The overall paradigm is basically what’s described in my methodology post (see the sketch after this list):
1. Propose a training strategy that looks like it could avert catastrophic misalignment in the cases identified so far.
2. Identify a new “case” in which that training strategy fails, i.e. a combination of facts about the empirical world, about what kind of thing SGD learns, etc., for which that training strategy would lead to catastrophic misalignment.
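Here is that loop written out as a crude sketch; the "functions" stand in for what is really a human research process, so this is purely illustrative.

```python
# A crude sketch of the methodology loop above; the callables stand in for
# what is really a human research process, so everything here is illustrative.

def alignment_methodology(initial_strategy, find_failure_case, revise_strategy):
    strategy = initial_strategy
    known_cases = []  # failure cases identified so far
    while True:
        # Step 2: search for a new case (facts about the world, about what
        # SGD learns, etc.) in which the current strategy fails catastrophically.
        case = find_failure_case(strategy, known_cases)
        if case is None:
            return strategy  # no known failure case: current best candidate
        known_cases.append(case)
        # Step 1: propose a strategy that averts catastrophe in every case so far.
        strategy = revise_strategy(strategy, known_cases)
```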
A different approach to avoiding the limits of IDA is recursive reward modeling (RRM), which uses evaluations-in-hindsight: the learned policy is free to leverage intuitions or capabilities that humans couldn’t understand in order to take actions that the overseer couldn’t have recognized as good with foresight but which have good-looking consequences. This lets the ML be smarter, but it introduces additional safety concerns, since you now need to ensure that a collection of weaker agents can keep a stronger agent in check (and if you fail, you face catastrophic risk). In practice you’d probably combine this with evaluations-in-advance in order to identify any predictably dangerous activities, so you only really have trouble if a strong agent can overtake slightly weaker agents using a plan that doesn’t even look dangerous in advance.
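To make the contrast with IDA concrete, here is a hypothetical sketch of RRM with evaluations-in-hindsight: the policy acts first, assisted humans judge the observed consequences afterwards, and a reward model fit to those judgments is what the policy is optimized against. None of these names correspond to a real library; the point is just to show where the hindsight evaluation enters.

```python
# Hypothetical sketch of RRM with evaluations-in-hindsight; every name here
# (rollout, assisted_human_evaluate, fit_reward_model, ...) is a placeholder.

def rrm_with_hindsight(policy, environment, assistants,
                       assisted_human_evaluate, fit_reward_model,
                       optimize_policy, num_iterations=10):
    for _ in range(num_iterations):
        # The current policy acts; its behavior is only judged after the fact.
        trajectories = [environment.rollout(policy) for _ in range(64)]
        # Humans, helped by weaker assistant agents trained in earlier rounds,
        # evaluate the observed consequences in hindsight.
        labels = [assisted_human_evaluate(traj, assistants) for traj in trajectories]
        # Fit a reward model to those hindsight judgments and optimize the
        # policy against it, potentially making the policy stronger than any
        # individual evaluator.
        reward_model = fit_reward_model(trajectories, labels)
        policy = optimize_policy(policy, environment, reward_model)
    return policy
```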
I’d say that RRM is using a different research paradigm: it’s fairly clear that there are possible situations where RRM breaks down, but it seems quite plausible that those will only occur long after AI has fundamentally changed the game. In my own research I’m not comfortable leaning on that kind of empirical contingency, but that’s just a methodological choice on my part; most people care more about empirically investigating whether their algorithm actually works in the real world (rather than understanding whether there is any case in which it goes badly).