I realize this accidentally sounds like it's saying two things at once (that autonomous learning relies on the generator-discriminator gap of the domain, and also that it relies on the gap for the specific agent (or system in general)). I think it's the agent's capabilities that matter, that the domain determines how likely the agent is to have a persistent gap between generation and discrimination, and I don't think the (basic) dynamics are too difficult.
You start with a model M and an initial data distribution D. You train M on D such that M is now a model of D. You can now sample from M, and those samples will (roughly) have whatever range of capabilities was to be found in D.
Now, suppose you have some classifier, C, which is able to usefully distinguish samples from M on the basis of that sample’s specific level of capabilities. Note that C doesn’t have to just be an ML model. It could be any process at all, including “ask a human”, “interpret the sample as a computer program trying to solve some problem, run the program, and score the output”, etc.
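To make that last example concrete, here's a minimal sketch of such a non-ML classifier: it treats each sample as Python source that's supposed to define a function, runs it, and scores it against test cases. The `solve` convention and `TEST_CASES` are made up purely for illustration, not from any particular setup.

```python
# Minimal sketch of a non-ML classifier C: interpret the sample as a program,
# run it, and score it by how many test cases it passes.
# Assumes (for illustration) that a valid sample defines a function `solve`.

TEST_CASES = [((2, 3), 5), ((10, -4), 6)]  # (args, expected_output) pairs

def score_sample(sample_code: str) -> float:
    """Fraction of test cases the sampled program passes (0.0 if it fails to run)."""
    namespace = {}
    try:
        exec(sample_code, namespace)
        solve = namespace["solve"]
        passed = sum(1 for args, expected in TEST_CASES if solve(*args) == expected)
        return passed / len(TEST_CASES)
    except Exception:
        return 0.0

# e.g. score_sample("def solve(a, b): return a + b") -> 1.0
```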
Having C allows you to sample from a version of M's output distribution that has been "updated" on C, by repeatedly sampling from M and keeping only the samples that score well on C. This lets you create a new dataset D', which you can then train M' on to produce a model of the updated distribution.
So long as C is able to provide classification scores which actually reflect a higher level of capabilities among the samples from M / M' / M″ / etc., you can repeat this process to continually crank up the capabilities. If your classifier C was some finetune of M, then you can even create a new C' off of M', and potentially improve the classifier along with your generator. In most domains, though, classifier scores will eventually begin to diverge from the qualities that actually make an output good / high capability, and at that point you stop benefiting from this process.
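As a toy numeric illustration of that filter-and-retrain loop (not a real training setup): pretend M is just a Gaussian over a one-dimensional "capability" score, C keeps the top slice of each batch, and "training" M' means refitting the Gaussian to the filtered dataset D'. The mean capability ratchets upward each round, with diminishing gains as the distribution narrows:

```python
import random

# Toy illustration only: no real models, just a Gaussian standing in for M.

def sample(model, n):
    mean, std = model
    return [random.gauss(mean, std) for _ in range(n)]

def train(samples):
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return (mean, max(var ** 0.5, 1e-3))  # refit mean/std to the filtered data

model = (0.0, 1.0)  # initial M, fit to D
for step in range(5):
    xs = sample(model, 1000)                 # sample from M
    cutoff = sorted(xs)[800]                 # C: keep roughly the top 20%
    d_next = [x for x in xs if x >= cutoff]  # the "updated" dataset D'
    model = train(d_next)                    # M' is a model of D'
    print(step, round(model[0], 2))          # mean capability drifts upward
```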
This process goes further in domains where it's easier to distinguish generations by their quality. Chess and other board games are extreme outliers in this regard, since you can always tell which of two players actually won the game. Thus, the game rules act as a (pairwise) infallible classifier of relative capabilities. There's some slight complexity around that last point, since a given trajectory could falsely appear good by beating an even worse / non-representative policy, but modern self-play approaches address such issues by testing model versions against a variety of opponents (mostly past versions of themselves) to ensure continual real progress. Pure math is another similarly skewed domain, where building a robust verifier (i.e., a classifier) of proofs is easy. That's why Steven was able to use it as a valid example of where self-play gets you very far.
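Here's a rough sketch of that opponent-pool evaluation idea: rate a candidate policy by its win rate against a pool of (mostly frozen, past) versions of itself, and only count it as real progress if it beats the pool convincingly. `play_game` is a placeholder for whatever actually runs a match between two policies; none of these names come from a real library.

```python
# Sketch: evaluate against a pool of opponents rather than a single rival,
# so a candidate can't look good just by beating one weak policy.
# play_game(a, b) is assumed to run one match and return 1 if `a` wins, else 0.

def evaluate(candidate, opponent_pool, play_game, games_per_opponent=20):
    wins = total = 0
    for opponent in opponent_pool:
        for _ in range(games_per_opponent):
            wins += play_game(candidate, opponent)
            total += 1
    return wins / total

def maybe_promote(candidate, opponent_pool, play_game, win_threshold=0.55):
    # Only treat the candidate as real progress (and add it to the pool of
    # future opponents) if it beats the existing pool convincingly.
    if evaluate(candidate, opponent_pool, play_game) >= win_threshold:
        opponent_pool.append(candidate)
        return True
    return False
```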
Most important real-world domains do not work like this. E.g., if there were a robust, easy-to-query process that could classify which of two scientific theories / engineering designs / military strategies / etc. was actually better, the world would look extremely different.
Thank you, this is helpful for me in thinking further about this. The first paragraph seems almost right, except that instead of the single agent, what you care about is the best trainable or available agent, since the two agents (M and C) need not be the same? What you get from this is an M that maximizes C, right? And the issue, as you note, is that in most domains a predictor trained against your best available C is going to plateau, so it comes down to whether having M gives you the ability to create a C' that lets you move 'up the chain' of capability here, while preserving any necessary properties at each transition, including alignment. But M will inherit any statistical or other flaws in, or ways to exploit, C, in ways we don't have any reason to presume we can 'rescue ourselves from' in later iterations, and would instead expect to amplify over time?
(And thus, you need a security-mindset-level-robust-to-M C at each step for this to be a safe strategy to iterate on à la Christiano or Leike, and you should mostly only expect to get that in rare domains like chess, rather than expecting C to win the capabilities race in general? Or something like that? Again, comment-level rough here.)