Here’s a particularly nice concrete example of the first thing here that you can test concretely right now (thanks to (edit: Jacob Pfau and) Ethan Perez for this example): give a model a prompt full of examples of it acting poorly. An agent shouldn’t care and should still act well regardless of whether it’s previously acted poorly, but a predictor should reason that probably the examples of it acting poorly mean it’s predicting a bad agent, so it should continue to act poorly.
Generalizing this point, a broader differentiating factor between agents and predictors is: You can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions, and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way.
It’s harder to prevent an agent from specification gaming / doing arbitrary optimization whereas a predictor has a disincentive against specification gaming insofar as the in-context demonstration provides evidence against it. I think of this distinction as the key differentiating factor between agents and simulated agents; also to some extent imitative amplification and arbitrary amplification
Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.
Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.
Here’s a particularly nice concrete example of the first thing here that you can test concretely right now (thanks to (edit: Jacob Pfau and) Ethan Perez for this example): give a model a prompt full of examples of it acting poorly. An agent shouldn’t care and should still act well regardless of whether it’s previously acted poorly, but a predictor should reason that probably the examples of it acting poorly mean it’s predicting a bad agent, so it should continue to act poorly.
Generalizing this point, a broader differentiating factor between agents and predictors is: You can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions, and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way.
It’s harder to prevent an agent from specification gaming / doing arbitrary optimization whereas a predictor has a disincentive against specification gaming insofar as the in-context demonstration provides evidence against it. I think of this distinction as the key differentiating factor between agents and simulated agents; also to some extent imitative amplification and arbitrary amplification
Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.
Edited!