How are you getting the connection between the legible property the evaluator is selecting for and actual alignment?
Quoting from another comment (not sure if this is frowned upon):
1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the human’s true objective through a fixed set of positive examples (initial behaviors or outcomes specified by humans) and a growing set of increasingly nuanced negative examples (specified by the improving agent).
As it stands, this seems like a way to train a capable agent that’s hyperspecialized on some particularly legible goal.
I’m not entirely sure what you mean by legible. Do you mean a deterministic reward model which runs on a computer, even though it might have a gazillion parameters? As in, legible with respect to the human’s objective?
Or to color your thinking a little more, how is the evaluator going to interact with humans, learn about them, and start modeling what they want?
In this scheme, the evaluator is not actively interacting with humans, which does seem like a shortcoming in most respects. The main source of information it gets to use in modeling what humans want is the combination of the initial positive examples and the ever trickier negative examples posed by the agent. Hm, that gets me thinking about complementing the agent as a source of negative examples with, among other things, CIRL-style querying of humans for new positive examples.
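To make that information flow concrete, here’s a minimal toy sketch of the loop as I understand it (the particular models and names are my own illustration, not part of the proposal); whether a loop like this actually converges on the hidden objective is of course exactly the open question:

```python
# Toy sketch only: "behaviors" are 2-D points, the hidden human objective
# prefers points near TARGET, the fixed positives are sampled from that
# region, and the evaluator is just a small classifier fit to positives
# vs. the accumulated agent-specified negatives.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
TARGET = np.array([2.0, 2.0])                              # stand-in for the illegible objective
positives = TARGET + 0.3 * rng.standard_normal((200, 2))   # fixed human-specified positive examples
negatives = rng.uniform(-4.0, 4.0, size=(200, 2))          # the agent's first, naive proposals

evaluator = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)

def agent_propose(evaluator, n=200, steps=50):
    """Agent step: hill-climb proposals toward whatever the evaluator currently scores highly."""
    proposals = rng.uniform(-4.0, 4.0, size=(n, 2))
    for _ in range(steps):
        candidates = proposals + 0.1 * rng.standard_normal(proposals.shape)
        better = (evaluator.predict_proba(candidates)[:, 1]
                  > evaluator.predict_proba(proposals)[:, 1])
        proposals[better] = candidates[better]
    return proposals

for league_round in range(10):
    # Step 2 of the quoted scheme: align the evaluator to the fixed positives
    # vs. the growing set of agent-specified negatives.
    X = np.vstack([positives, negatives])
    y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
    evaluator.fit(X, y)

    # Step 1: align the agent to the evaluator; its best proposals become the
    # next round's "increasingly nuanced" negative examples.
    # (CIRL-style querying for fresh positives could be slotted in here too.)
    proposals = agent_propose(evaluator)
    negatives = np.vstack([negatives, proposals])

    # Track how far the agent's proposals are from the hidden objective;
    # nothing here guarantees this distance shrinks.
    print(league_round, np.linalg.norm(proposals.mean(axis=0) - TARGET))
```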
“Legible” in the sense of easy to measure. For example, “what makes the human press the Like button” is legible. On the other hand, “what are the human’s preferences” is often illegible.
The AI’s inferred human preferences are typically latent variables within a model of humans. Not just any model will do; we have to somehow get the AI to model humans in a way that mostly satisfies our opinions about what our preferences are and what good reasoning about them is.
Hm, I think I get the issue you’re pointing at. I guess the argument for the evaluator learning accurate human preferences in this proposal is that it can make use of infinitely many examples of inaccurate human preferences supplied by the agent as negative examples. However, the argument against can be summed up in the following comment of Adam’s:
I get the impression that with Oversight Leagues, you don’t necessarily consider the possibility that there might be many different “limits” of the oversight process, that are coherent with the initial examples. And it’s not clear you have an argument that it’s going to pick one that we actually want.
Or in your terms:
Not just any model will do
I’m indeed not sure whether the agent’s pressure would force the evaluator all the way to accurate human preferences. The fact that GANs get significantly closer to the illegible distributions they model, and away from random noise, while only ever following a legible objective feels like evidence for; the fact that they still produce artifacts feels like evidence against. Also, I’m not sure how GANs compare to purely generative models trained on the positive examples alone (e.g. VAEs), which would be relevant data on whether the adversarial regime actually helps point at the underlying distribution.
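For reference, the precise sense in which a GAN’s legible objective points at the illegible distribution (this is the standard result from Goodfellow et al. 2014; whether it transfers to the evaluator/agent setting is exactly what’s in question here): the discriminator only ever sees the classification loss

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],$$

yet for a fixed generator the optimal discriminator is $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$, and the unique global equilibrium has $p_g = p_{\text{data}}$. So in the idealized limit the legible objective is optimized exactly when the generator matches the illegible target distribution; the artifacts point is essentially that real training never reaches that limit.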
Thanks a lot for the feedback!