Under this view, alignment isn’t a property of reward functions: it’s a property of a reward function in an environment. This problem is much, much harder: we now have the joint task of designing a reward function such that the best way of stringing together favorable observations lines up with what we want. This task requires thinking about how the world is structured, how the agent interacts with us, the agent’s possibilities at the beginning, how the agent’s learning algorithm affects things…
I think there are ways of doing this that don’t involve explicitly working through what observation sequences lead to good outcomes. AFAICT this was originally outlined in Model Based Rewards quite a while ago. Essentially, the idea is to make the reward (or even better, utility) a function of the agent’s internal model of the world. Then, when the agent goes to make a decision, it compares the utility of the worlds where it does and does not take an action. Doing things this way has a couple of nice properties, including eliminating the incentive to wirehead and making it possible to specify utilities over possible worlds rather than just over what the AI sees.
The relevant point, however, is that it takes the problem of trying to pin down what chains of events lead to good outcomes and splits it into two problems: identifying good and bad worldstates in the agent’s model, and building an accurate model of the world. This works because an agent with an accurate model of the world will be able to figure out what sequence of actions and observations leads to any given worldstate.
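To make the split concrete, here is a toy sketch of the difference between scoring what the agent observes and scoring the worldstate its model predicts. Everything in it (the `WorldState` fields, the `predict` model, the three actions) is made up for illustration and is not taken from Model Based Rewards; it assumes the agent already has an accurate model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorldState:
    diamonds: int        # what we actually care about
    sensor_reading: int  # what the agent observes as its reward signal


ACTIONS = ["noop", "mine", "wirehead"]


def predict(state: WorldState, action: str) -> WorldState:
    """Hypothetical world model: the predicted next state for an action."""
    if action == "mine":
        new_count = state.diamonds + 1
        return replace(state, diamonds=new_count, sensor_reading=new_count)
    if action == "wirehead":
        # Tampering changes the sensor, not the number of diamonds.
        return replace(state, sensor_reading=10**6)
    return state  # "noop" leaves the world unchanged


def observation_reward(state: WorldState) -> float:
    """Reward defined on what the agent sees."""
    return float(state.sensor_reading)


def model_utility(state: WorldState) -> float:
    """Utility defined on the modelled worldstate itself."""
    return float(state.diamonds)


def choose(state: WorldState, score) -> str:
    """Pick the action whose predicted worldstate scores highest."""
    return max(ACTIONS, key=lambda a: score(predict(state, a)))


start = WorldState(diamonds=0, sensor_reading=0)
print(choose(start, observation_reward))  # wirehead
print(choose(start, model_utility))       # mine
```

The two agents share the same model and the same decision rule; only the thing being scored differs. The hard parts the thread goes on to discuss, learning `predict` and locating `diamonds` inside a learned representation, are simply assumed away here.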
I feel somewhat pessimistic about doing this robustly enough to scale to AGI. From an earlier comment of mine:
It isn’t obvious to me that specifying the ontology is significantly easier than specifying the right objective. I have an intuition that ontological approaches are doomed. As a simple case, I’m not aware of any fundamental progress on building something that actually maximizes the number of diamonds in the physical universe, nor do I think that such a thing has a natural, simple description.
I’m personally far more optimistic about ontology identification. Work in representation learning, blog posts such as OpenAI’s sentiment neuron, and style transfer all indicate that it’s at least possible to point at human-level concepts in a subset of world models. Figuring out how to refine these learned representations to further correspond with our intuitions, and figuring out how to rebind those concepts to representations in more advanced ontologies, are both neglected areas, but neither problem seems fundamentally intractable.
I wasn’t aware of that work, thanks for linking! It’s true that we don’t have to specify the representation; instead, we can learn it. Do you think we could build a diamond maximizer using those ideas, though? The concern here is that the representation has to cleanly demarcate what we think of as diamonds, if we want the optimal policy to entail actually maximizing diamonds in the real world. This problem tastes like it has a bit of that ‘fundamentally intractable’ flavor.
Do you think we could build a diamond maximizer using those ideas, though?
They’re definitely not sufficient, almost certainly. A full-fledged diamond maximizer would need far more machinery, if only to do the maximization and properly learn the representation.
The concern here is that the representation has to cleanly demarcate what we think of as diamonds.
I think this touches on a related concern, namely goodharting. If we even slightly misspecify the utility function at the boundary and the AI optimizes in an unrestrained fashion, we’ll end up with weird situations that are totally decorrelated from what we were initially trying to get the AI to optimize.
If we don’t solve this problem, I agree, the problem is extremely difficult at best and completely intractable at worst. However, if we can rein in goodharting, then I don’t think things are intractable.
To make the point, I think the problem of an AI goodharting a representation is very analogous to the problems being tackled in the field of adversarial perturbations for image classification. In this case, the “representation space” is the image itself. The boundaries are classification boundaries set by the classifying neural network. The optimizing AI that goodharts everything is usually just some form of gradient descent.
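A toy version of that analogy, as a rough sketch: the two-dimensional input, the `PROTOTYPE` point standing in for “what we actually mean”, and the hand-set linear classifier are all assumptions chosen for illustration, not anything from the adversarial examples literature. The classifier is reasonable near the prototype, but plain gradient ascent on its score walks the input off to a region where the true score is essentially zero.

```python
import math

PROTOTYPE = (1.0, 1.0)    # hypothetical point that really is a "diamond"
W, B = (1.0, 1.0), -1.0   # hypothetical learned linear classifier


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def true_score(x) -> float:
    """What we actually want: closeness to the prototype."""
    d2 = (x[0] - PROTOTYPE[0]) ** 2 + (x[1] - PROTOTYPE[1]) ** 2
    return math.exp(-d2)


def proxy_score(x) -> float:
    """What the optimizer sees: the classifier's confidence."""
    return sigmoid(W[0] * x[0] + W[1] * x[1] + B)


def ascend_proxy(x, steps: int = 500, lr: float = 1.0):
    """Unrestrained gradient ascent on the classifier's confidence."""
    for _ in range(steps):
        p = proxy_score(x)
        slope = p * (1.0 - p)  # derivative of the sigmoid w.r.t. its input
        x = (x[0] + lr * slope * W[0], x[1] + lr * slope * W[1])
    return x


x_opt = ascend_proxy((0.0, 0.0))
print("proxy:", proxy_score(x_opt))  # close to 1
print("true: ", true_score(x_opt))   # close to 0
```

The optimizer never crosses back over the decision boundary, so from the classifier’s point of view nothing has gone wrong. The failure only shows up when you compare against `true_score`, which the optimizer never gets to see.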
However, the field of adversarial examples seems to indicate that it’s possible to at least partially overcome this form of goodharting and, by analogy, the goodharting that we would see with a diamond maximiser. IMO, the most promising and general solution is to be more Bayesian and keep track of the uncertainty associated with each class label. By keeping track of that uncertainty, it’s possible to avoid class boundaries altogether and optimize towards regions of the space that are more likely to belong to the desired class.
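One crude way to picture “being more Bayesian” here is to optimize the average over several hypotheses that all fit the training data, instead of a single classifier’s score. The sketch below assumes a one-dimensional input, a hand-picked three-member ensemble with equal weights, and a grid search; all of these are illustrative stand-ins, not a method from any particular paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def half_line(x: float) -> float:
    """Hypothesis 1: everything to the right of x = 1 is a diamond."""
    return sigmoid(2.0 * (x - 1.0))


def bump(upper: float):
    """Hypotheses 2 and 3: diamonds only live between 1 and `upper`."""
    return lambda x: half_line(x) * sigmoid(2.0 * (upper - x))


# Hypothetical posterior: three equally weighted members that roughly
# agree between x = 1 and x = 2.5 but disagree further out.
ENSEMBLE = [half_line, bump(3.0), bump(2.5)]
GRID = [i * 0.25 for i in range(41)]  # candidate inputs from 0 to 10


def posterior_mean(x: float) -> float:
    """Average probability of 'diamond' across the ensemble."""
    return sum(member(x) for member in ENSEMBLE) / len(ENSEMBLE)


best_single = max(GRID, key=half_line)      # trusts one classifier
best_bayes = max(GRID, key=posterior_mean)  # accounts for disagreement

print("single classifier picks:", best_single)
print("ensemble average picks: ", best_bayes)
```

The single classifier is happiest at the far edge of the grid, where two of the three hypotheses say there is no diamond at all. The averaged score instead peaks inside the region where every member agrees, which is the “optimize towards regions that are more likely to be part of the class” behaviour described above.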
I can’t seem to dig it up right now, but I once saw a paper where they developed a robust classifier. When they used SGD to change a picture from being classified as a cat to being classified as a dog, the result was that the underlying image went from looking like a cat to looking like a dog. By analogy, a diamond maximizer with a robust classification of diamonds in its representation should actually produce diamonds.
Overall, adversarial examples seem to be a microcosm for evaluating this specific kind of goodharting. My optimism that we can do robust ontology identification is tied to the success of that field, but at the moment the problem doesn’t seem to be intractable.
They’re definitely not sufficient, almost certainly. A full-fledged diamond maximizer would need far more machinery, if only to do the maximization and properly learn the representation.
Clarification: I meant (but inadequately expressed) “do you think any reasonable extension of these kinds of ideas could get what we want?” Obviously, it would be quite an unfair demand for rigor to ask whether we can do the thing right now.
Thanks for the great reply. I think the remaining disagreement might boil down to the expected difficulty of avoiding Goodhart here. I do agree that using representations is a way around this issue, and it isn’t the representation learning approach’s job to simultaneously deal with Goodharting.
do you think any reasonable extension of these kinds of ideas could get what we want?
Conditional on avoiding Goodhart, I think you could probably get something that looks a lot like a diamond maximiser. It might not be perfect: the situation with the “most diamond” might not be the maximum of its utility function, but I would expect the maximum of its utility function to still contain a very large amount of diamond. For instance, depending on the representation and the way the programmers baked in the utility function, it might have a quirk of only recognizing something as a diamond if it’s stereotypically “diamond shaped”. This would bar it from just building pure carbon planets to achieve its goal.
IMO, you’d need something else outside of the ideas presented to get a “perfect” diamond maximizer.
The field of adversarial examples started when people noticed that even tiny, imperceptible perturbations to an image in one class could fool a classifier into thinking it was an image from another class. The interesting thing is that when you take this further, you get deep dreaming and inceptionism. The Lovecraftian dog-slugs that arise from the process are a result of the local optimization properties of SGD combined with the flaws of the classifier. This, I think, is analogous to goodharting in the case of a diamond maximiser with a learnt ontology: the AI will do something weird and become convinced that the world is full of diamonds. Meanwhile, if you ask a human about the world it created, “Lovecraftian” will probably precede “diamond” in the description.