From Paul:
I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.”
The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.
My answers to Wei were twofold. The first was that, if benignity is established, it’s possible to safely tinker with the setup until, hopefully, “answers that look good to a human” resemble actually good answers (we never quite reached agreement about this). The second was an example of an extended setup (one has to read the parent comments to understand it) which would potentially be much more likely to yield actually good answers; I think we agree about this approach.
My original idea when I started working on this is, in fact, also an answer to this concern. It’s not in the paper because I pared the paper down to a minimum viable product.
Construct an “oracle” by defining “true answers” as follows: answers which help a human make accurate predictions on a randomly sampled prediction task.*
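Concretely, one episode would have roughly this shape (a bare-bones sketch; every function here is a hypothetical stub of my own, not anything from my notes, and the details I actually remember are spelled out below):

```python
import random

random.seed(0)

def bomai_answer(question: str) -> str:
    """Stub for the boxed agent's answer to the operator's query."""
    return "stub answer"

def sample_prediction_task() -> str:
    """Stub for drawing a random 'True or false: ...' prediction task."""
    return "True or false: ..."

def operator_predict(task: str, answer: str) -> float:
    """Stub for the operator's credence in the task, formed with the answer in hand."""
    return random.random()

def ground_truth(task: str) -> bool:
    """Stub for the outcome, logged later in the outside world."""
    return random.random() < 0.5

def score(p: float, outcome: bool) -> float:
    """Placeholder scoring rule; see the note on scoring rules below."""
    return p if outcome else 1.0 - p

def run_episode(question: str) -> float:
    answer = bomai_answer(question)      # the "oracle" answers the query
    task = sample_prediction_task()      # operator then receives a random prediction task
    p = operator_predict(task, answer)   # operator predicts before the episode ends
    outcome = ground_truth(task)         # graded afterwards, in the outside world
    return score(p, outcome)             # the score becomes the reward

print(run_episode("What approaches might work for curing cancer?"))
```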
I figured out that I needed a box, and everything else in this setup, and then I realized that the setup could be applied to a normal reinforcement learner just as easily as to this oracle, so I simplified the approach.
I honestly need to dig through notes from last year, but my recollection is this: the operator receives an answer to a query, and then gets a random prediction task, which he has to make a prediction about before leaving the box. Later, the prediction is scored, and the score is converted into a reward for BoMAI. BoMAI has a model class for how the prediction is scored; each such model outputs a value for the ground truth. In all of these models, the ground truth doesn’t depend on BoMAI’s answer (that is, the model isn’t given read access to BoMAI’s answer). So the prediction task can involve the prediction of outside-world events, and the ground truth can be logged from the outside world, because BoMAI doesn’t conceive of its answer having a causal impact on the copy of the world which provides the ground truth for the prediction tasks.

For example, the prediction task might be sampled from {“True or false: hexalated kenotones will suppress activity of BGQ-1”, “True or false: fluorinating random lysines in hemoglobin will suppress activity of BGQ-1”, etc.} (half of those terms are made up). After the episode, the prediction can be graded in the outside world. With the obvious scoring rule, the oracle would just say “I don’t care how plausible it sounds; whatever they ask you, just say it’s not going to work. Most things don’t.” With a better scoring rule, I would expect it to give accurate information in a human-understandable format.
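To make the scoring-rule point concrete, here is a small self-contained toy comparison (entirely my own construction; the base rate and the specific credences are assumptions, not from the paper or my notes). Under plain 0/1 accuracy, an answer that sharpens the operator’s credences without flipping his best guess is worth nothing, so “just say it’s not going to work” is already near-optimal; under a proper scoring rule like the log score, any genuine sharpening of the operator’s probabilities pays.

```python
import math

BASE_RATE = 0.05  # assumed fraction of sampled "will X work?" tasks where X actually works

def zero_one(p, outcome):
    """The 'obvious' scoring rule: 1 if the operator's best guess matches the outcome, else 0."""
    return float((p >= 0.5) == outcome)

def log_score(p, outcome):
    """A proper scoring rule: the log of the probability the operator assigned to what happened."""
    p = min(max(p, 1e-9), 1 - 1e-9)
    return math.log(p if outcome else 1 - p)

def expected_score(scoring_rule, p_when_true, p_when_false):
    """Operator's expected score given the credences they end up with in each case."""
    return (BASE_RATE * scoring_rule(p_when_true, True)
            + (1 - BASE_RATE) * scoring_rule(p_when_false, False))

for name, rule in [("0/1 accuracy", zero_one), ("log score", log_score)]:
    # Uninformative oracle ("just say it's not going to work"): credence stays at the base rate.
    uninformed = expected_score(rule, BASE_RATE, BASE_RATE)
    # Informative but not decisive answer: credence rises to 0.30 when the intervention
    # would in fact work, and falls to 0.01 when it wouldn't (numbers are illustrative).
    informed = expected_score(rule, 0.30, 0.01)
    print(f"{name}: the informative answer is worth {informed - uninformed:+.3f} per task")
```

(The log score is unbounded below, so presumably it would have to be clipped or rescaled before being converted into BoMAI’s reward; the point is just that a proper scoring rule makes genuinely informative answers worth giving.)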
I haven’t thought about this in a while, and I was honestly worse at thinking about alignment at that point in time, so I don’t mean to convey much confidence that this approach works out. What I do think it shows, alongside the idea I came up with in the conversation with Wei, linked above, is that this general approach is powerful and amenable to improvement in ways that render it even more useful.
* A more recent thought: as described, “oracle” is not the right word for this setup. It would respond to “What approaches might work for curing cancer?” with “Doesn’t matter. There are more gaps in your knowledge regarding economics. A few principles to keep in mind…” However, if the prediction task distribution were conditioned in some way on the question asked, one might be able to make it more likely that the “oracle” answers the question, rather than just spewing unrelated insight.
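As a toy illustration of that conditioning idea (the topic labels, keyword matching, and task pool here are all hypothetical stand-ins of my own, not part of the original setup):

```python
import random

# Hypothetical pool of prediction tasks, grouped by topic.
TASKS_BY_TOPIC = {
    "medicine": [
        "True or false: hexalated kenotones will suppress activity of BGQ-1",
        "True or false: fluorinating random lysines in hemoglobin will suppress activity of BGQ-1",
    ],
    "economics": [
        "True or false: raising interest rates will reduce inflation within a year",
    ],
}

# Crude keyword-based mapping from a question to topics.
TOPIC_KEYWORDS = {
    "medicine": {"cancer", "drug", "cure", "curing"},
    "economics": {"inflation", "market", "interest"},
}

def sample_task(question: str) -> str:
    """Prefer tasks whose topic matches the question; fall back to a uniform draw."""
    q_words = set(question.lower().split())
    matching = [topic for topic, kws in TOPIC_KEYWORDS.items() if kws & q_words]
    topic = random.choice(matching) if matching else random.choice(list(TASKS_BY_TOPIC))
    return random.choice(TASKS_BY_TOPIC[topic])

print(sample_task("What approaches might work for curing cancer?"))
```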