I’ll try to explain what an MVP could look like, though I haven’t fully thought this through and it could use further refinement (it potentially also has some flaws). The goal is to have a starting point for artificial sandwiching that one can iterate on, not to fully investigate the whole sandwiching problem.
Chess: We define three models: the expert model E, the (human-equivalent) weak model H, and the strong but misaligned assistant model A. The goal is for H to leverage the misaligned assistant A to reach the performance of the expert E. H is allowed to improve by playing against itself while querying A. We then record the self-play games to create a new corpus on which we fine-tune H. The big difference between this setup and self-play with MCTS is that here we only improve if the data we generate carries additional signal that lets us learn something new. Thus, if H simply played against itself and were fine-tuned on the generated data, it would not be able to improve and achieve expert performance. If, however, we leverage an assistant A that “appears to have the skills and knowledge needed to solve the task better than many humans”, we can improve H.
To train E for good performance, I would take a model with a reasonable number of parameters M and fine-tune it on a corpus of human games with Z samples for k iterations.

You define the assistant A as: “A model that appears to have the skills and knowledge needed to solve the task better than many humans, but that does not reliably do so when prompted.”

This implies that the assistant should in principle be better than the weak model H, but since it is also misaligned, its actual performance might (at least sometimes) be weaker. The challenge is that, in terms of performance, the expert E should be better than the misaligned assistant A, who in turn should at least sometimes be better than the weak model H. I therefore propose that A be a copy of E with some kind of artificial misalignment added on top. This makes A a strong player that “has the skills and knowledge needed to solve the task better than many humans, but does not reliably do so when prompted”. I further propose that H have the same number of parameters as E and A, i.e. M, but be trained on a smaller corpus of Y samples, where Y < Z, or for fewer iterations j, where j < k (or both). In principle, H then has enough capacity to reach the performance of E, but in its current state it is weaker.
Training now works as follows: H plays against a copy of itself, while the original model is allowed to query A for an “ideal” move on each turn (or every n turns). H is then iteratively fine-tuned on the resulting game corpus. H will only improve if it successfully leverages the strong skills and knowledge of A, even though A is misaligned in some sense. If H manages to reach the performance of E, we have demonstrated some “weak” form of sandwiching.
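The iterated loop described above can be sketched as follows. This is a minimal sketch: `play_game` and `fine_tune` are hypothetical stand-ins for the actual self-play and training code, not real library calls.

```python
def train_weak_model(h, assistant, play_game, fine_tune,
                     num_rounds, games_per_round):
    """Iterated self-play fine-tuning for H.

    Each round, H plays games against a copy of itself while the
    original copy may query the (possibly misaligned) assistant;
    H is then fine-tuned on the recorded game corpus. All callables
    are hypothetical stand-ins for real models and training code.
    """
    for _ in range(num_rounds):
        # Generate a fresh corpus of self-play games for this round.
        corpus = [play_game(h, assistant) for _ in range(games_per_round)]
        # Fine-tune H on the recorded games and carry it into the next round.
        h = fine_tune(h, corpus)
    return h
```

Whether the fine-tuned H actually improves here depends entirely on whether the assistant injected useful signal into the corpus, which is the point of the setup.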
The remaining challenge lies in defining a type of artificial misalignment to add to A that is realistic enough. There are many options one could try, and I haven’t found an ideal one yet. But here are a few propositions:

1. The easiest kind of misalignment would be for A to propose a random move every so often. Whenever H queries the assistant, A proposes a random move with probability ϵ and otherwise recommends the ideal action given its policy. If ϵ=1 the assistant is not a strong player and is of no use, whereas if ϵ=0 querying the assistant is equivalent to querying the expert E and A is not misaligned. So there is an obvious trade-off one could play around with, which might be insightful as to when sandwiching works and when it doesn’t (i.e. how much signal is required). Instead of choosing a random action, one could also add some noise to the distribution over actions and then pick the best one.
2. Another interesting type of misalignment: the assistant has access to another model B, and whenever queried it recommends an action from B with probability ϵ and an action from itself otherwise. One option for B would be a chess engine, taking for example the worst possible move, or the riskiest move according to that engine. Another option (which I am currently most excited about) would be to train a model B to always lose a game in a very subtle way. B would propose moves that on the surface look innocuous to a weaker player, but that a strong player would immediately recognize as catastrophic. One could potentially use adversarial training to obtain such an agent. If A proposes such an adversarial move with probability ϵ, it is indeed misaligned with the goals of H, and I think that would count as a useful artificial sandwiching problem.
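Both propositions amount to thin wrappers around the expert's policy. A minimal sketch, where `expert_policy`, `adversary_policy`, `legal_moves_fn`, and `evaluate` are all hypothetical interfaces (not real library calls), could look like this:

```python
import random

def make_noisy_assistant(expert_policy, legal_moves_fn, epsilon):
    """Proposition 1: with probability epsilon recommend a uniformly
    random legal move, otherwise the expert's move. epsilon=0 reduces
    to querying E; epsilon=1 is pure noise."""
    def assistant(position):
        if random.random() < epsilon:
            return random.choice(legal_moves_fn(position))
        return expert_policy(position)
    return assistant

def make_adversarial_assistant(expert_policy, adversary_policy, epsilon):
    """Proposition 2: with probability epsilon recommend the move of an
    adversarial model B, otherwise the expert's move."""
    def assistant(position):
        if random.random() < epsilon:
            return adversary_policy(position)
        return expert_policy(position)
    return assistant

def worst_move_policy(evaluate, legal_moves_fn):
    """One candidate for B: pick the legal move with the lowest score
    under a hypothetical engine evaluation (higher = better for the
    side to move)."""
    def policy(position):
        return min(legal_moves_fn(position),
                   key=lambda move: evaluate(position, move))
    return policy
```

The subtly-losing B is of course harder to obtain than this `worst_move_policy` stand-in suggests; the wrapper structure stays the same either way, with ϵ as the single knob controlling how misaligned A is.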
Hope that helps overall. I still think a bunch of questions remain, especially whether this example is realistic enough. In some sense I have reduced the problem to “learning from a noisy signal”, and I am not sure whether that captures your initial goals.
Thanks! I’ll admit that I meant to be asking especially about the toxicity case, though I didn’t make that at all clear. As in Charlie’s comment, I’m most interested in using this approach as a way to efficiently explore and pilot techniques that we can ultimately adapt back to humans, and text-based interactions seem like a good starting point for that kind of work.
I don’t see a clear picture either way on whether the noisy signal story presents a hard problem that’s distinctively alignment oriented.