Here are a couple of hand-wavy “stub” proposals that I sent over to ARC, which they thought were broadly intended to be addressed by existing counterexamples. I’m posting them here so they can respond and clarify why these don’t qualify.
*Proposal 1: force ontological compatibility*
On page 34 of the ELK gdoc, the authors talk about the possibility that training an AI hard enough produces a model that has deep mismatches with human ontology: its “vocabulary of basic concepts” (or nodes in a Bayes net) is distinct from the one humans can build an understanding of (via doing science on the compute budget available). Because of this, even AI assistance can’t help humans understand everything the SmartVault AI understands. This is central to the challenge that most of the writeup is contending with: if not for the mismatch, “AIs explaining things to humans” could ensure that the trickery we’re worried about doesn’t happen.
The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: “The AI model gets a higher score to the degree that people doing ‘digital neuroscience’ would have an easier time, and find more interesting things, probing its ‘digital brain.’” So an AI with neurons corresponding to diamonds, robbers, sensors, etc. would outscore an AI whose neurons can’t easily be seen to correspond to any human-familiar concepts.
With this term included, you can “train the AI hard” without having it diverge from human-friendly ontology.
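To make the shape of that loss term concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function name `total_loss`, the `interpretability` score in [0, 1], and the `weight` are all hypothetical, and the sketch says nothing about how such a score would actually be measured.

```python
def total_loss(prediction_loss: float, interpretability: float, weight: float = 0.1) -> float:
    """Task loss plus a penalty for ontology that humans find hard to probe.

    `interpretability` is a hypothetical score in [0, 1] (1 = every neuron maps
    cleanly onto a human-familiar concept). How to compute such a score is the
    open problem; here it is simply taken as an input.
    """
    if not 0.0 <= interpretability <= 1.0:
        raise ValueError("interpretability must lie in [0, 1]")
    return prediction_loss + weight * (1.0 - interpretability)


# Two models with equal predictive performance: the one whose neurons track
# diamonds, robbers, and sensors gets the lower (better) loss.
familiar = total_loss(prediction_loss=0.5, interpretability=0.9)
alien = total_loss(prediction_loss=0.5, interpretability=0.2)
print(familiar < alien)  # True
```

In this toy version, `weight` sets how much predictive performance the training process is willing to give up in exchange for a more legible model.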
*Proposal 2: “mini-model”*
In this proposal, we construct a head for the AI model whose role is to construct a small AI model that performs as well as possible on the same (predictor, reporter) tasks. In each round of training, the master AI model can modify the mini-model; it also receives a loss based on how well the *previous round’s* mini-model performs on this set of examples. Crucially, the mini-model is sufficiently small that we’re confident it is “significantly less intelligent” than the (perhaps AI-assisted) human supervisor.
The basic hope here is that a “human imitator” model would construct a small model that also tries to imitate the human supervisor—but that this model would perform very badly, because trying to imitate the _reasoning process of a larger model_ is a lot more difficult than simply trying to reason about the world and translate concepts. Or, the “human imitator” model could build a “mini-model” based on entirely different principles, but the hope is that this makes things harder for it compared to the direct translator, which is just compressing what it has already built.
There are lots of potential failure modes here, e.g. maybe it’s just not that hard to have a mini-model that successfully imitates the human. I didn’t get that far with this one, but it was apparently enough for ARC to think it’s already counterexampled by existing counterexamples :)
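The round structure of the mini-model proposal can be sketched as a toy loop. This is only an illustration under stated assumptions: `run_rounds`, `mini_model_loss`, and `propose_mini` are hypothetical names, the mini-models are one-parameter functions, and the size limit on the mini-model is not modeled at all. The only thing the sketch shows is the one-round delay: the master is charged in round t for the mini-model it built in round t-1.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[float, float]          # (input, target) pair
MiniModel = Callable[[float], float]   # stand-in for a small model


def mini_model_loss(mini: MiniModel, examples: Sequence[Example]) -> float:
    """Mean squared error of a mini-model on the shared example set."""
    return sum((mini(x) - y) ** 2 for x, y in examples) / len(examples)


def run_rounds(
    propose_mini: Callable[[int], MiniModel],
    examples: Sequence[Example],
    n_rounds: int,
) -> List[float]:
    """Return the mini-model term charged to the master model in each round.

    Round t is charged for the mini-model produced in round t-1. Round 0 has
    no predecessor, so this sketch charges it 0.0 (an arbitrary choice).
    """
    charged: List[float] = []
    previous: Optional[MiniModel] = None
    for t in range(n_rounds):
        charged.append(0.0 if previous is None else mini_model_loss(previous, examples))
        previous = propose_mini(t)  # the master modifies the mini-model this round
    return charged


# Toy setup: the target is y = 2x, and the round-t mini-model is x -> t * x.
examples = [(1.0, 2.0), (2.0, 4.0)]
propose = lambda t: (lambda x, slope=float(t): slope * x)
print(run_rounds(propose, examples, n_rounds=4))  # [0.0, 10.0, 2.5, 0.0]
```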
Again trying to answer this one despite not feeling fully solid. I’m not sure about the second proposal and might come back to it, but here’s my response to the first proposal (force ontological compatibility):
The counterexample “Gradient descent is more efficient than science” should cover this proposal because it implies that the proposal is uncompetitive. Basically, the best Bayes net for making predictions could just turn out to be the super incomprehensible one found by unrestricted gradient descent, so if you force ontological compatibility then you could just end up with a less-good prediction model and get outcompeted by someone who didn’t do that. This might work in practice if the competitiveness hit is not that big and we coordinate around not doing the scarier thing (MIRI’s visible thoughts project is going for something like this), but ARC isn’t looking for a solution of that form.
I’m not sure why this isn’t a very general counterexample. Once we’ve decided that the human imitator is simpler and faster to compute, don’t all further approaches (e.g., penalizing inconsistency) involve a competitiveness hit along these general lines? Aren’t they basically designed to drag the AI away from a fast, simple human imitator toward a slow, complex reporter? If so, why is that better than dragging the AI from a foreign ontology toward a familiar ontology?
There is a distinction between the way that the predictor is reasoning and the way that the reporter works. Generally, we imagine that the predictor is trained the same way as the “unaligned benchmark” we’re trying to compare to, and the reporter is the thing that we add onto that to “align” it (perhaps by only training another head on the model, perhaps by finetuning). Hopefully, the cost of training the reporter is small compared to the cost of the predictor (maybe like 10% or something).
In this frame, doing anything to change the way the predictor is trained results in a big competitiveness hit, e.g. forcing the predictor to use the same ontology as a human is potentially going to prevent it from using concepts that make reasoning much more efficient. However, training the reporter in a different way, e.g. doubling the cost of training the reporter, only takes you from 10% of the predictor to 20%, which is not that bad of a competitiveness hit (assuming that the human imitator takes 10% of the cost of the original predictor to train).
In summary, competitiveness for ELK proposals primarily means that you can’t change the way the predictor was trained. We are already assuming/hoping the reporter is much cheaper to train than the predictor, so making the reporter harder to train results in a much smaller competitiveness hit.
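The arithmetic behind that comparison can be written out as a small sketch. The 10% reporter fraction and the doubling of reporter cost come from the comment above. The function name `relative_total_cost` and the 2x multiplier on the predictor are assumptions added purely for contrast, not figures anyone in the thread has claimed.

```python
def relative_total_cost(
    reporter_fraction: float = 0.10,
    reporter_multiplier: float = 1.0,
    predictor_multiplier: float = 1.0,
) -> float:
    """Total training cost, in units of the unaligned predictor's cost.

    The reporter's baseline cost is `reporter_fraction` of the original
    predictor's cost; each multiplier scales one component independently.
    """
    predictor = 1.0 * predictor_multiplier
    reporter = reporter_fraction * reporter_multiplier
    return predictor + reporter


baseline = relative_total_cost()                                  # predictor + 10% reporter
harder_reporter = relative_total_cost(reporter_multiplier=2.0)    # reporter goes 10% -> 20%
harder_predictor = relative_total_cost(predictor_multiplier=2.0)  # hypothetical 2x predictor

print(f"{baseline:.2f} {harder_reporter:.2f} {harder_predictor:.2f}")  # 1.10 1.20 2.10
```

Under these assumed numbers, doubling the reporter's cost raises the total by about 9%, while doubling the predictor's cost nearly doubles it.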
The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: “The AI model gets a higher score to the degree that people doing ‘digital neuroscience’ would have an easier time, and find more interesting things, probing its ‘digital brain.’” So an AI with neurons corresponding to diamonds, robbers, sensors, etc. would outscore an AI whose neurons can’t easily be seen to correspond to any human-familiar concepts.
I think that a lot depends on what kind of term you include.
If you just say “find more interesting things” then the model will just have a bunch of neurons designed to look interesting. Presumably you want them to be connected in some way to the computation, but we don’t really have any candidates for defining that in a way that does what you want.
In some sense I think if the digital neuroscientists are good enough at their job / have a good enough set of definitions, then this proposal might work. But I think that the magic is mostly being done in the step where we make a lot of interpretability progress, and so if we define a concrete version of interpretability right now it will be easy to construct counterexamples (even if we define it in terms of human judgments). If we are just relying on the digital neuroscientists to think of something clever, the counterexample will involve something like “they don’t think of anything clever.” In general I’d be happy to talk about concrete proposals along these lines.
(I agree with Ajeya and Mark that the hard case for this kind of method is when the most efficient way of thinking is totally alien to the human. I think that can happen, and in that case in order to be competitive you basically just need to learn an “interpreted” version of the alien model. That is, you need to basically show that if there exists an alien model with performance X, there is a human-comprehensible model with performance X, and the only way you’ll be able to argue that is to show that for any model we can define a human-comprehensible model with similar complexity and the same behavior.)