I’m reading along, and I don’t follow the section “Strategy: have AI help humans improve our understanding”. The problem so far is that the AI need only identify bad outcomes that the human labelers can identify, rather than bad outcomes regardless of whether the labelers can identify them.
The solution posed here is to use powerful AI helpers so that the human labeler can understand (and correctly label) more bad and good outcomes. The section mostly provides justification for the assumption that we can align these helper AIs (reason: the authors believe there is a counterexample even given this optimistic assumption, so that is where the meat of the discussion currently lies).
But I don’t understand why this helps. I believe the intended outcome is that SmartVault actually tries to figure out whether the diamond has been stolen, rather than whether the human believes the diamond has been stolen. But it still seems to me to be on the table for it to instead learn to identify whether the human with an AI helper believes the diamond has been stolen.
I’m not sure why the addition of an AI is expected to help, other than the obvious manner of expanding the dataset that the labeler can come to strong opinions about. Is that it? Is the idea that the helper AI allows the labeler to understand everything just as well as SmartVault does, so that there’s no difference in their respective Bayes nets, and so it works for SmartVault to use the labeler’s Bayes net?
I would appreciate clarification on this point; I suspect I’m missing something simple.
Added: The post distinguishes between a human operator and a human labeler, and I think this distinction is part of what the post takes to be the key point of this particular strategy. But I currently don’t see how an AI-assisted human operator who is able to outwit a human labeler translates into SmartVault using something other than the Bayes net the human labeler is using.
Is the idea that the helper AI allows the labeler to understand everything just as well as SmartVault does, so that there’s no difference in their respective Bayes nets, and so it works for SmartVault to use the labeler’s Bayes net?
Yes, that’s the main way this could work. The question is whether an AI understands things that humans can’t understand even by doing amplification/debate/RRM; our guess is yes, and the argument is mostly “until the builder explains why, gradient descent and science may just have pretty different strengths and weaknesses” (and we can make that more concrete by fleshing out what the world may be like and what the AI learns by gradient descent). But it seemed worth raising because this does appear to make the bad reporter’s job much harder and to greatly restrict the space of cases where it fails to report tampering.
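To make the “same Bayes net” point concrete, here is a minimal toy sketch (my own illustration with made-up names, not code or an example from the report). The direct translator reads the answer off the predictor’s internal state; the human simulator predicts what the labeler would conclude from the evidence they can evaluate. Once AI helpers surface the tampering to the labeler, the two reporters agree on this case, so training no longer rewards the human simulator for hiding it:

```python
# Toy illustration only; all names are hypothetical stand-ins, not the report's setup.

def predictor_latents(actions):
    """Stand-in for SmartVault's predictor: its internal beliefs about the episode."""
    tampering = "spoof camera" in actions
    return {"diamond_present": not tampering, "camera_shows_diamond": True}

def direct_translator(latents):
    # Honest reporter: reads the answer straight out of the predictor's state.
    return latents["diamond_present"]

def simulate_labeler(observations, labeler_understands_tampering):
    # Bad reporter: predicts what the (possibly AI-assisted) labeler would say.
    if labeler_understands_tampering and observations["signs_of_tampering"]:
        return False
    return observations["camera_shows_diamond"]

actions = ["open vault", "spoof camera"]
latents = predictor_latents(actions)
observations = {"camera_shows_diamond": True,
                "signs_of_tampering": True}  # surfaced for the labeler by AI helpers

# Unassisted labeler: the two reporters disagree, and the label rewards the wrong one.
print(direct_translator(latents), simulate_labeler(observations, False))  # False True

# AI-assisted labeler who now understands the tampering: the reporters agree.
print(direct_translator(latents), simulate_labeler(observations, True))   # False False
```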
Methodologically, the way I think about this kind of thing is: (i) we had a counterexample, (ii) after making this change that particular counterexample no longer works, (iii) now we want to think through whether the counterexample can be adapted.
This is also legitimately less obvious. An AI can’t simulate (human+AI helpers), since each AI helper is as smart as the AI itself and so simulating (human+AI helpers) clearly requires more compute than the AI has. The counterexample is that the AI should just try its best to do inference in the Bayes net that includes “everything the human could understand with the amount of science they have time to do.”
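Spelling that compute claim out (the symbols $C_{\text{AI}}$ and $C_{\text{human}}$ are mine, standing in for the compute available to the reporter and to the unaided human):

$$\operatorname{cost}\big(\text{simulate}(\text{human} + k \text{ helpers})\big) \;\gtrsim\; C_{\text{human}} + k \cdot C_{\text{AI}} \;>\; C_{\text{AI}} \quad \text{for any } k \geq 1,$$

so a reporter running on the AI’s own budget can only approximate what the assisted human would conclude, not simulate them outright.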
But that does still leave the builder with avenues to try to strengthen the algorithm and win. One way is discussed in the section on speed regularization: if the AI is “trying its best” to do inference in the human Bayes net then there might always be returns to having more time to think (and so it might be able to benefit by transferring over its understanding of what was happening in the AI Bayes net rather than recomputing from the observations). The next step for a builder who wanted to take this approach would be to argue that they can reliably construct a complex enough dataset that this advantage is relevant.
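Here is a hedged toy sketch of that speed-regularization tradeoff (my own framing and arbitrary numbers, not an implementation from the report): score each candidate reporter by its answer loss plus a penalty on extra compute, where the human simulator has to redo inference in the (augmented) human Bayes net from scratch while the direct translator reuses the predictor’s already-computed state. The human simulator only loses once the scenarios are complex enough:

```python
# Toy numbers chosen for illustration; nothing here is calibrated to the report.

LAMBDA = 0.01  # weight on the compute penalty (arbitrary)

def regularized_score(answer_loss, extra_compute):
    return answer_loss + LAMBDA * extra_compute

def direct_translator_cost(scenario_complexity):
    # Translating the predictor's already-computed state costs about the same
    # no matter how complicated the scenario is.
    return 10.0

def human_simulator_cost(scenario_complexity):
    # Redoing inference in the augmented human Bayes net from the observations
    # gets more expensive as scenarios get more complex.
    return 2.0 * scenario_complexity

# On the training distribution both reporters answer correctly (loss 0), so
# only the compute penalty separates them.
for complexity in (1, 10, 100, 1000):
    direct = regularized_score(0.0, direct_translator_cost(complexity))
    simulator = regularized_score(0.0, human_simulator_cost(complexity))
    preferred = "direct translator" if direct < simulator else "human simulator"
    print(f"complexity={complexity:5d}: {preferred} preferred")
```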
My guess is that this doesn’t work on its own, but if you could scalably construct more complex data then it might work when combined with imitative generalization, as discussed here.
This is an interesting tack; this step and the next (“Strategy: have humans adopt the optimal Bayes net”) feel new to me.
Question: what’s the relative amount of compute you are imagining SmartVault and the helper AI having? Both the same, or one having a lot more?
It will depend on how much high-quality data you need to train the reporter. Probably it’s a small fraction of the data you need to train the predictor, and so for generating each reporter datapoint you can afford to use many times more data than the predictor usually uses. I often imagine the helpers having 10-100x more computation time.
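To illustrate the arithmetic behind that 10-100x figure (the specific numbers below are my own assumptions, not figures from the report): if the reporter needs only ~1% as many labeled datapoints as the predictor, then even spending 100x the predictor’s per-datapoint compute on helpers keeps the total labeling bill comparable to predictor training.

```python
# Illustrative arithmetic only; every number here is an assumption for the sketch.

predictor_datapoints = 1_000_000_000                 # datapoints used to train the predictor
reporter_datapoints = predictor_datapoints // 100    # assume the reporter needs ~1% as much data
helper_overhead = 100                                # helpers get 100x the per-datapoint compute

predictor_compute = predictor_datapoints * 1         # per-datapoint compute normalized to 1
labeling_compute = reporter_datapoints * helper_overhead

print(f"labeling compute / predictor compute = {labeling_compute / predictor_compute:.0%}")
# -> 100%: even at 100x per-datapoint overhead, generating the reporter's
#    training data costs about as much as training the predictor, not vastly more.
```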