(Note: I read an earlier draft of this report and had a lot of clarifying questions, which are addressed in the public version. I’m continuing that process here.)
I get the impression that you see most of the “builder” moves as helpful (on net, in expectation), even if there are possible worlds where they are unhelpful or harmful. For example, the “How we’d approach ELK in practice” section talks about combining several of the regularizers proposed by the “builder.” It also seems like you believe that combining multiple regularizers would create a “stacking” benefit, driving the odds of success ever higher.
But I’m generally not having an easy time understanding why you hold these views. In particular, a central scary case I’m thinking of is something like: “We hit the problem described in the ‘New counterexample: ontology mismatch’ section, and with the unfamiliar ontology, it’s just ‘easier/more natural’ in some basic sense to predict observations like ‘The human says the diamond is still there’ than to find ‘translations’ into a complex, unwieldy human ontology.” In this case, it seems like penalizing complexity, computation time, and ‘downstream variables’ (via rewarding reporters for requesting access to limited activations) probably make things worse. (I think this applies less to the last two regularizers listed.)
Right now, the writeup talks about possible worlds in which a given regularizer could be helpful, and possible worlds in which it could be unhelpful. I’d value more discussion of the intuition for whether each one is likely to be helpful, and in particular, whether it’s likely to be helpful in worlds where the previous ones are turning out unhelpful.
For example, the “How we’d approach ELK in practice” section talks about combining several of the regularizers proposed by the “builder.” It also seems like you believe that combining multiple regularizers would create a “stacking” benefit, driving the odds of success ever higher.
This is because of the remark on ensembling—as long as we aren’t optimizing for scariness (or diversity for diversity’s sake), it seems like it’s way better to have tons of reporters and then see if any of them report tampering. So adding more techniques improves our chances of getting a win. And if the cost of fine-tuning a reporter is small relative to the cost of training the predictor, we can potentially build a very large ensemble relatively cheaply.
(Of course, having more techniques also helps because you can test many of them in practice and see which of them seem to really help.)
This is also true for data—I’d be scared about generating a lot of riskier data, except that we can just train on both the safer and the riskier data and see if either of the resulting reporters reports tampering in a given case (since they appear to fail for different reasons).
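To make the ensembling picture concrete, here is a minimal sketch (my own construction, not something from the report) of fine-tuning many cheap reporters against a single predictor and flagging a case whenever any of them reports tampering. The `Reporter` type and both functions are hypothetical placeholders.

```python
from typing import Callable, Iterable, List

# A "Reporter" here is just any callable mapping a scenario description to a
# boolean "I believe tampering occurred" judgment; this is a toy framing, not
# an interface from the report.
Reporter = Callable[[dict], bool]


def build_reporter_ensemble(
    finetune_reporter: Callable[[dict], Reporter],
    training_configs: Iterable[dict],
) -> List[Reporter]:
    """Fine-tune one reporter per (regularizer, dataset, ...) configuration.

    Fine-tuning a reporter is assumed to be cheap relative to training the
    predictor, so a large ensemble is comparatively affordable.
    """
    return [finetune_reporter(cfg) for cfg in training_configs]


def flag_tampering(reporters: List[Reporter], scenario: dict) -> bool:
    """Flag a scenario if *any* reporter in the ensemble reports tampering.

    We aren't optimizing for scariness (or diversity for its own sake); we
    just take the union of the ensemble's warnings.
    """
    return any(reporter(scenario) for reporter in reporters)
```

The point is just that the check is a union over warnings, so adding another reporter (or another dataset) can only add coverage, provided we act on any flag rather than averaging the ensemble's answers together.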
It also seems like you believe that combining multiple regularizers would create a “stacking” benefit, driving the odds of success ever higher.
I believe this in a few cases (especially combining “compress the predictor,” imitative generalization, penalizing dependence on downstream nodes, and the kitchen sink of consistency checks), but mostly the stacking is good because ensembling means that having more and more options is better and better.
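For concreteness, one way to read “combining” regularizers within a single training run is just adding several penalty terms to the reporter’s training objective. The sketch below is schematic and assumes made-up penalty functions and weights; it is not ARC’s proposal.

```python
from typing import Callable, Dict

# Schematic only: each penalty maps a candidate reporter to a scalar score.
# The particular penalties and weights are placeholders.
Penalty = Callable[[object], float]


def combined_objective(
    reporter: object,
    qa_loss: float,
    penalties: Dict[str, Penalty],
    weights: Dict[str, float],
) -> float:
    """Stack several regularizers into a single training objective.

    `qa_loss` is the reporter's loss on question-answering data; entries in
    `penalties` might be a complexity penalty, a computation-time penalty, or
    a penalty on depending on downstream nodes.
    """
    return qa_loss + sum(
        weights[name] * penalty(reporter) for name, penalty in penalties.items()
    )
```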
Right now, the writeup talks about possible worlds in which a given regularizer could be helpful, and possible worlds in which it could be unhelpful. I’d value more discussion of the intuition for whether each one is likely to be helpful, and in particular, whether it’s likely to be helpful in worlds where the previous ones are turning out unhelpful.
I don’t think the kind of methodology used in this report (or by ARC more generally) is very well-equipped to answer most of these questions. Once we give up on the worst case, I’m more inclined to do much messier and more empirically grounded reasoning. I do think we can learn some things in advance, but doing so requires getting really serious about it (and even then we’d still want to learn from early experiments and mostly focus on designing experiments) rather than taking potshots. This is related to a lot of my skepticism about other theoretical work.
I do expect the kind of research we are doing now to help with ELK in practice even if the worst-case problem is impossible. But the particular steps we are taking now are mostly going to help by suggesting possible algorithms and difficulties; we’d then want to feed those in as one input to that much messier process of thinking about what’s really going to happen.
In this case, it seems like penalizing complexity, computation time, and ‘downstream variables’ (via rewarding reporters for requesting access to limited activations) probably make things worse. (I think this applies less to the last two regularizers listed.)
I think this is plausible for complexity, and to a lesser extent for computation time. I don’t think it’s very plausible for the most exciting regularizers, e.g. a good version of penalizing dependence on downstream nodes, or the versions of the computation-time penalty that scale best (and that really try to incentivize the reporter to “reuse” inference that was already done by the predictor). I think I do basically believe the arguments given in those cases; e.g. I can’t easily see how translation into the human ontology can be more downstream than “use the predictor’s state to generate observations, then parse those observations.”
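To illustrate the downstream-dependence intuition with a toy example (my own construction, not the report’s scheme): charge the reporter for each predictor activation it reads, with the charge growing with how close that activation is to the predicted observations. A human simulator that works by “generate the observations, then parse them” has to read the most downstream nodes and pays the most.

```python
from typing import Dict, Iterable


def downstream_penalty(
    accessed_nodes: Iterable[str],
    depth: Dict[str, int],   # 0 = raw inputs, max_depth = predicted observations
    max_depth: int,
    scale: float = 1.0,
) -> float:
    """Toy penalty: reading more-downstream activations costs more.

    A reporter that answers by regenerating the predicted observations and
    parsing them must read nodes near depth == max_depth, so it pays the
    largest penalty; a reporter that translates from intermediate (more
    upstream) state pays less.
    """
    return scale * sum(depth[node] / max_depth for node in accessed_nodes)


# A "human simulator" reads the predicted video (most downstream), while a
# direct translator reads intermediate latent state.
depth = {"latent_state_t3": 5, "predicted_video": 10}
print(downstream_penalty(["predicted_video"], depth, max_depth=10))   # 1.0
print(downstream_penalty(["latent_state_t3"], depth, max_depth=10))   # 0.5
```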