I’m generally bad at communicating about this kind of thing, and it seems like a kind of sensitive topic to share half-baked thoughts on. In this AMA all of my thoughts are half-baked, and in some cases here I’m commenting on work that I’m not that familiar with. All that said I’m still going to answer but please read with a grain of salt and don’t take it too seriously.
Vanessa Kosoy’s learning-theoretic agenda, e.g. the recent sequence on infra-Bayesianism, or her work on traps in RL. Michael Cohen’s research, e.g. the paper on imitation learning, seems to go in a similar direction.
I like working on well-posed problems, and proving theorems about well-posed problems is particularly great.
I don’t currently expect to be able to apply those kinds of algorithms directly to alignment for various reasons (e.g. no source of adequate reward function that doesn’t go through epistemic competitiveness which would also solve other aspects of the problem, not practical to get exact imitation), so I’m mostly optimistic about learning something in the course of solving those problems that turns out to be helpful. I think that’s plausible because these formal problems do engage some of the difficulties from the full problem.
I think that’s probably less good than work that I regard as more directly applicable, though not by a huge factor. I think the major disagreement there is that other folks do regard it as more directly applicable, and I think it’s pretty great to sort through those disagreements.
I think that it’s a good idea to apply this kind of analysis to RL systems we deploy, I think it’s valuable to get RL researchers to think through some of these questions formally, and I think that this kind of work can make it easier for them to do that, and more likely that doing so leads to correct conclusions.
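To make “this kind of analysis” slightly more concrete, here is a minimal sketch of the graphical check as I understand it, using my own toy encoding of an influence diagram as a plain directed graph rather than the causal-incentives group’s actual tooling: roughly, a decision D has a control incentive on a variable X when X sits on a directed path from D to a utility node U.

```python
# Minimal sketch of the graphical control-incentive check on a toy causal
# influence diagram, encoded as a plain directed graph. This is my own
# illustrative encoding, not the authors' library.
import networkx as nx

def has_control_incentive(cid: nx.DiGraph, decision: str, variable: str, utility: str) -> bool:
    """True iff `variable` lies on a directed path from `decision` to `utility`."""
    if variable in (decision, utility):
        return False
    return nx.has_path(cid, decision, variable) and nx.has_path(cid, variable, utility)

# Toy diagram: the agent's action A influences the task outcome T and, as a
# side channel, the human's reported preferences H; only T feeds the reward R.
cid = nx.DiGraph([("A", "T"), ("A", "H"), ("T", "R")])
print(has_control_incentive(cid, "A", "T", "R"))  # True: T is on a path A -> T -> R
print(has_control_incentive(cid, "A", "H", "R"))  # False: H never reaches the reward
```

The value of such a check of course depends entirely on the diagram you feed it, which is part of what the concerns below are about.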
I have a few concerns about whether this kind of framework can substitute for or complement other kinds of safety work:
Even if the framework was exhaustive, it seems like most of the systems we want to deploy will ultimately have problematic control incentives and so we won’t be able to use this as a filter or key component of our analysis strategy.
In light of that it kind of feels like we need to have a more detailed understanding of the particular way in which our agents are responding to those incentives (or at least a much more detailed understanding of exactly what those incentives are, if we are setting aside inner alignment problems), and at that point it’s less clear to me that we would need a more general framework.
I think the existing approach and easy improvements don’t seem like they can capture many important incentives such that you don’t want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A’s predictions about B’s actions—then we want to say that the system has an incentive to manipulate the world but it doesn’t seem like that is easy to incorporate into this kind of formalism).
Overall I’m somewhat more excited about the learning-theoretic agenda (discussed above) than this line of work.
Work on agent foundations, such as on cartesian frames. You have commented on MIRI’s research in the past, but maybe you have an updated view.
I sometimes find this kind of work (especially logical inductors and to a lesser extent cartesian frames) helpful in my own thinking about reasoners, e.g. as a source of examples or to help see a way to get a handle on a messy system. Again, I’d significantly prefer something that seemed to be more directly going for the throat and then doing other stuff when it came up, but I do think it has significant value, probably in between the last two.
Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I’ve paraphrased the three bullet points, and responded in reverse order:
3) Many important incentives are not captured by the approach—e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment.
-> Agreed. We’re starting to study “side-effect incentives” (improved name pending), which have this property. We’re still figuring out whether we should just care about the union of SE incentives and control incentives, or whether, or when, SE incentives should be considered less dangerous. Whether the causal style of incentive analysis captures much of what we care about will, I think, be borne out by applying it and alternatives to a bunch of safety problems.
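To illustrate the distinction as I’ve been thinking of it (a rough structural sketch only; as noted, the definition is still pending), a side-effect candidate is a variable the decision can influence even though it doesn’t causally feed the utility:

```python
# Rough structural sketch (not a settled definition) of a "side-effect
# incentive" candidate: the decision can influence the variable, but the
# variable does not cause the utility. Whether the variable actually gets
# affected still depends on the optimal policy, which this check ignores.
import networkx as nx

def side_effect_candidate(cid: nx.DiGraph, decision: str, variable: str, utility: str) -> bool:
    return nx.has_path(cid, decision, variable) and not nx.has_path(cid, variable, utility)

# Toy diagram: action A reaches reward R via outcome T, and also disturbs an
# environment variable E that the reward never depends on.
cid = nx.DiGraph([("A", "T"), ("T", "R"), ("A", "E")])
print(side_effect_candidate(cid, "A", "E", "R"))  # True: influenced, but causally irrelevant to R
print(side_effect_candidate(cid, "A", "T", "R"))  # False: T is a control-incentive case instead
```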
2) Sometimes we need more specific quantities than just “D affects A”.
-> Agreed. We’ve privately discussed directional quantities like “do(D=d) causes A=a” as being more safety-relevant, and are happy to hear other ideas.
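To give one concrete (and by no means settled) way such a directional quantity could be cashed out, in a structural causal model with the background context held fixed:

$$\operatorname{do}(D=d)\ \text{causes}\ A=a \quad\iff\quad A_{\operatorname{do}(D=d)} = a \ \text{ and }\ \exists\, d' \neq d:\ A_{\operatorname{do}(D=d')} \neq a,$$

i.e. the intervention actually brings the outcome about, and some alternative intervention would not have.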
1) Eliminating all control incentives seems unrealistic.
-> Strongly agree it’s infeasible to remove CIs on all variables. My more modest goal would be, for particular variables (or classes of variables) such as a shutdown button or a human’s values, to either 1) prove how to remove control (and side-effect) incentives, or 2) prove why this is impossible given realistic assumptions. If (2), that theoretical case could justify allocating resources to learning-oriented approaches.
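As a toy picture of what 1) could look like under the path criterion (a made-up diagram, not a worked safety case): if the agent’s utility is made indifferent to the shutdown event, the button drops off every decision-to-utility path and the control incentive on it disappears under that criterion.

```python
# Made-up toy diagram illustrating how a control incentive on a shutdown
# button S can be removed under the path criterion (variable on a directed
# decision -> utility path). Not a worked safety case.
import networkx as nx

def has_control_incentive(cid: nx.DiGraph, decision: str, variable: str, utility: str) -> bool:
    return nx.has_path(cid, decision, variable) and nx.has_path(cid, variable, utility)

# Original design: the agent's action D can affect the button S, and being
# shut down changes how much utility U it collects.
original = nx.DiGraph([("D", "S"), ("S", "U"), ("D", "U")])
print(has_control_incentive(original, "D", "S", "U"))  # True: incentive to press or block the button

# Modified design: the utility is made indifferent to the shutdown event,
# so the edge S -> U is removed and the criterion no longer fires.
modified = nx.DiGraph([("D", "S"), ("D", "U")])
print(has_control_incentive(modified, "D", "S", "U"))  # False
```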
Overall, I concede that we haven’t engaged much with safety issues in the last year. Partly that’s because the projects have had to fit within people’s PhDs, which will also be true this year. But with some of the framework work behind us, we should still be able to study safety more, and get a sense of how addressable concerns like these are and to what extent causal decision problems/games are a really useful ontology for AI safety.
I think the existing approach and easy improvements don’t seem like they can capture many important incentives such that you don’t want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A’s predictions about B’s actions—then we want to say that the system has an incentive to manipulate the world but it doesn’t seem like that is easy to incorporate into this kind of formalism).
This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs). We’re still working these out, as there are a range of subtleties, but I’m pretty confident we’ll end up with a good account of them.
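For concreteness, here is roughly how the quoted example could be written down as a two-agent diagram (hypothetical node names, structure only; the incentive concepts for the coupled system are exactly the part we are still working out):

```python
# Rough structural encoding (hypothetical node names) of the quoted two-agent
# example. A's prediction P_A is scored for accuracy (U_A); B's action D_B is
# scored by what A predicts about it (U_B). The world W couples the two, which
# is why the incentives of the combined system call for multi-agent analysis
# rather than two separate single-agent checks.
import networkx as nx

macid = nx.DiGraph([
    ("D_B", "W"),    # B's action changes the world
    ("W", "P_A"),    # A predicts the world, including B's behaviour
    ("D_B", "P_A"),
    ("W", "U_A"),    # A's utility: prediction accuracy
    ("P_A", "U_A"),
    ("P_A", "U_B"),  # B's utility: how A's prediction scores B's action
])
print(sorted(macid.nodes))  # ['D_B', 'P_A', 'U_A', 'U_B', 'W']
```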