I believe that it is very sensible to bring this sort of structure into our approach to AGI safety research, but at the same time it seems very clear that we should update that structure to the best of our ability as we make progress in understanding the challenges and potential of different approaches.
It is a feedback loop where we make each step according to our best theory of where to make it, and use the understanding gleaned from that step to update the theory (when necessary), which could well mean that we retrace some steps and recalibrate (this can be the case within and across questions). I think this connects to what both Charlie and Tekhne have said, though I believe Tekhne could have been more charitable.
In this light, it makes sense to emphasize the openness of the theory to being updated in this way, which also qualifies the ways in which the theory is, for now, allowed to remain incomplete. Putting more effort into clarifying what this update process should look like seems like a promising addition to the framework that you propose.
On a more specific note, I felt that Q5 could simply be moved to position 2, and perhaps a sixth question could be “What is the predicted timeline for stable safety/control implementations?” or something of the sort.
I also think that phrasing our research in terms of “avoiding bad outcomes” and “controlling the AGI” biases the way in which we pay attention to these problems. I am sure that you will also touch on this in the more detailed presentation of these questions, but at the resolution presented here, I would prefer the phrasing to be more open.

“Aiming at good outcomes while/and avoiding bad outcomes” captures more conceptual territory, while still allowing for the investigation to turn out that avoiding bad outcomes is more difficult and should be prioritised. This extends to the meta-question of whether existential risk can best be addressed by focusing on avoiding bad outcomes, rather than developing a strategy to get to good outcomes (which are often characterised by a better ability to deal with future risks) and avoid bad outcomes on the way there. It might rightfully appear that this is a more ambitious aim, but it is the less predisposed outlook! Many strategy games are based on the idea that you have to accumulate resources and avoid losses while at the same time improving your ability to accumulate resources and avoid losses in the future. Only focusing on the first aspect is a specific strategy in the space of possible ones, and one often employed when a player is close to losing. This isn’t a perfect analogy in a number of ways, but it serves to point out the more general outlook.

Similarly, we expect a superintelligent AGI to be beyond our ability to control at some point, which invokes notions of “self-control” on the part of the AGI or “justified trust” on our part—therefore, perhaps “influencing the development of the AGI” would be better, as, again, “influence” can cover more conceptual ground but can still be hardened into the more specific notion of “control” when appropriate.
Hey Robert—thanks for your comment!

> it seems very clear that we should update that structure to the best of our ability as we make progress in understanding the challenges and potential of different approaches.

Definitely agree—I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.

> “Aiming at good outcomes while/and avoiding bad outcomes” captures more conceptual territory, while still allowing for the investigation to turn out that avoiding bad outcomes is more difficult and should be prioritised. This extends to the meta-question of whether existential risk can best be addressed by focusing on avoiding bad outcomes, rather than developing a strategy to get to good outcomes (which are often characterised by a better ability to deal with future risks) and avoid bad outcomes on the way there.

I definitely agree with the value of framing AGI outcomes both positively and negatively, as I discuss in the previous post. I am less sure that AGI safety as a field necessarily requires deeply considering the positive potential of AGI (i.e., as long as AGI-induced existential risks are avoided, I think AGI safety researchers can consider their venture successful). But, much to your point, if the best way of actually achieving this outcome is by thinking about AGI more holistically—e.g., instead of explicitly avoiding existential risks, asking how to build an AGI that we would want to have around—then I think I would agree. I just think this sort of thing would radically redefine the relevant approaches undertaken in AGI safety research. I by no means want to reject radical redefinitions out of hand (this very well could be correct); I only want to say that it is probably not the path of least resistance given where the field currently stands.
(And agreed on the self-control point, as you know. See directionality of control in Q3.)