If we want to argue this, we should first come up with a terrible x-risk scenario that has low objective impact. I have yet to see one, although such scenarios could exist. The evidence so far points towards “objective impact is sufficient”.
We’d like to build AI systems that help us resolve the tricky situation that we’re in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.
I think the default “terrible” scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.
That may ultimately culminate with a catastrophic bang, but if it does it’s not going to be because we wanted the AI to have a small impact and it had a large impact. It’s probably going to be because we have a very limited idea what is going on, but we don’t feel like we have the breathing room to step back and chill out (at least not for long) because we don’t believe that everyone else is going to give us time.
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
(And realistically I doubt we’ll fail at alignment with a bang—it’s more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn’t let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)
It seems like “low objective impact” is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that’s a coherent situation to think about and plan for, but we shouldn’t mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it’s the best hope if you were very pessimistic about what I consider “mainline” alignment.)
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
Solving low impact seems like it would allow us to ensure that each low impact agent won’t push the world in a given direction by more than some bounded, (presumably) small amount. If we’re thinking of my new measure in particular, it would also help ensure that we won’t be surprised by the capability gain of any single agent, which might help even if we aren’t expecting the spontaneous arrival of a singleton. A good formulation of low impact would have the property that the interactions of multiple such agents don’t amount to more than the sum of the constituent impact budgets. In this sense, I think it’s sensible to see measuring and restricting objective impact (implicitly thinking of my approach here) as helpful for slowing down the situation.
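To make the budget-summing property concrete, here is a minimal toy sketch. Everything in it is hypothetical illustration, not part of any proposed impact measure: the world is a single scalar, each agent’s “impact” is just the size of the change it makes, and the clipping rule stands in for whatever mechanism enforces a per-agent budget. The point is only to show what subadditivity of impact budgets means: the combined impact of several budget-respecting agents never exceeds the sum of their individual budgets.

```python
# Toy sketch (hypothetical model): scalar world state, per-agent impact
# budgets, and a check that combined impact stays within the budget sum.

def agent_step(state, proposed_delta, budget):
    """Apply one agent's action, clipped so its impact never exceeds its budget.
    Returns the new state and the impact actually incurred."""
    clipped = max(-budget, min(budget, proposed_delta))
    return state + clipped, abs(clipped)

def run_agents(state, actions):
    """actions: list of (proposed_delta, budget) pairs, applied in sequence.
    Returns the final state and the total impact incurred."""
    total_impact = 0.0
    for delta, budget in actions:
        state, used = agent_step(state, delta, budget)
        total_impact += used
    return state, total_impact

# Three agents, each with its own budget; two propose over-budget actions.
actions = [(5.0, 1.0), (-0.3, 1.0), (2.0, 0.5)]
state, total = run_agents(0.0, actions)
budget_sum = sum(budget for _, budget in actions)

# The desired property: total impact is bounded by the sum of the budgets.
assert total <= budget_sum
```

In this toy model the property holds trivially because each agent’s impact is enforced independently; the open problem the paragraph above gestures at is finding a formulation where the guarantee survives agents *interacting* (e.g. one agent amplifying another’s effect), which this sketch does not model.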
I also think that, depending on the specific formulation, a low impact solution would enable a substantial reduction in the problems which we need to solve ourselves. That is, I think solving low impact might make useful technical oracles possible. It might be the case that we only need a portion of the agent foundations agenda + low impact in order to build these oracles, which we could then use to help us solve value alignment/corrigibility/etc.
I am also aware that using these oracles would not (naively) be low impact; I plan to outline how we could maybe get around this in a robust manner as soon as I am able.