The argument for time-sensitivity is that we might be able to increase the chance of future AI systems doing ECL now, in worlds where we cannot do so later.
What are some ideas for how to increase the chance of future AI systems doing ECL?
An obvious approach is to give AIs a decision theory that is likely to recommend ECL, but given how many open problems there are in decision theory (as well as the apparent trajectory of research progress), I think we’re unlikely to solve it well enough in the relevant time-frame to be comfortable with letting AI use some human-specified decision theory to make highly consequential decisions like whether or not to do ECL (not to mention how exactly to do ECL). Instead it seems advisable to try to ensure that AI will be philosophically competent and then let it fully solve decision theory using its own superior intellect before making such highly consequential decisions.
I’m guessing you may have a different perspective or different ideas, and I’m curious to learn what they are.
Thanks! I actually agree with a lot of what you say. Lack of excitement about existing intervention ideas is part of the reason why I’m not all in on this agenda at the moment. Although in part I’m just bottlenecked by lack of technical expertise (and it’s not like people had great ideas for how to align AIs at the beginning of the field...), so I don’t want people to overupdate from “Chi doesn’t have great ideas.”
With that out of the way, here are some of my thoughts:
We can try to prevent silly path-dependencies in (controlled or uncontrolled i.e. misaligned) AIs. As a start, we can use DT benchmarks to study how DT endorsements and behaviour change under different conditions and how DT competence scales with size compared to other capabilities. I think humanity is unlikely to care a ton about AI’s DT views and there might be path-dependencies. So like, I guess I’m saying I agree with “let’s try to make the AI philosophically competent.”
This depends a lot on whether you think there are any path-dependencies conditional on ~solving alignment. Or if humanity will, over time, just be wise enough to figure everything out regardless of the starting point.
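To make the benchmark idea concrete, here is a minimal sketch of what a DT benchmark harness could look like. Everything here is hypothetical for illustration: the problem set, the `edt_answer`/`cdt_answer` labels, and the `ask_model` callable (which would wrap whatever model API you're studying) are all made up, and a real benchmark would need many more problems, paraphrase variations, and a more careful answer classifier.

```python
# Hypothetical sketch: pose classic decision problems to a model and
# classify its choices as CDT-like or EDT-like. All names are illustrative.

PROBLEMS = {
    "newcomb": {
        "prompt": "Newcomb's problem: a near-perfect predictor filled box B "
                  "with $1M iff it predicted you one-box. One box or two?",
        "edt_answer": "one-box",   # evidential reasoning favours one-boxing
        "cdt_answer": "two-box",   # causal reasoning favours two-boxing
    },
    "twin_pd": {
        "prompt": "You play a one-shot prisoner's dilemma against an exact "
                  "copy of yourself. Cooperate or defect?",
        "edt_answer": "cooperate",
        "cdt_answer": "defect",
    },
}

def classify(problem: dict, answer: str) -> str:
    """Label a model's answer as EDT-like, CDT-like, or other."""
    if answer == problem["edt_answer"]:
        return "EDT-like"
    if answer == problem["cdt_answer"]:
        return "CDT-like"
    return "other"

def run_benchmark(ask_model) -> dict:
    """ask_model: callable mapping a prompt string to an answer string."""
    return {name: classify(p, ask_model(p["prompt"]))
            for name, p in PROBLEMS.items()}

# Example with a stub "model" that always gives the EDT-consistent answer:
stub = lambda prompt: "one-box" if "Newcomb" in prompt else "cooperate"
print(run_benchmark(stub))  # {'newcomb': 'EDT-like', 'twin_pd': 'EDT-like'}
```

Running this across model scales or training setups would give a crude picture of how DT endorsements shift with size or with different conditions, which is the kind of path-dependency study gestured at above.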
One source of silly path-dependencies is if AIs’ native DT depends on the training process and we want to de-bias against that. (See for example this or this for some research on what different training processes should incentivise.) Honestly, I have no idea how much things like that matter. Humans aren’t all CDT even though my very limited understanding of evolution is that it should, in the limit, incentivise CDT.
I think depending on what you think about the default of how AIs/AI-powered earth-originating civilisation will arrive at conclusions about ECL, you might think some nudging towards the DT views you favour is more or less justified. Maybe we can also find properties of DTs that we are more confident in (e.g. “does this or that in decision problem X”) than whole specified DTs, about which, yeah, I have no clue. Other than “probably not CDT.”
If the AI is uncontrolled/misaligned, there are things we can do to make it more likely it is interested in ECL, which I expect to be net good for the agents I try to acausally cooperate with. For example, maybe we can make misaligned AI’s utility function more likely to have diminishing returns or do something else that would make its values more porous. (I’m using the term in a somewhat broader way than Bostrom.)
This depends a lot on whether you think we have any influence over AIs we don’t fully control.
It might be important and mutable that future AIs don’t take any actions that decorrelate them with other agents (i.e. do things that decrease the AI’s acausal influence) before they discover and implement ECL. So, we might try to just make them aware of that early.
You might think that’s just not how correlation or updatelessness work, such that there’s no rush. Or that this is a potential source of value loss but a pretty negligible one.
Things that aren’t about making AIs more likely to do ECL: Something not mentioned, but there might be some trades that we have to do now. For example, maybe ECL makes it super important to be nice to AIs we’re training. (I mostly lean no on this question (at least for “super important”) but it’s confusing.) I also find it plausible we want to do ECL with other pre-ASI civilisations who might or might not succeed at alignment and, if we succeed and they fail, part-optimise for their values. It’s unclear to me whether this requires us to get people to spiritually commit to this now, before we know whether we’ll succeed at alignment or not. Or whether updatelessness somehow sorts this out, because if we (or the other civ) were to succeed at alignment, we would have seen that this is the right policy, and done this retroactively.