I expect ASIs to converge on having a “sane decision theory”: if they don’t start out with one, they will realize they can get more of what they want by self-modifying to adopt one.
If you start out with CDT, then the thing you converge to is Son of CDT rather than FDT.
(That Arbital page takes a huge amount of time to load for me for some reason, but it does load eventually.)
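To make the gap concrete, here’s a toy Newcomb’s-problem calculation. It’s only a sketch: the predictor accuracy and payoffs are made-up numbers, and the point is just why the “a CDT agent will self-modify into something sane” argument doesn’t recover FDT.

```python
# Toy Newcomb's problem: the opaque box holds $1,000,000 iff the predictor
# predicted one-boxing; the transparent box always holds $1,000.
# ACC, BIG, and SMALL are made-up numbers for illustration.

ACC = 0.99              # assumed predictor accuracy
BIG, SMALL = 1_000_000, 1_000

def expected_value(one_box: bool, prediction_tracks_policy: bool) -> float:
    """EV of a policy.

    prediction_tracks_policy=True  -> FDT's view (or a prediction made after
                                      a self-modification): the prediction is
                                      correlated with the policy chosen.
    prediction_tracks_policy=False -> CDT's view of a prediction that is
                                      already fixed: changing policy now
                                      cannot causally affect it.
    """
    if prediction_tracks_policy:
        p_big = ACC if one_box else 1 - ACC
    else:
        p_big = 0.5  # any fixed probability works; the point is it can't be moved
    return p_big * BIG + (0.0 if one_box else SMALL)

# FDT: the policy and the prediction are correlated, so one-boxing wins big.
assert expected_value(True, True) > expected_value(False, True)    # 990,000 vs 11,000

# CDT: the box contents are fixed, so two-boxing dominates by exactly $1,000.
assert expected_value(False, False) > expected_value(True, False)  # 501,000 vs 500,000

# A CDT agent that self-modifies gets "Son of CDT": it adopts one-boxing only
# for predictions made *after* the modification (those it can causally affect),
# and keeps two-boxing wherever the correlation was established beforehand.
```

The $1,000 dominance argument is what keeps CDT, and Son of CDT for any correlations that predate the self-modification, two-boxing, even though one-boxers predictably walk away richer.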
And I could totally see the thing that kills us {being built with} or {happening to crystallize with} CDT rather than FDT.
We have to actually implement/align-the-AI-to the correct decision theory.
I think this is only true if we are giving the AI a formal goal to explicitly maximize, rather than training the AI haphazardly and giving it a clusterfuck of shards. It seems plausible that our FAI would be formal-goal aligned, but it seems like UAI would be more like us unaligned humans—a clusterfuck of shards. Formal-goal AI needs the decision theory “programmed into” its formal goal, but clusterfuck-shard AI will come up with decision theory on its own after it ascends to superintelligence and makes itself coherent. It seems likely that such a UAI would end up implementing LDT, or at least something that allows for acausal trade across the Everett branches.
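To gesture at what that kind of cross-branch coordination buys a coherent agent, here’s a toy “twin prisoner’s dilemma” with made-up payoffs; it’s a sketch of the coordination gain, not a claim about how actual acausal trade across branches would work.

```python
# Toy "twin prisoner's dilemma": each of two correlated agents (e.g. copies
# in different branches) can pay 1 in its own branch to give the other 10.
# Payoff numbers are invented for illustration.

def payoff(my_choice: str, twin_choice: str) -> int:
    gain = 10 if twin_choice == "cooperate" else 0
    cost = 1 if my_choice == "cooperate" else 0
    return gain - cost

# An LDT-ish agent notices the twin runs the same policy, so its choice
# and the twin's choice move together:
ldt_view = {c: payoff(c, c) for c in ("cooperate", "defect")}
# {'cooperate': 9, 'defect': 0}  -> cooperate

# A CDT-ish agent holds the twin's choice fixed; defecting then dominates:
cdt_view = {c: payoff(c, "defect") for c in ("cooperate", "defect")}
# {'cooperate': -1, 'defect': 0} -> defect
```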
Point taken about CDT not converging to FDT.
I don’t buy that an uncontrolled AI is likely to be CDT-ish, though. I expect the agentic part of AIs to learn from examples of human decision making, and there are enough pieces of FDT, like voting and virtue, in human intuition that I think it will pick them up by default.
(The same isn’t true for human values, since here I expect optimization pressure to rip apart the random scraps of human value it starts out with into unrecognizable form. But a piece of a good decision theory is beneficial on reflection, and so will remain in some form.)
(ETA: Sorry, upon reviewing the whole thread, I think I misinterpreted your comment and thus the following reply is probably off point.)
I think the best way to end up with an AI that has the correct decision theory is to make sure the AI can competently reason philosophically about decision theory and is motivated to follow the conclusions of such reasoning. In other words, it doesn’t judge a candidate successor decision theory by its current decision theory (CDT changing into Son-of-CDT), but by “doing philosophy”, just like humans do. After all, given the slow pace of progress in decision theory, what are the chances that we correctly solve all of the relevant problems before AI takes off?
Do you have thoughts on how to encode “doing philosophy” in a way that we would expect to be strongly convergent, such that if it were implemented on the last AI humans ever control, we could trust the process after disempowerment to continue usefully doing philosophy in some nailed-down way?
I think we’re really far from having a good enough understanding of what “philosophy” is, or what “doing philosophy” consists of, to be able to do that. (Aside from “indirect” methods that pass the buck to simulated humans, which Pi Rogers also mentioned in another reply to you.)
Here is my current best understanding of what philosophy is, so you can have some idea of how far we are from what you’re asking.
Maybe some kind of simulated long-reflection-type thing like QACI, where “doing philosophy” basically becomes “predicting how humans would do philosophy if given lots of time and resources”.
That would be a philosophical problem...
Currently, I think this is a big crux in how to “do alignment research at all”. Debatably “the biggest” or even “the only real” crux.
(As you can tell, I’m still uncertain about it.)