I’m 60% confident that SBF and Mao Zedong (and just about everyone) would converge to nearly the same values (which we call “human values”) if they were rational enough and had good enough decision theory.
If I’m wrong, (1) is a huge problem and the only surefire way to solve it is to actually be the human whose values get extrapolated. Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.
I think (2) is a very human problem. Due to very weird selection pressure, humans ended up really smart but also really irrational. I think most human evil is caused by a combination of overconfidence wrt our own values and lack of knowledge of things like the unilateralist’s curse. An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human. (Also 60% confident. I would not want to stake the fate of the universe on this claim)
I agree that moral uncertainty is a very hard problem, but I don’t think we humans can do any better on it than an ASI. As long as we give it the right pointer, I think it will handle the rest much better than any human could. Decision theory is a bit different, since you have to put that into the utility function. Dealing with moral uncertainty is just part of expected utility maximization.
To solve (2), I think we should try to adapt something like the Hippocratic principle to work for QACI, without requiring direct reference to a human’s values and beliefs (the sidestepping of which is QACI’s big advantage over PreDCA). I wonder if Tammy has thought about this.
Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.
But we could have said the same thing of SBF, before the disaster happened.
Due to very weird selection pressure, humans ended up really smart but also really irrational. [...] An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human.
Please explain your thinking behind this?
Dealing with moral uncertainty is just part of expected utility maximization.
It’s not, because some moral theories are not compatible with EU maximization, and of the ones that are, it’s still unclear how to handle uncertainty between them.
But we could have said the same thing of SBF, before the disaster happened.
I would honestly be pretty comfortable with maximizing SBF’s CEV.
Please explain your thinking behind this?
TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don’t incentivize irrationality (like ours did).
Sorry if I was unclear there.
It’s not, because some moral theories are not compatible with EU maximization.
I’m pretty confident that my values satisfy the VNM axioms, so those moral theories are almost definitely wrong.
And I think this uncertainty problem can be solved by forcing utility bounds.
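As a rough illustration of what "forcing utility bounds" could mean here, the sketch below rescales each moral theory's utilities to [0, 1] before taking a credence-weighted expectation. This is only one possible reading: the range normalization, the theory names, the credences, and the numbers are all assumptions made up for the example, and it does nothing for theories that can't be expressed as utility functions in the first place.

```python
def normalize(utilities):
    """Rescale one theory's utilities over the options to the range [0, 1]."""
    lo, hi = min(utilities.values()), max(utilities.values())
    if hi == lo:
        # The theory is indifferent between all options.
        return {option: 0.0 for option in utilities}
    return {option: (u - lo) / (hi - lo) for option, u in utilities.items()}


def best_option(credences, theories, bounded=True):
    """Pick the option with the highest credence-weighted expected utility.

    credences: theory name -> probability that the theory is correct
    theories:  theory name -> {option -> utility under that theory}
    bounded:   if True, normalize each theory to [0, 1] first
    """
    options = list(next(iter(theories.values())).keys())
    scores = {}
    for option in options:
        scores[option] = sum(
            credences[name] * (normalize(us) if bounded else us)[option]
            for name, us in theories.items()
        )
    return max(scores, key=scores.get)


# Hypothetical numbers: a 1%-credence theory assigns an enormous penalty to "act".
credences = {"theory_a": 0.99, "theory_b": 0.01}
theories = {
    "theory_a": {"act": 1.0, "abstain": 0.0},
    "theory_b": {"act": -1e6, "abstain": 0.0},
}

unbounded_choice = best_option(credences, theories, bounded=False)
bounded_choice = best_option(credences, theories, bounded=True)
```

With raw utilities, the low-credence theory dominates the decision purely through the size of its numbers; once every theory is bounded to the same range, the choice tracks the credences instead. Whether that is the right way to compare theories is exactly the open question.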
I would honestly be pretty comfortable with maximizing SBF’s CEV.
Yikes, I’m not even comfortable maximizing my own CEV. One crux may be that I think a human’s values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn’t have trusted his future self.)
My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human’s reflection process which may be mistaken or selfish or skewed.
TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don’t incentivize irrationality (like ours did).
Where is the longer version of this? I do want to read it. :) Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn’t RL environments for AI cause the same or perhaps a different set of irrationalities?
Also, how does RL fit into QACI? Can you point me to where this is discussed?
Where is the longer version of this? I do want to read it. :)
Well perhaps I should write it :)
Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn’t RL environments for AI cause the same or perhaps a different set of irrationalities?
Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that’s how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases. I think Yudkowsky and/or Hanson have written about this.
Unless we do a very stupid thing like reading the AI’s thoughts and RL-punishing wrongthink, this seems very unlikely to happen.
If we give the AI no reason to self-deceive, the natural instrumentally convergent incentive is to not self-deceive, so it won’t self-deceive.
Again, though, I’m not super confident in this. Deep deception or similar could really screw us over.
Also, how does RL fit into QACI? Can you point me to where this is discussed?
I have no idea how Tammy plans to “train” the inner-aligned singleton on which QACI is implemented, but I think it will be closer to RL than SL in the ways that matter here.
It seems like someone could definitely be wrong about what they want (unless normative anti-realism is true and such a sentence has no meaning). For example consider someone who thinks it’s really important to be faithful to God and goes to church every Sunday to maintain their faith and would use a superintelligent religious AI assistant to help keep the faith if they could. Or maybe they’re just overconfident about their philosophical abilities and would fail to take various precautions that I think are important in a high-stakes reflective process.
Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that’s how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases.
Are you imagining that the RL environment for AIs will be single-player, with no social interactions? If yes, how will they learn social skills? If no, why wouldn’t the same thing happen to them?
Unless we do a very stupid thing like reading the AI’s thoughts and RL-punishing wrongthink, this seems very unlikely to happen.
We already RL-punish AIs for saying things that we don’t like (via RLHF), and in the future will probably punish them for thinking things we don’t like (via things like interpretability). Not sure how to avoid this (given current political realities) so safety plans have to somehow take this into account.
What do you think of this post by Tammy?