I would honestly be pretty comfortable with maximizing SBF’s CEV.
Yikes, I’m not even comfortable maximizing my own CEV. One crux may be that I think a human’s values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn’t have trusted his future self.)
My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human’s reflection process which may be mistaken or selfish or skewed.
TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don’t incentivize irrationality (like ours did).
Where is the longer version of this? I do want to read it. :) Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn’t RL environments for AI cause the same or perhaps a different set of irrationalities?
Also, how does RL fit into QACI? Can you point me to where this is discussed?
Where is the longer version of this? I do want to read it. :)
Well perhaps I should write it :)
Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn’t RL environments for AI cause the same or perhaps a different set of irrationalities?
Mostly that thing where we had a lying vs lie-detecting arms race, and the liars mostly won by believing their own lies, and that’s how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases. I think Yudkowsky and/or Hanson have written about this.
Unless we do a very stupid thing like reading the AI’s thoughts and RL-punishing wrongthink, this seems very unlikely to happen.
If we give the AI no reason to self-deceive, the natural instrumentally convergent incentive is to not self-deceive, so it won’t self-deceive.
Again, though, I’m not super confident in this. Deep deception or similar could really screw us over.
Also, how does RL fit into QACI? Can you point me to where this is discussed?
I have no idea how Tammy plans to “train” the inner-aligned singleton on which QACI is implemented, but I think it will be closer to RL than SL in the ways that matter here.
It seems like someone could definitely be wrong about what they want (unless normative anti-realism is true and such a sentence has no meaning). For example, consider someone who thinks it’s really important to be faithful to God and goes to church every Sunday to maintain their faith and would use a superintelligent religious AI assistant to help keep the faith if they could. Or maybe they’re just overconfident about their philosophical abilities and would fail to take various precautions that I think are important in a high-stakes reflective process.
Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that’s how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases.
Are you imagining that the RL environment for AIs will be single-player, with no social interactions? If yes, how will they learn social skills? If no, why wouldn’t the same thing happen to them?
Unless we do a very stupid thing like reading the AI’s thoughts and RL-punishing wrongthink, this seems very unlikely to happen.
We already RL-punish AIs for saying things that we don’t like (via RLHF), and in the future will probably punish them for thinking things we don’t like (via things like interpretability). Not sure how to avoid this (given current political realities) so safety plans have to somehow take this into account.
What do you think of this post by Tammy?