The second sentence doesn’t seem to follow logically from the first.
True. I don’t know exactly how to convey this intuition. I think the part that seems very hard to do is to get the AI system to outperform humans at figuring out what is good to do. But this is not exactly true—certainly at chess an AI system can outperform us at choosing the best action. So maybe we could say that it’s very hard to get the AI system to outperform humans at figuring out the best criteria for deciding what is good to do. But I don’t really believe that either—I expect that AI systems will be able to do philosophy better than we can at some point.
Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct. (This very directly applies to the most ideal scenario that you laid out.)
This is not obvious to me. Can you elaborate or give an example?
A particular example from the post Skin in the Game (SSC): “A really good example of this is Hitchens’ waterboarding – he said he thought it was an acceptable interrogation technique, some people offered to waterboard him to see if he changed his mind, and he quickly did. I’m fascinated by this incident because it’s hard for me to understand, in a Mary’s Room style sense, what he learned from the experience. He must have already known it was very unpleasant – otherwise why would you even support it as useful for interrogation? But somehow there’s a difference between having someone explain to you that waterboarding is horrible, and undergoing it yourself.”
Also all of the worries about paternalism in the charity sector.
Also there’s the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn’t trust their judgment.
Idealized humans can be safer not just because they have better policies, but also because they have safer environments.
I was imagining that the better policies are a result of the safer environments. (Like, if you’re using something like IRL to infer values, then the safe environments cause different behavior and answers to questions, which in turn cause better inferred values.)
Changed “we change their policy in a hypothetical” to “we change their policy in a hypothetical by putting them in safer environments”.
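To make that causal chain concrete, here is a minimal toy sketch (my own illustration, not anything from the post or this thread; the features, option sets, and “true” weights are all invented) of how the same IRL-style inference can land on different values depending on which environment the demonstrations come from:

```python
# Toy sketch: the same IRL-style value inference applied to demonstrations
# from a "safe" environment vs. a "stressful" one. All numbers are invented.
import numpy as np

# Each option is described by two features: [short_term_payoff, long_term_wellbeing].
OPTIONS_SAFE = np.array([
    [0.2, 0.9],   # a reflective choice is available and easy to see
    [0.4, 0.7],
    [0.6, 0.5],
])
OPTIONS_STRESSED = np.array([
    [0.9, 0.1],   # tempting short-term options dominate under pressure
    [0.8, 0.2],
    [0.3, 0.3],
])

TRUE_WEIGHTS = np.array([0.3, 0.7])  # what the human reflectively cares about

def demonstrate(options, noise, rng):
    """A noisily rational demonstrator; worse environments add more noise."""
    utilities = options @ TRUE_WEIGHTS + noise * rng.standard_normal(len(options))
    return options[np.argmax(utilities)]

def infer_weights(demos):
    """Simplest IRL-style inference: treat normalized empirical feature
    expectations as the inferred reward weights (crude feature matching)."""
    feats = np.mean(demos, axis=0)
    return feats / feats.sum()

rng = np.random.default_rng(0)
safe_demos = [demonstrate(OPTIONS_SAFE, noise=0.05, rng=rng) for _ in range(200)]
stressed_demos = [demonstrate(OPTIONS_STRESSED, noise=0.5, rng=rng) for _ in range(200)]

print("inferred from safe environment:     ", infer_weights(safe_demos))
print("inferred from stressed environment: ", infer_weights(stressed_demos))
# The safe-environment demonstrations yield weights that favor long-term
# wellbeing (roughly what the demonstrator reflectively endorses); the
# stressed-environment demonstrations come out far more short-termist.
```

The inference procedure is held fixed in both cases; only the environment changes, and that alone shifts the inferred values toward or away from what the demonstrator reflectively endorses.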
Can you explain how one might train a human not to have the second problem in this post?
Good epistemics goes some of the way, but I mostly wasn’t thinking about those sorts of situations. I’ll comment on that post as well, let’s discuss further there.
ETA: Also, it seems like you’re saying that a small group of highly trained supervisors will act as gatekeepers to all AIs, and use their judgement to, for example, deny requests for technologies that they don’t think humans are ready for yet. Is that right?
Basically yes. I was imagining that each AI has its own highly trained supervisor(s). I’m not sure that the supervisors will be denying requests for technologies; it’s more that the AI will learn patterns of reasoning and thinking from the supervisors that cause it to actually evaluate whether humans are ready for some particular technology, rather than just giving it to humans without thinking about the safety implications. (This does require that the supervisors would make a similar decision if the problem were posed to them.)
Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct.
This doesn’t seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans? And even if the white-box metaphilosophical approach doesn’t work out, there are likely other ways to improve upon human performance, even if the result isn’t “completely correct”. (Did you perhaps interpret “address human safety problems” as “fully solve human safety problems”? I actually picked the word “address” over “solve” so as to not imply that a full solution must be possible.)
Personally, I’m pessimistic because there probably won’t be enough time to figure out and implement these ideas in a safe way before someone builds an unaligned or aligned-but-unsafe AI.
Also there’s the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn’t trust their judgment.
But the idealized humans are free to broaden their “distribution” (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won’t be forced to deal with strange new inputs before they are ready.
I was imagining that each AI has its own highly trained supervisor(s).
And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right? I guess this is another approach that should be considered, but don’t you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?
This doesn’t seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans?
I meant more that there has been a lot of effort put into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if we want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values. Now admittedly, in the past people weren’t specifically thinking about the problem of how to use a lot of computation to do this, but many of the thought experiments they propose seem to be of a similar nature (e.g., suppose that we were logically omniscient and knew the consequences of our beliefs).
It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.
But the idealized humans are free to broaden their “distribution” (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won’t be forced to deal with strange new inputs before they are ready.
Thanks, that clarifies things a lot. I take back the statement about trusting real humans more.
And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right?
Probably? I haven’t thought about it much.
don’t you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?
I agree this seems potentially problematic, but I’m not sure that it’s a deal-breaker.
I meant more that there has been a lot of effort into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if you want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values.
I don’t think it’s necessary for us to figure out what “good” is; instead, my “white-box metaphilosophical approach” is to figure out what “doing philosophy” is and program/teach an AI to “do philosophy”, which would let it figure out what “good” is on its own. I think there has not been a lot of effort invested into that problem, so there’s a somewhat reasonable chance that it might be tractable. It seems worth investigating, especially given that it might be the only way to fully solve the human safety problem.
It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.
To clarify, this is a separate approach from the white-box metaphilosophical approach. And I’m not sure whether I want the humans to discover their values through experience or through deliberation. I guess they should first deliberate on that choice and then do whatever they decide is best.