It seems quite difficult to me to build AI systems that are safe, without having them rely on humans in some way to determine what is and is not good behavior. As a result, I’m pessimistic about our chances at creating AI systems that can address “human safety problems”.
The second sentence doesn’t seem to follow logically from the first. There are many ways that an AI can rely on humans to determine what is and is not good behavior, some of which put more stress on human safety than others. At the most ideal (least reliant on humans being safe) end of the spectrum, we could solve metaphilosophy in a white-box way, and have the AI extract true/normative values from humans using its own philosophical understanding of what that means.
Aside from the “as a result” part, I’m also pessimistic, but I don’t see a reason to be so pessimistic that the current level of neglect could be justified.
in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).
This is not obvious to me. Can you elaborate or give an example?
Idealized humans can be thought of as an instance of this approach where, rather than change the policy of humans in reality, we change their policy in a hypothetical.
Idealized humans can be safer not just because they have better policies, but also because they have safer environments.
Overall, I’m hoping that we can solve “human safety problems” by training the humans supervising the AI to not have those problems
Can you explain how one might train a human not to have the second problem in this post?
ETA: Also, it seems like you’re saying that a small group of highly trained supervisors will act as gatekeepers to all AIs, and use their judgement to, for example, deny requests for technologies that they don’t think humans are ready for yet. Is that right?
The second sentence doesn’t seem to follow logically from the first.
True. I don’t know exactly how to convey this intuition. I think the part that seems very hard to do is to get the AI system to outperform humans at figuring out what is good to do. But this is not exactly true: certainly at chess an AI system can outperform us at choosing the best action. So maybe we could say that it’s very hard to get the AI system to outperform humans at figuring out the best criteria for deciding what is good to do. But I don’t really believe that either: I expect that AI systems will be able to do philosophy better than we will at some point.
Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct. (This very directly applies to the most ideal scenario that you laid out.)
This is not obvious to me. Can you elaborate or give an example?
A particular example from the post (Skin in the Game, SSC): “A really good example of this is Hitchens’ waterboarding – he said he thought it was an acceptable interrogation technique, some people offered to waterboard him to see if he changed his mind, and he quickly did. I’m fascinated by this incident because it’s hard for me to understand, in a Mary’s Room style sense, what he learned from the experience. He must have already known it was very unpleasant – otherwise why would you even support it as useful for interrogation? But somehow there’s a difference between having someone explain to you that waterboarding is horrible, and undergoing it yourself.”
Also all of the worries about paternalism in the charity sector.
Also there’s the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn’t trust their judgment.
Idealized humans can be safer not just because they have better policies, but also because they have safer environments.
I was imagining that the better policies are a result of the safer environments. (Like, if you’re using something like IRL to infer values, then the safe environments cause different behavior and answers to questions, which cause better inferred values.)
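As a purely illustrative toy sketch of that effect (assuming Boltzmann-rational demonstrations over two made-up reward features and a simple maximum-likelihood fit; none of the names or numbers come from a real proposal): the same underlying values produce different behavior under environmental pressure, and a naive IRL-style fit reads that behavioral shift as a difference in values.

```python
# Toy sketch: how the environment a human is observed in can change what an
# IRL-style procedure infers about their values. All feature names and numbers
# are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Each available action is described by two reward features:
# (benefit_to_others, self_preservation).
ACTIONS = np.array([
    [1.0, 0.0],   # altruistic action
    [0.5, 0.5],   # balanced action
    [0.0, 1.0],   # purely self-protective action
])

def demonstrate(true_weights, pressure, n=1000):
    """Sample Boltzmann-rational choices. `pressure` models an unsafe
    environment: it shifts the human's *policy* toward self-preservation
    even though their underlying values stay the same."""
    effective = true_weights + np.array([0.0, pressure])
    logits = ACTIONS @ effective
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(ACTIONS), size=n, p=probs)

def infer_weights(demos, steps=5000, lr=0.1):
    """Maximum-likelihood estimate of the reward weights from observed
    choices (gradient ascent on the average log-likelihood)."""
    w = np.zeros(2)
    freqs = np.bincount(demos, minlength=len(ACTIONS)) / len(demos)
    for _ in range(steps):
        logits = ACTIONS @ w
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        w += lr * (ACTIONS.T @ (freqs - probs))
    return w

true_weights = np.array([1.0, 0.2])  # the human mostly values helping others

for name, pressure in [("safe environment", 0.0), ("unsafe environment", 2.0)]:
    w = infer_weights(demonstrate(true_weights, pressure))
    # The weights are only identified up to adding the same constant to both,
    # so compare the inferred gap between the two features.
    print(f"{name}: inferred (helping - self-preservation) = {w[0] - w[1]:.2f}")
```

In this toy setup, the safe-environment fit roughly recovers the human’s actual preference for helping over self-preservation, while the high-pressure fit attributes the pressure-driven behavior to the human’s values themselves.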
Changed “we change their policy in a hypothetical” to “we change their policy in a hypothetical by putting them in safer environments”.
Can you explain how one might train a human not to have the second problem in this post?
Good epistemics goes some of the way, but I mostly wasn’t thinking about those sorts of situations. I’ll comment on that post as well; let’s discuss further there.
ETA: Also, it seems like you’re saying that a small group of highly trained supervisors will act as gatekeepers to all AIs, and use their judgement to, for example, deny requests for technologies that they don’t think humans are ready for yet. Is that right?
Basically yes. I was imagining that each AI has its own highly trained supervisor(s). I’m not sure that the supervisors will be denying requests for technologies; it’s more that the AI will learn patterns of reasoning and thinking from the supervisors that cause it to actually evaluate whether humans are ready for some particular technology, rather than just giving it to humans without thinking about the safety implications. (This does require that the supervisors would make a similar decision if the problem were posed to them.)
Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct.
This doesn’t seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans? And even if the white-box metaphilosophical approach doesn’t work out, there are likely other ways to improve upon human performance, even if the result isn’t “completely correct”. (Did you perhaps interpret “address human safety problems” as “fully solve human safety problems”? I actually picked the word “address” over “solve” so as to not imply that a full solution must be possible.)
Personally, I’m pessimistic because there probably won’t be enough time to figure out and implement these ideas in a safe way before someone builds an unaligned or aligned-but-unsafe AI.
Also there’s the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn’t trust their judgment.
But the idealized humans are free to broaden their “distribution” (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won’t be forced to deal with strange new inputs before they are ready.
I was imagining that each AI has its own highly trained supervisor(s).
And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right? I guess this is another approach that should be considered, but don’t you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?
This doesn’t seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans?
I meant more that there has been a lot of effort put into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if we want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values. Now admittedly, in the past people weren’t specifically thinking about the problem of how to use a lot of computation to do this, but many of the thought experiments they proposed seem to be of a similar nature (e.g., suppose that we were logically omniscient and knew the consequences of our beliefs).
It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.
But the idealized humans are free to broaden their “distribution” (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won’t be forced to deal with strange new inputs before they are ready.
Thanks, that clarifies things a lot. I take back the statement about trusting real humans more.
And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right?
Probably? I haven’t thought about it much.
don’t you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?
I agree this seems potentially problematic, but I’m not sure that it’s a deal-breaker.
I meant more that there has been a lot of effort put into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if we want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values.
I don’t think it’s necessary for us to figure out what “good” is. Instead, my “white-box metaphilosophical approach” is to figure out what “doing philosophy” is, then program/teach an AI to “do philosophy”, which would let it figure out what “good” is on its own. I think there has not been a lot of effort invested into that problem, so there’s a somewhat reasonable chance that it might be tractable. It seems worth investigating, especially given that it might be the only way to fully solve the human safety problem.
It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.
To clarify, this is a separate approach from the white-box metaphilosophical approach. And I’m not sure whether I want the humans to discover their values through experience or through deliberation. I guess they should first deliberate on that choice and then do whatever they decide is best.