Coming at it from another angle, I think I would say that thousands of years of human intellectual effort has been put into trying to figure out what is good to do, without much consensus. It seems very hard to build an AI system that is going to outperform all of that intellectual effort and be completely correct.
This doesn’t seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans? And even if the white-box metaphilosophical approach doesn’t work out, there are likely other ways to improve upon human performance, even if the result isn’t “completely correct”. (Did you perhaps interpret “address human safety problems” as “fully solve human safety problems”? I actually picked the word “address” over “solve” so as to not imply that a full solution must be possible.)
Personally, I’m pessimistic because there probably won’t be enough time to figure out and implement these ideas in a safe way before someone builds an unaligned or aligned-but-unsafe AI.
Also there’s the simple distributional shift argument: the circumstances that real humans are in form a distributional shift for the idealized humans, so we shouldn’t trust their judgment.
But the idealized humans are free to broaden their “distribution” (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won’t be forced to deal with strange new inputs before they are ready.
I was imagining that each AI has its own highly trained supervisor(s).
And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right? I guess this is another approach that should be considered, but don’t you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?
This doesn’t seem to follow logically either. Thousands of years of intellectual effort amounts to a tiny fraction of the computation that the universe could do. Why is that a reason to think it very hard to outperform humans?
I meant more that there has been a lot of effort into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if you want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values. Now admittedly in the past people weren’t specifically thinking about the problem of how to use a lot of computation to do this, but many of the thought experiments they proposed seem to be of a similar nature (e.g., suppose that we were logically omniscient and knew the consequences of our beliefs).
It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.
But the idealized humans are free to broaden their “distribution” (e.g., experience waterboarding), when they figure out how to do that in a safe way. The difference is that unlike the real humans, they won’t be forced to deal with strange new inputs before they are ready.
Thanks, that clarifies things a lot. I take back the statement about trusting real humans more.
And these AIs/supervisors also act as a cabal to stop anyone else from running an AI without going through the same training, right?
Probably? I haven’t thought about it much.
don’t you think putting humans in such positions of power in itself carries a high risk of corruption, and that it will be hard to come up with training that can reliably prevent such corruption?
I agree this seems potentially problematic, but I’m not sure that it’s a deal-breaker.
I meant more that there has been a lot of effort into figuring out how to normatively talk about what “good” is, and it seems necessary for us to figure that out if you want to write down code that we can know beforehand will correctly extrapolate our values even though it cannot rely on us avoiding corrupted values.
I don’t think it’s necessary for us to figure out what “good” is; instead, my “white-box metaphilosophical approach” is to figure out what “doing philosophy” is and then program/teach an AI to “do philosophy”, which would let it figure out what “good” is on its own. I think there has not been a lot of effort invested in that problem, so there’s a somewhat reasonable chance that it might be tractable. It seems worth investigating, especially given that it might be the only way to fully solve the human safety problem.
It’s possible that this is mostly me misunderstanding terminology. Previously it sounded like you were against letting humans discover their own values through experience and instead having them figure it out through deliberation and reflection from afar. Now it sounds like you actually do want humans to discover their own values through experience, but you want them to be able to control the experiences and take them at their own pace.
To clarify, this is a separate approach from the white-box metaphilosophical approach. And I’m not sure whether I want the humans to discover their values through experience or through deliberation. I guess they should first deliberate on that choice and then do whatever they decide is best.