we haven’t seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game.
It’s a good point, but… we won’t see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we’ve let them run.
I think there are good reasons to view the effective horizon of different agents as part of their utility function. Given that, I think a lot of the risk we incur comes from humans acting as if we have short effective horizons. But I don’t think we *actually* do have such short horizons. In other words, our revealed preferences are more myopic than our considered preferences.
Now, one can say that this actually means we don’t care that much about the long-term future, but I don’t agree with that conclusion; I think we *do* care (at least, I do), but aren’t very good at acting as if we(/I) do.
Anyways, if you buy this line of argument about effective horizons, then you should be worried that we will easily be outcompeted by some process/entity that behaves as if it has a much longer effective horizon, so long as it also finds a way to make a “positive-sum” trade with us (e.g. “I take everything after 2200 A.D., and in the meantime, I give you whatever you want”).
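To make the horizon argument concrete, here is a minimal sketch that assumes exponential discounting, where a discount factor gamma corresponds to an effective horizon of roughly 1/(1 - gamma) steps. That model is an assumption for illustration only; nothing above commits to it. The function names and all the numbers (a cutoff 180 years out, a doubled payoff until then) are hypothetical.

```python
def discounted_value(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward stream."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def effective_horizon(gamma):
    """Rough number of steps that matter under exponential discounting."""
    return 1.0 / (1.0 - gamma)


# Hypothetical setup: one step per year, 1000 years considered in total.
CUTOFF = 180   # years until the other party "takes everything"
TOTAL = 1000

# Status quo: a steady payoff of 1 per year, forever (well, for TOTAL years).
status_quo = [1.0] * TOTAL
# The trade: double payoff until the cutoff, nothing afterwards.
trade = [2.0] * CUTOFF + [0.0] * (TOTAL - CUTOFF)


def prefers_trade(gamma):
    """True if an evaluator with this discount factor scores the trade higher."""
    return discounted_value(trade, gamma) > discounted_value(status_quo, gamma)


if __name__ == "__main__":
    for gamma in (0.95, 0.999):
        print(gamma, round(effective_horizon(gamma)), prefers_trade(gamma))
```

Under these made-up numbers, the evaluator with a horizon of about 20 years takes the trade and the one with a horizon of about 1000 years declines it. If our revealed preferences look like the first evaluator and our considered preferences look like the second, the trade seems positive-sum at the time we accept it and bad on reflection.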
===========================
I view the chess-playing algorithm as either *not* fully goal-directed, or somehow fundamentally limited in its understanding of the world or its level of rationality. Intuitively, it seems easy to make agents that are ignorant or indifferent(/“irrational”) in such a way that they will only seek to optimize things within the ontology we’ve provided (in this case, the chess game), instead of outside it (e.g. by seizing additional compute). However, our understanding of such things doesn’t seem mature... at least, I’m not satisfied with my current understanding. I think Stuart Armstrong and Tom Everitt are the main people who’ve done work in this area, and their work on this stuff seems quite underappreciated.
Intuitively, it seems easy to make agents that are ignorant or indifferent(/“irrational”) in such a way that they will only seek to optimize things within the ontology we’ve provided (in this case, the chess game), instead of outside it (e.g. by seizing additional compute).
It isn’t obvious to me that specifying the ontology is significantly easier than specifying the right objective. I have an intuition that ontological approaches are doomed. As a simple case, I’m not aware of any fundamental progress on building something that actually maximizes the number of diamonds in the physical universe, nor do I think that such a thing has a natural, simple description.
Diamond maximization seems pretty different from winning at chess. In the chess case, we’ve essentially hardcoded a particular ontology related to a particular imaginary universe, the chess universe. This isn’t a feasible approach for the diamond problem.
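As a toy illustration of what “hardcoding the ontology” looks like, here is a minimal game-tree search. It uses single-pile Nim as a stand-in for chess (my substitution, purely to keep the example short), and the function names are hypothetical. The point is only that the search ranks whatever `legal_moves` enumerates and nothing else.

```python
from functools import lru_cache

# Toy stand-in for chess: single-pile Nim. A state is the number of stones
# left; a move removes 1, 2, or 3 stones; whoever takes the last stone wins.


def legal_moves(stones):
    """The entire action space: the search cannot consider anything else."""
    return [k for k in (1, 2, 3) if k <= stones]


@lru_cache(maxsize=None)
def value(stones):
    """+1 if the player to move can force a win, else -1 (negamax)."""
    if stones == 0:
        # The previous player took the last stone, so the mover has lost.
        return -1
    return max(-value(stones - k) for k in legal_moves(stones))


def best_move(stones):
    """Pick the legal move that leaves the opponent in the worst position."""
    return max(legal_moves(stones), key=lambda k: -value(stones - k))


if __name__ == "__main__":
    print(best_move(7))
```

However long this runs, an action like “kill other processes to get more compute” is not something it can represent, because it is not in the set the game model enumerates. The sketch says nothing about a system that learns its own model of the world, and there is no analogous `legal_moves` to write down for the diamond problem.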
In any case, the reason this discussion is relevant, from my perspective, is because it’s related to the question of whether you could have a system which constructs its own superintelligent understanding of the world (e.g. using self-supervised learning), and engages in self-improvement (using some process analogous to e.g. neural architecture search) without being goal-directed. If so, you could presumably pinpoint human values/corrigibility/etc. in the model of the world that was created (using labeled data, active learning, etc.) and use that as an agent’s reward function. (Or just use the self-supervised learning system as a tool to help with FAI research/make a pivotal act/etc.)
It feels to me as though the thing I described in the previous paragraph is amenable to the same general kind of ontological whitelisting approach that we use for chess AIs. (To put it another way, I suspect most insights about meta-learning can be encoded without referring to a lot of object-level content about the particular universe you find yourself building a model of.) I do think there are some safety issues with the approach I described, but they seem like they can be overcome.
we won’t see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we’ve let them run.
Are you suggesting that Deep Blue would behave in this way if we gave it enough time to run? If so, can you explain the mechanism by which this would occur?
I think Stuart Armstrong and Tom Everitt are the main people who’ve done work in this area, and their work on this stuff seems quite underappreciated.
I strongly agree.
I should’ve been more clear.
I think this is a situation where our intuition is likely wrong.
This sort of thing is why I say “I’m not satisfied with my current understanding”.
Are you suggesting that Deep Blue would behave in this way if we gave it enough time to run? If so, can you explain the mechanism by which this would occur?
Can you share links?
I don’t know how Deep Blue worked. My impression is that it didn’t use learning, so the answer would be no.
A starting point for Tom and Stuart’s work: https://scholar.google.com/scholar?rlz=1C1CHBF_enCA818CA819&um=1&ie=UTF-8&lr&cites=1927115341710450492
BoMAI is in this vein as well (https://arxiv.org/pdf/1905.12186.pdf).