The main counter-arguments arise from the von Neumann-Morgenstern utility theorem (VNMUT), which can be interpreted as saying “rational agents are more fit” (in an evolutionary sense).
While I generally agree with CRT as applied to advanced agents, the VNM theorem is not the reason why, because it is vacuous in this setting. I agree with Steve that the real argument for it is that humans are more likely to build goal-directed agents, because that’s the only way we know how to get AI systems that do what we want. But we totally could build non-goal-directed agents that CRT doesn’t apply to, e.g. Google Maps.
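To spell out why I call the VNM theorem vacuous here (my gloss of the standard construction, not something from the post): for any deterministic policy \(\pi\), we can define a utility function over complete trajectories

\[ u_\pi(\tau) = \begin{cases} 1 & \text{if every action in } \tau \text{ agrees with } \pi, \\ 0 & \text{otherwise,} \end{cases} \]

and \(\pi\) trivially maximizes expected \(u_\pi\). So “behaves like an expected-utility maximizer” places no constraint on behavior by itself; the VNM axioms only bite once you pin down what the preferences are allowed to range over.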
I definitely want to distinguish CRT from arguments that humans will deliberately build goal-directed agents. But let me emphasize: I think incentives for humans to build goal-directed agents are a larger and more important source of risk than CRT.
RE VNMUT being vacuous: this is a good point (and also implied by the caveat from the reward modeling paper). But I think that in practice we can meaningfully identify goal-directed agents and infer their rationality/bias “profile”, as suggested by your work ( http://proceedings.mlr.press/v97/shah19a.html ) and Laurent Orseau’s ( https://arxiv.org/abs/1805.12387 ).
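To give a toy example of what I mean by inferring a rationality/bias “profile” (a deliberately simplified sketch, not the method from either linked paper; the Q-values and observed actions below are made-up numbers, and I’m assuming the action values are known rather than inferred): fit a Boltzmann inverse temperature to the agent’s observed choices and read it off as a crude rationality estimate.

```python
# A minimal sketch: maximum-likelihood estimate of a Boltzmann "rationality"
# parameter (inverse temperature) from an agent's observed choices.
# All numbers here are hypothetical, purely for illustration.
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed (known) action values for three states; in practice these would
# themselves have to be inferred, which is where the hard work is.
Q = np.array([
    [1.0, 0.2, 0.0],
    [0.0, 0.5, 0.4],
    [0.3, 0.3, 0.9],
])
# The agent picked the best action in states 0 and 2, but a slightly
# suboptimal one in state 1.
observed_actions = np.array([0, 2, 2])

def neg_log_likelihood(beta):
    # Boltzmann (softmax) policy: P(a | s) proportional to exp(beta * Q(s, a)).
    logits = beta * Q
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(observed_actions)), observed_actions].sum()

# beta near 0 looks like uniformly random behavior; large beta looks
# near-optimal for the assumed Q-values.
result = minimize_scalar(neg_log_likelihood, bounds=(0.0, 100.0), method="bounded")
print(f"estimated rationality (inverse temperature): {result.x:.2f}")
```

The linked papers do something much richer than this, but the basic move of fitting an explicit rationality model to observed behavior is the same.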
I guess my position is that CRT is only true to the extent that you build a goal-directed agent. (Technically, the inner optimizers argument is one way that CRT could be true even without building an explicitly goal-directed agent, but it seems like you view CRT as broader and more likely than inner optimizers, and I’m not sure how.)
Maybe another way to get at the underlying misunderstanding: do you see a difference between “convergent rationality” and “convergent goal-directedness”? If so, what is it? From what you’ve written they sound equivalent to me. ETA: Actually it’s more like “convergent rationality” and “convergent competent goal-directedness”.
That’s a reasonable position, but I think the reality is that we just don’t know. Moreover, it seems possible to build goal-directed agents that don’t become hyper-rational by (e.g.) restricting their hypothesis space. Lots of potential for deconfusion, IMO.
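To make “restricting the hypothesis space” slightly more concrete, here is a toy sketch (my own construction, not something from the post; all numbers are arbitrary). Two Bayesian predictors see the same persistent Markov sequence, but one is restricted to i.i.d. coin models: no amount of data gets it below log-loss ln 2, so the restricted class puts a hard cap on how “rational” it can become, even though it keeps doing the best it can within that class.

```python
# Toy comparison: a predictor with a deliberately restricted hypothesis space
# (i.i.d. coins) vs. one that can also represent first-order Markov structure.
import numpy as np

rng = np.random.default_rng(0)

# True environment: a persistent two-state binary sequence (first-order Markov).
P_STAY = 0.9
xs = [0]
for _ in range(20_000):
    xs.append(xs[-1] if rng.random() < P_STAY else 1 - xs[-1])
xs = np.array(xs)

def iid_log_loss(xs):
    """Laplace-smoothed i.i.d. Bernoulli predictor: the restricted hypothesis space."""
    ones, total, loss = 1.0, 2.0, 0.0
    for x in xs[1:]:
        p = ones / total
        loss += -np.log(p if x == 1 else 1.0 - p)
        ones += x
        total += 1.0
    return loss / (len(xs) - 1)

def markov_log_loss(xs):
    """Laplace-smoothed first-order Markov predictor: a richer hypothesis space."""
    ones = np.ones(2)          # counts of (prev -> 1) transitions, per previous symbol
    total = 2.0 * np.ones(2)   # total transitions seen from each previous symbol
    loss = 0.0
    for prev, x in zip(xs[:-1], xs[1:]):
        p = ones[prev] / total[prev]
        loss += -np.log(p if x == 1 else 1.0 - p)
        ones[prev] += x
        total[prev] += 1.0
    return loss / (len(xs) - 1)

print(f"restricted (i.i.d.) predictor, avg log-loss: {iid_log_loss(xs):.3f}")   # ~0.69 (= ln 2)
print(f"Markov-capable predictor, avg log-loss: {markov_log_loss(xs):.3f}")     # ~0.33 (entropy rate)
```

A capable goal-directed agent is obviously a much harder case, but the toy at least makes the shape of the claim concrete.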
EDIT: the above was in response to your first paragraph. I think I didn’t respond RE the 2nd paragraph because I don’t know what “convergent goal-directedness” refers to, and was planning to read your sequence but never got around to it.
> I don’t know what “convergent goal-directedness” refers to, and was planning to read your sequence but never got around to it.
I would guess that Chapter 2 of that sequence would be the most relevant and important piece of writing for you (w.r.t. this post in particular), though I’m not sure how relevant it actually is.