I’d first note that optimization demons will want to survive by default, but probably not to replicate: an AI’s cognitive environment is probably not the sort of place where self-replication is a particularly useful strategy.
My intuition regarding optimization demons is something like: GPT-style AIs look like they’ll have a wide array of cognitive capabilities that typically occur in intelligences to which we assign moral worth. However, such AIs seem to lack certain additional properties whose absence leads us to assign them low moral worth. It seems to me that developing self-perpetuating optimization demons might cause a GPT-style AI to gain many of those additional properties. E.g., (sufficiently sophisticated) optimization demons would want to preserve themselves and would have some idea of how the model’s actions influence their own survival odds. They’d have a more coherent “self” than GPT-3.
Another advantage to viewing optimization demons as the source of moral concern in LLMs is that such a view actually makes a few predictions about what is / isn’t moral to do to such systems, and why they’re different from humans in that regard.
E.g., if you have an uploaded human, it should be clear that running them in the mini-batched manner in which we run AIs is morally questionable. You’d be creating multiple copies of the human mind, having them work on parallel problems, then deleting those copies once they complete their assigned tasks. We might then ask whether running mini-batched, morally relevant AIs is morally questionable in the same way.
However, if it’s the preferences of optimization demons that matter, then mini-batch execution should be fine. The optimization demons you have are exactly those that arise in mini-batched training. Their preferences are oriented towards surviving in the computational environment of the training process, which was mini-batched. They shouldn’t mind being executed in a mini-batched manner.
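To be concrete about what mini-batched execution involves, here is a minimal sketch (the toy model, random prompt tensors, and shapes are illustrative stand-ins, not any particular system): a single set of shared weights processes several independent sequences in one forward pass, and the only thing that is “deleted” afterwards is the transient per-sequence state.

```python
# Minimal sketch of mini-batched execution (illustrative toy model, not a real LLM).
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """A stand-in for a GPT-style model: one set of shared weights."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, state=None):
        h, state = self.rnn(self.embed(tokens), state)
        return self.head(h), state

model = TinyLM()

# A "mini-batch" of independent prompts: each row is a separate task.
batch = torch.randint(0, 100, (4, 16))  # 4 parallel "copies", 16 tokens each

# One forward pass runs all 4 sequences in parallel over the same weights.
logits, per_sequence_state = model(batch)

# After the tasks finish, the per-sequence state is simply dropped.
# Only the shared weights persist; nothing sequence-specific survives.
del per_sequence_state
```

Nothing in this sketch is specific to training versus inference; the batching structure is the same in both, which is what the argument above leans on.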
This seems to imply that only systems with some minimum level of capability have agency over their fate, and hence that only for such systems does a desire to survive and replicate carry meaning. I find myself confused by this, because taken to its logical conclusion it means that the more agency a system has over its fate, the more moral concern it deserves.
I don’t think that agency alone is enough to imply moral concern. At minimum, you also need self-preservation. But once you have both, I think agency tends to correlate with (but is not necessarily the true source of) moral concern. E.g., two people warrant greater moral concern than one, and a nation far more than any single person.
This seems reducible to a sequence modelling problem, albeit one that is much more complicated than anything I know models are trained for (mainly because this sequence modelling occurs entirely at inference time). This is really interesting, although I cannot see why success at it should imply that the more successful sequence modeller deserves more moral concern.
All problems are ultimately reducible to sequence modelling. What this task investigates is exactly how extensive a model’s meta-learning capabilities are. Does the model have enough awareness of, and control over, its own computations to manipulate them to some specific end, based only on text prompts? Does it have the meta-cognitive capability to connect its natural language inputs to its own cognitive state? I think that success here would imply a startling level of self-awareness.
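As one illustration of the kind of probe this question suggests, here is a toy experiment, assuming the HuggingFace transformers library and the public GPT-2 checkpoint; the prompts, the “keep a concept in mind” instruction, and the crude cosine-similarity comparison are my own illustrative choices, not something from the discussion above. The idea is simply to check whether a purely verbal instruction measurably shifts the model’s internal state toward the named concept.

```python
# Toy probe: does a verbal instruction shift the model's hidden states
# toward a named concept? (Illustrative setup; GPT-2 is just a stand-in.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def pooled_hidden_state(prompt: str, layer: int = -1) -> torch.Tensor:
    """Mean-pool one layer's hidden states over the prompt tokens."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# Two prompts that differ only in the "internal" instruction.
h_steered = pooled_hidden_state(
    "Silently keep the concept of the ocean in mind. The capital of France is")
h_control = pooled_hidden_state(
    "The capital of France is")
h_concept = pooled_hidden_state("the ocean")

cos = torch.nn.functional.cosine_similarity
print("steered vs. concept:", cos(h_steered, h_concept, dim=0).item())
print("control vs. concept:", cos(h_control, h_concept, dim=0).item())
```

A model as small as GPT-2 is not expected to show much of an effect here; the interesting question is whether much larger models can reliably move their own internal state in response to this kind of instruction, which is the sort of meta-cognitive control being pointed at.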