A major issue with this topic is the way LLM simulacra are not like other hypothetical AGIs. For an arbitrary AGI, there is no reason to expect it to do anything remotely reasonable, and in principle it could be pursuing any goal with unholy intensity (orthogonality thesis). We start with something that’s immensely dangerous and can’t possibly be of use in its original form. So there are all these ideas floating around about how to point it in useful directions, in a way that lets us keep our atoms; that’s AI alignment as normally understood.
But an LLM simulacrum is more like an upload, a human imitation that’s potentially clear-headed enough to make the kinds of decisions and research progress that a human might, faster (because computers are not made out of meat). Here, we start with something that might be OK in its original form, and any interventions that move it away from that are conducive to making it a dangerous alien, or insane, or just less inclined to be cooperative. Hence improvements in thingness of simulacra might help, while slicing around in their minds with the RLHF icepick might bring this unexpected opportunity to ruin.