I suppose that is my real concern, then. Given that we know intelligences can be aligned to human values, by virtue of our own existence, I can’t imagine such a proof exists unless it is very architecture-specific. In which case, it only tells us not to build atom bombs, while future hydrogen bombs are still on the table.
Well, an architecture-specific result is still something: maybe architectures other than LLMs/ANNs are more amenable to alignment, and that’s that. Or it could be a more general result about, e.g., what can be achieved with SGD. Though I expect there may also be a fully general proof, akin to the undecidability of the halting problem.
Yes, I think there is a more general proof available. This form of proof would combine limits to predictability, and so on, with a lethal dynamic that falls outside those limits.