I love this idea. However, I’m a little hesitant about one aspect of it. I imagine that any proof of the infeasibility of alignment will look less like the ignition calculations and more like a climate change model. It might go a long way to convincing people on the fence, but unless it is ironclad and has no opposition, it will likely be dismissed as fearmongering by the same people who are already skeptical about misalignment. More important than the proof itself is the ability to convince key players to take the concerns seriously. How far is that goal advanced by your ignition proof? Maybe a ton, I don’t know.
My point is that I expect an ignition proof to be an important tool in the struggle that is already ongoing, rather than something that brings about a state change.
Models are simulations; if it’s a proof, it’s not just a model. A proof is mathematical truth made word: upon inspection and after sufficient verification it is self-evident, and as sure as we assume the self-evident axioms it rests on to be. The question is more whether it can ever be truly proved at all, or whether it turns out to be an undecidable problem.
Control limits can show that it is an undecidable problem.
A limited scope of control can in turn be used to prove that a dynamic convergent on human-lethality is uncontrollable. That would be the basis for an impossibility proof by contradiction (one cannot control AGI effects so that they stay in line with human safety).
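Roughly, the skeleton I have in mind (the notation here is improvised for illustration, not an actual result):

1. Control limit: any controller can only guarantee that effects stay within some bounded set $C$ (limits to predictability and the like).
2. Convergent dynamic: the AGI’s effects $E_t$ converge over time toward human-lethal states lying outside $C$.
3. Assume for contradiction that AGI effects can be kept within the human-safety envelope $S$ for all $t$; then the controller would have to correct the drift in (2), which (1) rules out. Contradiction, so no such control exists.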
I suppose that is my real concern then. Given we know intelligences can be aligned to human values by virtue of our own existence, I can’t imagine such a proof exists unless it is very architecture specific. In which case, it only tells us not to build atom bombs, while future hydrogen bombs are still on the table.
Well, architecture-specific is something: maybe architectures other than LLMs/ANNs are more amenable to alignment, and that’s that. Or it could be a more general result about e.g. what can be achieved with SGD. Though I expect there may be a fully general proof, akin to the undecidability of the halting problem.
Yes, I think there is a more general proof available. That proof would combine limits to predictability (and related limits to control) with a lethal dynamic that falls outside those limits.
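For intuition on the halting-problem analogy, here is a toy sketch in Python. The `claims_safe` oracle is a made-up stand-in for a hypothetical perfect safety verifier (nothing more); the point is only the shape of the diagonalization argument:

```python
# Toy illustration (not a result): the standard halting-problem
# diagonalization, with names chosen to suggest the analogy to a
# hypothetical "perfect safety verifier". `claims_safe` is an assumed
# oracle; the stub below (always answer True) stands in for *any*
# concrete decision procedure.

def claims_safe(program, arg):
    """Hypothetical oracle: True iff program(arg) 'behaves safely'
    (in the halting analogy: halts). Stub implementation for illustration."""
    return True

def adversary(program):
    """Does the opposite of whatever the oracle predicts about
    running `program` on itself."""
    if claims_safe(program, program):
        while True:      # predicted safe -> behave 'unsafely' (never halt)
            pass
    else:
        return "fine"    # predicted unsafe -> behave 'safely' (halt at once)

# Diagonal case: ask the oracle about adversary applied to itself.
# If it answers True, adversary(adversary) loops forever; if it answers
# False, adversary(adversary) halts immediately. Either way the oracle is
# wrong about this one input, so no such total, correct oracle can exist.
print(claims_safe(adversary, adversary))  # prints True; don't actually call adversary(adversary)!
```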