There is no such disagreement; you just can’t test all inputs. And without knowledge of how the internals work, you may be wrong when extrapolating alignment to future systems.
There are plenty of systems where we rationally form beliefs about likely outputs without a full understanding of how the system works. Weather prediction is an example.
What makes it rational is that there is an actual underlying hypothesis about how weather works, instead of a vague “LLMs are a lot like human uploads”. Weather prediction also outputs numbers connected to the reality we actually care about. And there is no credible alternative hypothesis under which weather prediction would fail.
I don’t want to totally dismiss empirical extrapolations, but given the stakes, I would personally prefer all sides to actually state their model of reality and how they think the evidence changed its plausibility, as formally as possible.
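
A minimal sketch of what “as formally as possible” could look like, assuming a simple Bayesian framing (the hypothesis and evidence names here are illustrative, not from the original discussion): let $H$ be “current alignment results generalize to future systems” and $E$ the empirical evidence observed so far, then each side would state its prior $P(H)$ and likelihoods $P(E \mid H)$, $P(E \mid \neg H)$, and update via

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)}.$$

The point of writing it out is only that disagreements can then be located in specific terms (priors vs. likelihoods) rather than in the extrapolation as a whole.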