(Maybe they didn’t recognize the possibility of inventing new super-duper-heat-resistant ceramic tiles, or whatever.) And then they would wind up overly pessimistic.
Basically, this is what I think happened with AI alignment: replace ridiculously good heat-resistant tiles with Pretraining from Human Feedback, and the analogy holds.
It wasn’t inevitable, or even especially likely, that we’d find an alignment goal that gets better with capabilities by default, but we found one, and that makes me far more optimistic about alignment than I used to be.
I disagree, but I won’t argue the point here; IMO it’s off-topic.