The model can be “narrower.” It doesn’t need to understand biology, physics, or human society especially well. In practice we’d probably fine-tune from an LLM that does understand all of those things, but we could apply some targeted brain damage to the model as a safety precaution. More generally, the model only has to exceed human level in a few domains; it can be worse than humans in most others.
Jan Leike seems to make a similar point in his “Why I’m optimistic about our alignment approach” post.