Change my mind: outer alignment will likely be solved by default for LLMs. Two lines of evidence point this way: brain-LM scaling laws (https://arxiv.org/abs/2305.11863), and work on LM embeddings as a model of the shared linguistic space used to transmit thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract). Together they suggest we’ll be able to ‘transmit our thoughts’ to LMs, including alignment-relevant concepts, and that those concepts will be represented in a (partially overlapping) human-like way.
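The cited brain-LM work rests on linear encoding models: regress brain responses onto LM embeddings and score how well held-out activity is predicted. A minimal sketch of that pipeline, using synthetic data in place of real fMRI recordings (all dimensions, the noise level, and the ridge penalty here are illustrative, not taken from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: word-level LM embeddings (dim d) and simulated
# "brain" responses (n_voxels) that are a noisy linear readout of those
# embeddings -- the standard encoding-model framing in brain-LM studies.
n_words, d, n_voxels = 200, 32, 10
X = rng.normal(size=(n_words, d))              # LM embeddings (synthetic)
W_true = rng.normal(size=(d, n_voxels))        # ground-truth linear map
Y = X @ W_true + 0.5 * rng.normal(size=(n_words, n_voxels))

# Ridge regression (closed form) from embeddings to voxel responses.
lam = 1.0
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Held-out predictivity ("brain score"): per-voxel correlation between
# predicted and observed responses on fresh data.
X_test = rng.normal(size=(50, d))
Y_test = X_test @ W_true + 0.5 * rng.normal(size=(50, n_voxels))
Y_pred = X_test @ W_hat
scores = [np.corrcoef(Y_pred[:, v], Y_test[:, v])[0, 1]
          for v in range(n_voxels)]
print(f"mean voxelwise correlation: {np.mean(scores):.2f}")
```

The scaling-law result is then roughly: this predictivity score rises lawfully with LM size/quality, which is the empirical basis for claiming a shared representational space.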
I think the corrigibility agenda, framed as “do what I mean, such that I will probably approve of the consequences — not just what I literally say, such that our interaction will likely harm my goals,” is more doable than some have made it out to be. That said, there are enough subtle gotchas that it makes sense to treat it as an area for careful study rather than declaring it “solved by default, no need to worry.”