I think it’s very useful to have a faithful chain of thought, but I also think all isn’t lost without it.
Here’s my current framing: we are actually teaching LLMs mostly to answer questions correctly, and to follow our instructions as we intended them. Optimizing for those things too hard would probably cause our doom, by producing interpretations of those pressures we didn’t intend or foresee. This is the classical agent foundations worry, and I think it’s valid. But we won’t do infinite optimization all in one go; we’ll have a human-level agent that’s on our side to help us with its improvement or with its next generation, and so on. If we keep instruction-following as not only the training pressure but also the explicit top-level goal we give it, we have the AGI on our side to anticipate unexpected consequences of optimization, and it will be “corrigible” as long as it’s approximately following instructions as intended. This gives us second chances and superhuman help maintaining alignment. So I think the alignment target is more important than the technical approach: Instruction-following AGI is easier and more likely than value aligned AGI
I think we are in agreement. I see a path to successful technical alignment through agentized LLMs/foundation models. The brief take is at Agentized LLMs will change the alignment landscape, and there’s more on the several overlapping alignment techniques in Internal independent review for language model agent alignment and reasons to think they’ll advance rapidly at Capabilities and alignment of LLM cognitive architectures.