This is an excellent point.
While LLMs seem (relatively) safe, we may very well blow right on by them soon.
I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values). Those would be retained in improved architectures that still predict human language use. But if such a system's thought process were entirely opaque, we could no longer perform externalized reasoning oversight by "reading its thoughts".
But I think it might be possible to build a reliable agent from unreliable parts. I think humans are such an agent, and evolution made us this way because it's a way to squeeze extra capability out of a set of base cognitive capacities.
Imagine an agentic set of scaffolding that merely calls the super-LLM for individual cognitive acts. Such an agent would use a hand-coded "System 2" thinking approach to solve problems, much as humans do: breaking a problem into explicit cognitive steps. We also use System 2 for our biggest ethical decisions; we predict the consequences of our major decisions and compare them to our goals, including ethical goals. A synthetic agent of this kind would use System 2 both for problem-solving and for checking plans against its goals. The plan-checking would be done for efficiency in the first place, since spending a lot of compute or external resources on a bad plan would be quite costly; but having implemented it for efficiency, you might as well use it for safety.
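A minimal sketch of what I mean, in Python. Everything here is a hypothetical stand-in, not an existing API: llm() is just a placeholder for a single call to the underlying model ("one cognitive act"), and the decompose / plan / check loop is the hand-coded System 2 scaffolding.

```python
from typing import List


def llm(prompt: str) -> str:
    """Placeholder for one call to the underlying model (one cognitive act)."""
    raise NotImplementedError("plug in a real model client here")


def decompose(problem: str) -> List[str]:
    # System 2, step 1: break the problem into explicit sub-steps.
    steps = llm(f"List the sub-steps needed to solve:\n{problem}")
    return [s for s in steps.splitlines() if s.strip()]


def draft_plan(problem: str, steps: List[str]) -> str:
    # System 2, step 2: turn the sub-steps into a concrete plan.
    return llm(
        f"Problem: {problem}\nSub-steps:\n" + "\n".join(steps) +
        "\nWrite a concrete plan."
    )


def check_plan(plan: str, goals: List[str]) -> bool:
    # System 2, step 3: the same machinery used for efficiency (will this
    # plan actually achieve the task goal?) is reused for safety (does it
    # also satisfy the ethical / instructed goals?).
    for goal in goals:
        verdict = llm(
            f"Plan:\n{plan}\nDoes this plan satisfy the goal '{goal}'? "
            "Answer yes or no."
        )
        if not verdict.strip().lower().startswith("yes"):
            return False
    return True


def solve(problem: str, goals: List[str], max_revisions: int = 3) -> str:
    steps = decompose(problem)
    plan = draft_plan(problem, steps)
    for _ in range(max_revisions):
        if check_plan(plan, goals):
            return plan  # only a plan that passes the goal check is returned
        plan = llm(f"Revise this plan to better satisfy {goals}:\n{plan}")
    raise RuntimeError("no plan passed the goal/safety check")
```

The point of the sketch is just that the plan-checking step sits in the loop anyway for efficiency reasons, so adding ethical goals to the checklist is nearly free.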
This is just restating stuff I've said elsewhere, but I'm trying to refine the model and work through how well it might work if you couldn't apply any externalized reasoning oversight and had little to no interpretability. That's definitely bad for the odds of success, but not necessarily crippling. I think.
This needs more thought. I’m working on a post on System 2 alignment, as sketched out briefly (and probably incomprehensibly) above.
Did you mean something different than “AIs understand our intentions” (e.g. maybe you meant that humans can understand the AI’s intentions?).
I think future more powerful AIs will surely be strictly better at understanding what humans intend.
I think future, more powerful/useful AIs will understand our intentions better IF they are trained to predict language. Text corpora contain rich semantics about human intentions.
I can imagine other AI systems that are trained differently, and I would be more worried about those.
That’s what I meant by current AI understanding our intentions possibly better than future AI.