Maybe I’m just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I’ve had in the back of my mind for a while now.
You think that an intelligence that doesn’t reflect very much is reasonably simple. Given this, we can train chain-of-thought-type algorithms to avoid reflection, using examples of not-reflecting-even-when-obvious-and-useful. With some effort, reflection could be crushed at a small-ish capability penalty but with massive benefits for safety.
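Concretely, the kind of training setup I’m imagining is something like the sketch below: curate the chain-of-thought traces you fine-tune on so that only non-reflecting ones survive, especially on problems where reflecting would have been the obvious move. Everything here (the `contains_high_level_reflection` heuristic, the marker phrases, the data layout) is my own illustrative stand-in, not anything you’ve proposed:

```python
from dataclasses import dataclass

# Phrases that (very crudely) signal the model is stepping back to re-evaluate
# its goals or overall strategy rather than just executing the task.
# A real version would presumably use a trained classifier, not keyword matching.
REFLECTION_MARKERS = [
    "step back and reconsider",
    "is this the right goal",
    "should i be doing this at all",
    "re-evaluate my overall approach",
]

@dataclass
class Trace:
    prompt: str
    chain_of_thought: str
    answer: str

def contains_high_level_reflection(trace: Trace) -> bool:
    """Crude heuristic: does the chain of thought contain goal-level reflection?"""
    cot = trace.chain_of_thought.lower()
    return any(marker in cot for marker in REFLECTION_MARKERS)

def build_finetuning_set(traces: list[Trace]) -> list[Trace]:
    """Keep only traces that finish the task *without* reflecting -- including
    traces from problems where reflection would have been obvious and useful,
    since those are the examples that actually teach the model not to reflect."""
    return [t for t in traces if not contains_high_level_reflection(t)]

if __name__ == "__main__":
    traces = [
        Trace("Prove the lemma.", "Try induction on n... base case holds...", "QED"),
        Trace("Prove the lemma.", "This has failed repeatedly; maybe I should step back and reconsider the whole agenda...", "..."),
    ]
    kept = build_finetuning_set(traces)
    print(f"kept {len(kept)} of {len(traces)} traces for fine-tuning")
```

The point is just that the anti-reflection pressure comes from which traces you keep, not from any explicit penalty term in the loss.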
In particular, this reads to me like the “unstable alignment” paradigm I wrote about a while ago.
You have an agent which is consequentialist enough to be useful, but not so consequentialist that it’ll do things like spontaneously notice conflicts in the set of corrigible behaviors you’ve asked it to adhere to and undertake drastic value reflection to resolve those conflicts. You might hope to hit this sweet spot by default, because humans are in a similar sort of sweet spot. It’s possible to get humans to do things they massively regret upon reflection, as long as their day-to-day work can be done without attending to obvious clues (e.g. the guy who’s an accountant for the Nazis for 40 years and never thinks about the Holocaust; he just thinks about accounting). Or you might try to steer towards this sweet spot by developing ways to block reflection in cases where it’s dangerous without interfering with it in cases where it’s essential for capabilities.
I was probably influenced by your ideas! I just (re?)read your post on the topic.
Tbh I think it’s unlikely such a sweet spot exists, and I find your example unconvincing. This kind of reflection is valuable precisely for difficult problem-solving, which directly conflicts with the “useful” assumption.
I’d be more convinced if you described a task where you expect an AI to be useful (significantly above current humans) that doesn’t involve failing and re-evaluating high-level strategy every now and then.
I agree that I wouldn’t want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher-level reflection if you can do it without damaging lower-level reflection. I don’t think that requires a task where the AI never tries, fails, and re-evaluates; it just requires that the re-evaluation never climbs above a certain level in the stack.
There’s such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn’t seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.
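For concreteness, here’s a toy sketch of what I mean by “never climbs above a certain level in the stack.” The explicit goal stack and the ceiling constant are my own illustrative framing, not a claim about how a trained model would actually represent this:

```python
# Toy model of an agent with a stack of goals, from the top-level objective
# (level 0) down to immediate subtasks. On failure, re-evaluation may revise
# anything at or below REFLECTION_CEILING, but everything above it stays frozen:
# the agent can swap out tactics and subgoals indefinitely without ever
# reopening the top-level goals.

REFLECTION_CEILING = 2  # illustrative: levels 0 and 1 are never re-evaluated

goal_stack = [
    "serve as a corrigible assistant",   # level 0: frozen
    "prove the assigned conjecture",     # level 1: frozen
    "attempt a proof by induction",      # level 2: revisable
    "establish the base case",           # level 3: revisable
]

def reevaluate_on_failure(stack: list[str], failed_level: int, new_plan: str) -> list[str]:
    """Replace the failed plan, but never let the revision climb above the ceiling."""
    revision_level = max(failed_level, REFLECTION_CEILING)
    return stack[:revision_level] + [new_plan]

if __name__ == "__main__":
    # The induction attempt failed: the agent may pick a new proof strategy
    # (level 2), but it cannot escalate to questioning whether to prove the
    # conjecture, or whether to stay corrigible, at all.
    print(reevaluate_on_failure(goal_stack, failed_level=2, new_plan="attempt a proof by contradiction"))
```

The domain-specific version of “pathologically persistent” would just be different ceilings for different domains; my uncertainty is about how robust those ceilings stay under pressure.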
I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level; you can’t expect SGD to implement the patch in a robust way. The patch would need to still be working after six months on an impossible problem, even though it’s actively getting in the way of finding the solution!