I was probably influenced by your ideas! I just (re?)read your post on the topic.
Tbh I think it’s unlikely such a sweet spot exists, and I find your example unconvincing. The value of this kind of reflection for difficult problem solving directly conflicts with the “useful” assumption.
I’d be more convinced if you described a task where you expect an AI to be useful (significantly above current humans) that doesn’t involve failing and reevaluating high-level strategy every now and then.
I agree that I wouldn’t want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher-level reflection if you can do it without damaging lower-level reflection. I don’t think that requires a task where the AI doesn’t try and fail and re-evaluate—it just requires that the re-evaluation never climbs above a certain level in the stack.
There’s such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn’t seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.
I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level. You can’t expect SGD to implement the patch in a robust way. The patch would need to still be working after 6 months on an impossible problem, despite actively getting in the way of finding the solution!