Overall, I was pretty impressed by this; there were several points where I thought “sure, that would be nice, but obstacle X,” and then the next section brought up obstacle X.
I remain sort of unconvinced that utility functions are the right type signature for this sort of thing, but I do feel convinced that “we need some sort of formal synthesis process, and a possible end product of that is a utility function.”
That is, most of the arguments I see for ‘how a utility function could work’ go through some convoluted steps. Suppose I’m trying to build a robot, I want it to be corrigible, and I have a corrigibility detector whose type is ‘decision process’ to ‘score’. I need to wrap that detector with a ‘world state’ to ‘decision process’ function and a ‘score’ to ‘utility’ function, and then I can hand the result to a robot that predicts ‘world state’ from ‘decision process’ and optimizes utility. If the robot’s predictive abilities are superhuman, it can trace out whatever weird dependencies I couldn’t see; if they’re imperfect, then each added transformation is another opportunity for errors to creep in. And it may be the case that this mapping is a core part of reflective stability (because routing through world-histories brings objective reality into things in a way that will be asymptotically stable with increasing intelligence) and doesn’t have a ready replacement.
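To make the type plumbing concrete, here is a minimal sketch of the composition I have in mind, in Python-style type hints. Every name in it (DecisionProcess, WorldState, corrigibility_detector, and so on) is a placeholder I’m inventing for illustration, not anything from the agenda itself:

```python
# Hypothetical placeholder types and stubs, just to show where each
# transformation sits and where errors can creep in.
from typing import NewType

class DecisionProcess: ...   # the agent's policy / deliberation procedure
class WorldState: ...        # a state (or history) of the world

Score = NewType("Score", float)      # raw corrigibility score
Utility = NewType("Utility", float)  # what the optimizer actually maximizes

# What I have: a detector that scores a decision process for corrigibility.
def corrigibility_detector(dp: DecisionProcess) -> Score: ...

# What I have to supply: two wrapper functions, each a new place for error.
def extract_decision_process(w: WorldState) -> DecisionProcess: ...  # locate the agent in the world
def score_to_utility(s: Score) -> Utility: ...                       # rescale / trade off against other terms

# The composed utility function over world states that gets handed to the robot.
def utility(w: WorldState) -> Utility:
    return score_to_utility(corrigibility_detector(extract_decision_process(w)))

# The robot's side: predict the world state each candidate decision process
# leads to, then pick the candidate whose predicted outcome has highest utility.
def predict_outcome(dp: DecisionProcess) -> WorldState: ...

def choose(candidates: list[DecisionProcess]) -> DecisionProcess:
    return max(candidates, key=lambda dp: utility(predict_outcome(dp)))
```

The point of writing it out is just that corrigibility_detector is the only piece I actually trust; every other arrow in the chain is scaffolding that has to be gotten right separately.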
I do find myself worrying that embedded agency will require dropping utility functions in a deep way, and that this ends up connected to whether this agenda will work (or which parts of it will work), but I remain optimistic that you’ll find out something useful along the way, and that you have that sort of obstacle in mind as you work on it.