I like these ideas. Personally, I think a kitchen-sink approach to corrigibility is the way to go.
Some questions and comments:
Can the behavior of a sufficiently smart and reflective agent that uses utility aggregation with sweetening and effort penalties be modeled as a non-corrigible / utility-maximizing agent with a more complicated utility function? If so, what would that utility function look like? Does constructing such a model require drawing the boundaries around the agent differently (perhaps to include humans within them), or otherwise require that the agent itself hold a somewhat contrived view / ontology of its own sense of “self”?
You cited a bunch of Russell’s work, but I’d be curious to see a more nuts-and-bolts analysis of how your ideas relate and compare to CIRL specifically.
Is utility aggregation related to geometric rationality in any way? The idea of aggregating utilities across possible future selves seems philosophically similar.
Thanks for reading!
Yes, you can think of it as having a non-corrigible, complicated utility function. The relevant utility function is the ‘aggregated utility’ defined in section 2. I think ‘corrigible’ vs ‘non-corrigible’ is partly a verbal question, since it depends on how you define ‘utility’; the non-verbal question is whether the resulting AI is safer.
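To make the equivalence concrete, here is a minimal Python sketch. It assumes, purely for illustration, that aggregation is a probability-weighted mix over candidate future utility functions plus sweetening and effort-penalty terms; the actual Section 2 definition may differ, and all names and numbers below are placeholders.

```python
# Toy sketch (not the paper's exact Section 2 definition): packaging
# "aggregated utility" as a single fixed utility function, so the agent
# can equivalently be viewed as an ordinary (non-corrigible)
# expected-utility maximizer over this aggregate.

def aggregated_utility(outcome, candidate_utils, weights,
                       sweetening=0.0, effort_cost=0.0):
    """Weighted mix over possible future utility functions, plus
    illustrative sweetening and effort-penalty terms."""
    mix = sum(w * u(outcome) for u, w in zip(candidate_utils, weights))
    return mix + sweetening - effort_cost

# Example: two hypotheses about what the human will end up wanting.
u_paperclips = lambda o: o["paperclips"]
u_staples = lambda o: o["staples"]

outcome = {"paperclips": 3.0, "staples": 1.0}
print(aggregated_utility(outcome, [u_paperclips, u_staples], [0.6, 0.4]))
```

On this framing, the safety question is just whether maximizing this particular fixed aggregate behaves better than maximizing any one of the candidate utilities on its own.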
Good idea, this is on my agenda!
Looking forward to reading up on geometric rationality in detail. On a quick first pass, it looks like geometric rationality is a bit different, because it involves deviating from the axioms of VNM rationality by using random sampling. By contrast, utility aggregation is consistent with VNM rationality, because it just replaces the ordinary utility function with the aggregated utility.
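Here is a rough toy contrast of the two, on my reading; it is a gloss rather than a faithful rendering of either proposal, and the options, weights, and sampling rule are made up for illustration.

```python
# Toy contrast: utility aggregation keeps VNM form -- deterministically
# maximize an arithmetic (probability-weighted) mean of candidate
# utilities -- whereas geometric rationality, as I understand it,
# aggregates multiplicatively and can endorse randomizing between options.
import math
import random

candidate_utils = [lambda a: a["u1"], lambda a: a["u2"]]
weights = [0.5, 0.5]
options = [{"u1": 4.0, "u2": 1.0}, {"u1": 2.0, "u2": 2.0}]

def arithmetic_score(a):
    return sum(w * u(a) for u, w in zip(candidate_utils, weights))

def geometric_score(a):
    return math.prod(u(a) ** w for u, w in zip(candidate_utils, weights))

# VNM-style: always pick the option maximizing the arithmetic aggregate.
best_vnm = max(options, key=arithmetic_score)

# Geometric-rationality-style (toy): sample options in proportion to
# their geometric scores instead of always taking the argmax.
scores = [geometric_score(a) for a in options]
sampled = random.choices(options, weights=scores, k=1)[0]

print(best_vnm, sampled)
```

The first rule is just expected-utility maximization with the aggregated utility plugged in, which is why it stays within VNM; the second rule's randomization is the kind of departure from VNM I had in mind.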