As an intuition pump, imagine a company that is run entirely by A/B tests for metrics that can be easily checked. This company would burn every resource it couldn’t measure — its code would become unmaintainable, its other infrastructure would crumble, it would use up goodwill with customers, it would make no research progress, it would become unable to hire, it would get on the wrong side of regulators.
It seems like part of this problem is easy-ish, and part is hard.
The easy part: it seems like you can formally capture what resources are via average optimal value. A system which actually increased my average optimal value wrt the future seems quite helpful. Basically, this is just an alternative statement of instrumental convergence: ceteris paribus, making sure I'm highly able to paint houses blue also probably means I can autonomously pursue my actual values (a rough sketch of the quantity I mean follows the footnote).*
* This probably reads weird, but I don’t have time to go in depth on this right now. Happy to clarify more later.
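To be slightly more concrete about the quantity I have in mind (my notation, and a simplification of how this kind of thing gets formalized in the POWER / instrumental-convergence literature): fix some broad distribution $\mathcal{D}$ over goals or reward functions, and ask how well an optimal policy could do from the current state, on average over that distribution:

$$\mathrm{AOV}_{\mathcal{D},\gamma}(s) \;=\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^*_{R,\gamma}(s) \right],$$

where $V^*_{R,\gamma}(s)$ is the optimal value of state $s$ under reward $R$ and discount $\gamma$. "Resources" are then roughly whatever raises this expectation across most of $\mathcal{D}$.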
But average optimal value is itself inaccessible. It's less inaccessible than, e.g., my true moral values and desires, but it still requires reasoning about something in the world which cannot be directly observed. Furthermore, "average optimal value" relies on a notion of counterfactual that is itself an abstraction: "how well could this person achieve this other goal, which they won't actually pursue?" We'd have to pin down that abstraction, too.
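To illustrate why this is model-laden, here's a toy sketch (everything in it is illustrative: the tabular MDP, the transition model P, and the uniform reward distribution are assumptions I'm making up for the example, not anything pinned down above). The point is that even estimating average optimal value already requires a model of counterfactual dynamics, which is exactly the abstraction that needs pinning down:

```python
# Toy illustration: estimating "average optimal value" of each state in a
# small tabular MDP by sampling reward functions and solving each one.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# A made-up transition model P[s, a, s'] -- this stands in for the
# counterfactual "what could happen" abstraction discussed above.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def optimal_values(R, tol=1e-8):
    """Value iteration: returns V*_R(s) for a state-based reward R."""
    V = np.zeros(n_states)
    while True:
        Q = R[:, None] + gamma * (P @ V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Average optimal value: expectation of V*_R(s) over a (here, uniform)
# distribution of reward functions, approximated by Monte Carlo sampling.
samples = [optimal_values(rng.uniform(size=n_states)) for _ in range(500)]
avg_optimal_value = np.mean(samples, axis=0)
print(avg_optimal_value)  # higher entries ~ states with more "resources"
```

None of this says where P or the goal distribution come from for the actual world, which is where the inaccessibility bites.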
I agree that if you had a handle on accessing average optimal value then you’d be making headway.
I don’t think it covers everything, since e.g. safety / integrity of deliberation / etc. are also important, and because instrumental values aren’t quite clean enough (e.g., even if AI safety were super easy, these agents would only work on the version that was useful for optimizing values from the mixture used).
But my bigger question is how to make headway on accessing average optimal value, and whether focusing on it actually makes the problem easier.