I still feel like you’re missing something important here.
For instance… in the explainability factor, you measure “the average deviation of π from the actions favored by the action-value function qμ of μ”, using the formula

$$\mathrm{pred}_{Eg}(\pi,\mu,s) \;=\; \frac{1}{T}\sum_{t=0}^{T}\frac{\max_a q_\mu(s_t,a) - q_\mu(s_t,\mathrm{action}_\pi)}{\max_a q_\mu(s_t,a)}.$$

But why this particular formula? Why not take the log of qμ first, or use $3+\max_a q_\mu(s_t,a)$ in the denominator? Indeed, there’s a strong argument to be made that this formula is a bad choice: the value function qμ is invariant under multiplying by a scalar or adding a constant (i.e. these operations leave the preferences encoded by qμ unchanged), yet this value is not invariant to adding a constant to qμ. So we could change our representation of the “goal” to which we’re comparing, in a way which should still represent the same goal, yet the supposed answer to “how well does this goal explain the system’s behavior” changes.
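To make the non-invariance concrete, here is a minimal numerical sketch (the q-values, the two-action trajectory, and the helper name pred_explainability are all made up for illustration; the function is just a stand-in for the quoted formula): shifting qμ by a constant leaves the encoded preferences untouched, but the computed deviation changes.

```python
import numpy as np

def pred_explainability(q_values, pi_actions):
    """Stand-in for the quoted formula: average over the trajectory of
    (max_a q_mu(s_t, a) - q_mu(s_t, action_pi)) / max_a q_mu(s_t, a)."""
    best = q_values.max(axis=1)                                # max_a q_mu(s_t, a)
    taken = q_values[np.arange(len(pi_actions)), pi_actions]   # q_mu(s_t, action_pi)
    return np.mean((best - taken) / best)

# A tiny 3-step trajectory with 2 actions (numbers chosen arbitrarily).
q = np.array([[2.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])
pi = np.array([1, 0, 0])  # pi deviates from argmax_a q_mu only at the first state

print(pred_explainability(q, pi))         # ~0.167
print(pred_explainability(q + 10.0, pi))  # ~0.028, same preferences, different score
```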
Don’t get too caught up on this one specific issue—there’s a broader problem I’m pointing to here. The problem is with trying to use arbitrary formulas to represent intuitive concepts. If multiple non-equivalent formulas seem like similarly-plausible quantifications of an intuitive concept, then at least one of them is wrong; we have not yet understood the intuitive concept well enough to correctly quantify it. Unless every degree of freedom in the formula is nailed down (up to mathematical equivalence), we haven’t actually quantified the intuitive concept, we’ve just come up with a proxy.
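As a worked illustration of that underdetermination (the label and the assumption qμ > 0 are just for this sketch): every member of the one-parameter family

$$\mathrm{pred}^{(c)}(\pi,\mu,s) \;=\; \frac{1}{T}\sum_{t=0}^{T}\frac{\max_a q_\mu(s_t,a) - q_\mu(s_t,\mathrm{action}_\pi)}{c + \max_a q_\mu(s_t,a)},\qquad c \ge 0,$$

is zero exactly when π always picks a qμ-maximizing action and grows as π deviates further, so the intuitive desiderata alone don’t single out c = 0 (the quoted formula) over c = 3 or any other value, even though the resulting formulas are not equivalent.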
That’s what these numbers are: they’re not sufficient statistics, they’re proxies, in exactly the same sense that “how often a human pushes an approval button” is a proxy for how good an AI’s actions are. And they will break down, as proxies always do.
That puts this part in a somewhat different perspective:
Honestly, I don’t think there’s an argument to show these are literally sufficient statistics. Yet I still think staking the claim that they are is quite productive for further research. It gives concreteness to an exploration of goal-directedness, carving more grounded questions:
Given a question about goals and goal-directedness, are these properties enough to frame and study this question? If yes, then study it. If not, then study what’s missing.
Are my formulas adequate formalizations of the intuitive properties?
I claim it makes more sense to word these questions as:
Given a question about goals and goal-directedness, are these proxies enough to frame and study this question?
Are these proxies adequate formalizations of the intuitive properties?
The answer to the first question may sometimes be “yes”. The answer to the second is definitely “no”; these are proxies, and they absolutely will not hold up if we try to put optimization pressure on them. Goodhart’s law will kick in. For instance, tying back to the earlier example, at some point there may be a degree of freedom in how the goal is represented, without changing the substantive meaning of the goal (e.g. adding a constant to qμ). Normally, that won’t be much of a problem, but if we put optimization pressure on it, then we’ll end up with some big constant added to qμ in order to change the explainability factor, and then the proxy will break down: the explainability factor will cease to be a good measure of explainability.
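Concretely (a one-line check, again assuming qμ > 0): adding a constant c to qμ leaves each numerator unchanged, since $\max_a (q_\mu(s_t,a)+c) - (q_\mu(s_t,\mathrm{action}_\pi)+c) = \max_a q_\mu(s_t,a) - q_\mu(s_t,\mathrm{action}_\pi)$, while inflating each denominator to $\max_a q_\mu(s_t,a) + c$, so

$$\frac{\max_a q_\mu(s_t,a) - q_\mu(s_t,\mathrm{action}_\pi)}{\max_a q_\mu(s_t,a) + c} \;\longrightarrow\; 0 \quad \text{as } c \to \infty,$$

and an adversarially chosen c makes any policy look almost perfectly explained by any goal.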
To people reading this thread: we had a private conversation with John (faster and easier), which resulted in me agreeing with him.
The summary is that you can see the arguments made and the constraints invoked as a set of equations, such that an adequate formalization is a solution of this set. But if the set has more than one solution (maybe many), then it’s misleading to call any one of them the solution.
So these last few days I’ve been working on arguing for the properties (generalization, explainability, efficiency) in such a way that the corresponding set of equations has only one solution.
I’m working on writing it up properly; I should have a post up at some point.
EDIT: it’s up.