In some sense, this idea solves basically none of the core problems of alignment. We still need a good-enough model of a human and a good-enough pointer to human values.
It seems to me that while the fixed-point conception here doesn’t uniquely determine a learning strategy, it should be possible to pin down a unique strategy by building it into the training data.
In particular, if you have a base level of “reality” like the P_0 you describe, it should be possible to train a model on that reality first, then present it with training scenarios that work directly on the “verifiable reality” subset, then build up to “one layer removed”, and so on.
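To make the curriculum concrete, here is a minimal sketch of what “train on P_0 first, then one layer removed” could look like; the Level container and the toy one-parameter model are purely illustrative assumptions of mine, not part of the original setup.

```python
# Toy curriculum sketch: fit a model on the base "reality" level P_0 first,
# then on data one layer removed (P_1), and so on. The Level container and
# the one-parameter linear model are illustrative placeholders only.

from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Level:
    name: str                                # e.g. "P_0", "P_1", ...
    examples: Sequence[Tuple[float, float]]  # (input, verifiable target) pairs


def train_curriculum(levels: List[Level], lr: float = 0.05, epochs: int = 500) -> Tuple[float, float]:
    """Train level by level, never seeing P_{k+1} data before P_k is finished."""
    w, b = 0.0, 0.0
    for level in levels:                     # P_0 first, then P_1, then P_2, ...
        for _ in range(epochs):
            for x, y in level.examples:
                err = (w * x + b) - y
                w -= lr * err * x            # plain gradient step on squared error
                b -= lr * err
    return w, b


if __name__ == "__main__":
    levels = [
        Level("P_0", [(0.0, 0.0), (1.0, 1.0)]),  # directly verifiable reality
        Level("P_1", [(2.0, 2.1), (3.0, 2.9)]),  # one layer removed
    ]
    print(train_curriculum(levels))
```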
My (very weak) shoulder-John says that just because this “feels like it converges” doesn’t actually guarantee convergence, but since P_0, P_1, etc. are very well specified, it feels like a more approachable problem to try to analyze a specific basin of convergence. If one gets a basin of convergence, AND an algorithm for locating that basin, that seems to me sufficient for object-level honesty, which would be a major result.
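As a toy illustration of what “an algorithm for locating a basin of convergence” could even mean in the simplest numerical setting, the sketch below iterates an arbitrary fixed-point map (a Newton step for sqrt(2), chosen purely for concreteness and unrelated to the P_0, P_1, … construction) from several starting points and records which of them settle at the fixed point.

```python
# Toy numerical sketch: empirically map out which starting points lie in the
# basin of convergence of a fixed point. The map g (a Newton step for sqrt(2))
# is an arbitrary stand-in, not the P_0, P_1, ... construction itself.

import math


def converges_to(g, x0, fixed_point, tol=1e-9, max_iters=200):
    """Return True if iterating x <- g(x) from x0 lands within tol of fixed_point."""
    x = x0
    for _ in range(max_iters):
        x = g(x)
        if abs(x - fixed_point) < tol:
            return True
    return False


def g(x):
    return 0.5 * (x + 2.0 / x)               # fixed points at +sqrt(2) and -sqrt(2)


if __name__ == "__main__":
    target = math.sqrt(2.0)
    for x0 in (0.1, 1.0, 5.0, -1.0, -3.0):
        print(f"x0 = {x0:5.1f} -> in basin of +sqrt(2): {converges_to(g, x0, target)}")
```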
I’m curious if you disagree with:
1. The problem of choosing a basin of convergence is tractable (relative to alignment research in general)
2. The problem of verifying that an AI is in the basin of convergence is tractable
3. Training an AI into a chosen basin of convergence could force that AI to be honest at the object level whenever object-level honesty is available
4. Object-level honesty is not a major result, for example because not enough important problems can be reduced to the object level, or because it is already achievable
Writing that out, I am guessing that (2) may be a disagreement where I would still hold my position (e.g. you may think it is not tractable), and (3) may contain a disagreement that is compelling and hard to resolve (e.g. you may think we cannot verify which basin of convergence satisfies our honesty criteria; my intuition is that this would require not having a basin of convergence at all).