Brainstorming
The following is a naive attempt to write a formal, sufficient condition for a search process to be “not safe with respect to inner alignment”.
Definitions:
D: a distribution of labeled examples. Abuse of notation: I’ll assume that we can deterministically sample a sequence of examples from D.
L: a deterministic supervised learning algorithm that outputs an ML model. L receives as input an infinite sequence of training examples, and it uses a certain “amount of compute” c that is also provided as input. If we operationalize L as a Turing machine, then c can be the number of steps that L is simulated for.
L(D,c): the ML model that L outputs when given an infinite sequence of training examples deterministically sampled from D, and c as the “amount of compute” that L uses.
a_{L,D}(c): the accuracy of the model L(D,c) over D (i.e. the probability that the model L(D,c) will be correct for a random example that is sampled from D).
Finally, we say that the learning algorithm L Fails The Basic Safety Test with respect to the distribution D if the accuracy a_{L,D}(c) is not weakly increasing as a function of c.
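Spelled out (this is just one way to unpack the definition above), failing the test amounts to there being a pair of compute budgets at which extra compute strictly hurts accuracy:

$$\exists\, c_1 < c_2 \ \text{ such that } \ a_{L,D}(c_1) > a_{L,D}(c_2).$$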
Note: The “not weakly increasing” condition seems too weak. It should probably be replaced with a stricter condition, but I don’t know what that stricter condition should look like.
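As a rough illustration (not part of the definition), here is a minimal Python sketch of how one might probe this condition empirically over finitely many compute budgets. `train_model`, the example lists, and `compute_budgets` are hypothetical stand-ins for L, samples from D, and values of c; a finite empirical check like this can of course neither confirm nor refute the formal condition.

```python
from typing import Callable, List, Tuple

def estimate_accuracy(model: Callable, examples: List[Tuple[object, object]]) -> float:
    """Empirical estimate of a_{L,D}(c): fraction of held-out examples the model labels correctly."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)

def fails_basic_safety_test(
    train_model: Callable[[List[Tuple[object, object]], int], Callable],  # stand-in for L(D, c)
    train_examples: List[Tuple[object, object]],                          # sampled from D
    heldout_examples: List[Tuple[object, object]],                        # also sampled from D
    compute_budgets: List[int],                                           # increasing values of c
) -> bool:
    """Return True if estimated accuracy ever strictly drops as the compute budget grows.

    This only inspects finitely many budgets and uses empirical accuracy, so it is
    an illustration of the condition rather than a test of it.
    """
    accuracies = [
        estimate_accuracy(train_model(train_examples, c), heldout_examples)
        for c in sorted(compute_budgets)
    ]
    # "Not weakly increasing" over these budgets: some consecutive pair where accuracy drops.
    return any(a2 < a1 for a1, a2 in zip(accuracies, accuracies[1:]))
```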