If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).
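As a toy illustration of that one-sided gap (a minimal sketch, not anything from the original discussion: the 5% label-error rate, the two hand-coded features, and using "agrees with the label" as a stand-in for reinforcement are all assumptions):

```python
import random

random.seed(0)

# Two candidate internal features: one tracks "a real diamond is present"
# (the intended natural abstraction), the other tracks "the labelling process
# would output 'diamond'".  Labels agree with the real diamond except on a
# small error fraction, where only the label-process feature still matches.
ERROR_RATE = 0.05      # assumed labeller error rate (made up for this sketch)
N_EPISODES = 10_000

diamond_matches = 0
label_process_matches = 0

for _ in range(N_EPISODES):
    diamond_present = random.random() < 0.5        # ground truth
    label_error = random.random() < ERROR_RATE     # labeller slips up
    label = diamond_present != label_error         # label flips on an error

    diamond_feature = diamond_present              # intended abstraction
    label_process_feature = label                  # models the labeller itself

    # Crude stand-in for "this proto-shard gets reinforced": its output agrees
    # with the label that actually drives the reward signal.
    diamond_matches += (diamond_feature == label)
    label_process_matches += (label_process_feature == label)

print(f"diamond feature agreed with the label:       {diamond_matches}/{N_EPISODES}")
print(f"label-process feature agreed with the label: {label_process_matches}/{N_EPISODES}")
# The label-process feature agrees 100% of the time by construction; the
# diamond feature loses only on the error cases and never wins one back --
# the asymmetry described above.
```

A real learning process is obviously nothing like this tally; the point is just that any mismatch between the label and the intended concept only ever cuts in the label-process feature's favour.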
This failure mode seems plausible to me, but I can think of a few different sequences of events that might plausibly occur, which would lead to different outcomes, at least through the shard lens.
Sequence 1:
The agent develops diamond-shard
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent exploits the gaps between the diamond-concept and the label-process-concept, which reinforces the label-process-shard within it
The label-process-shard drives the agent to continue exploiting the above gap, eventually (and maybe rapidly) overtaking the diamond-shard
So the agent’s values drift away from what we intended.
Sequence 2:
The agent develops diamond-shard
The diamond-shard becomes part of the agent’s endorsed preferences (the goal-content it foresightedly plans to preserve)
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent understands that if it exploited the gaps between the diamond-concept and the label-process-concept, it would be reinforced into developing a label-process-shard that would go against its endorsed preference for diamonds (i.e. its diamond-shard), so it chooses not to exploit that gap, in order to avoid value drift.
So the agent continues to value diamonds in spite of the imperfect labeling process.
These different sequences of events would seem to lead to different conclusions about whether imperfections in the labeling process are fatal.
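To make the fork between the two sequences concrete, here is a companion sketch under the same made-up setup (the mislabelled-state set, the error count, and the scoring rule are assumptions, and "exploiting the gap" is simplified to merely visiting the states the labeller gets wrong): in Sequence 1 the agent ends up in those states and the label-process feature pulls ahead, while in Sequence 2 it steers around them, so both features earn identical credit and nothing differentially reinforces a label-process shard.

```python
import random

random.seed(0)

STATES = list(range(1_000))
MISLABELLED = set(random.sample(STATES, 50))    # states the labeller gets wrong (assumed)

def label(state: int, diamond_present: bool) -> bool:
    """Imperfect labelling process: flips the truth on the mislabelled states."""
    return diamond_present != (state in MISLABELLED)

def run(avoid_known_errors: bool) -> tuple[int, int]:
    """Count how often each feature agrees with the label the agent is trained on."""
    diamond_credit = 0
    label_process_credit = 0
    for state in STATES:
        if avoid_known_errors and state in MISLABELLED:
            continue                             # Sequence 2: don't exploit the gap
        diamond_present = True                   # the agent goes after real diamonds
        lab = label(state, diamond_present)
        diamond_feature = diamond_present        # intended abstraction
        label_process_feature = lab              # feature that tracks the labeller
        diamond_credit += (diamond_feature == lab)
        label_process_credit += (label_process_feature == lab)
    return diamond_credit, label_process_credit

print("visit every state   (Sequence 1):", run(avoid_known_errors=False))
print("steer around errors (Sequence 2):", run(avoid_known_errors=True))
# Sequence 1: the label-process feature out-scores the diamond feature on the
# 50 mislabelled states.  Sequence 2: both features get identical credit, so
# there is no differential reinforcement favouring a label-process shard.
```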
Yup, that’s a valid argument. Though I’d expect that gradient hacking to the point of controlling the reinforcement on one’s own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).
I expect some form of gradient hacking to be convergently learned much earlier than the details of the labeling process. Online SSL incentivizes the agent to model its own shard activations (so it can better predict future data), and the concept of human value drift (“addiction”) is likely accessible from pretraining in the same way “diamond” is.
On the other hand, the agent has little information about the labeling process; I expect it to be more complicated, and to lack the convergent benefit of predicting future behavior that reflectivity has.
(You could even argue human error is good here, if it correlates more strongly with the human “diamond” abstraction the agent has from pretraining. This probably doesn’t extend to the “human values” case we care about, but I thought I’d mention it as an interesting thought.)
(agreed, for the record. I do think the agent can gradient starve the label-shard in story 2, though, without fancy reflective capability.)
Possibly. Though I think it is extremely easy in a context like this. Keeping the diamond-shard in the driver’s seat mostly requires the agent to keep doing the things it was already doing (pursuing diamonds because it wants diamonds), rather than making radical changes to its policy.