johnswentworth comments on A shot at the diamond-alignment problem

johnswentworth 6 Oct 2022 20:33 UTC
LW: 4 AF: 3
2
AF
Yup, that’s a valid argument. Though I’d expect that gradient hacking to the point of controlling the reinforcement on one’s own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).
- Ulisse Mini 7 Oct 2022 15:18 UTC
  5 points
  0
  Parent
  I expect some form of gradient hacking to be convergently learned much earlier than the details of the labeling process. Online SSL incentivizes the agent to model its own shard activations (so it can better predict future data), and the concept of human value drift (“addiction”) is likely accessible from pretraining in the same way “diamond” is.
  
  On the other hand, the agent has little information about the labeling process. I expect that process to be more complicated, and modeling it does not have the convergent benefits of predicting future behavior that reflectivity has.
  
  (You could even argue human error is good here, if it correlates more strongly with the human “diamond” abstraction the agent has from pretraining. This probably doesn’t extend to the “human values” case we care about, but I thought I’d mention it as an interesting thought.)
  What links here?
  - [ASoT] Reflectivity in Narrow AI by Ulisse Mini (21 Nov 2022 0:51 UTC; 6 points)
- TurnTrout 7 Oct 2022 16:41 UTC
  LW: 3 AF: 3
  1
  AF Parent
  (agreed, for the record. I do think the agent can gradient starve the label-shard in story 2, though, without fancy reflective capability.)
- cfoster0 6 Oct 2022 20:45 UTC
  1 point
  0
  Parent
  Possibly. Though I think it is extremely easy in a context like this. Keeping the diamond-shard in the driver’s seat mostly requires the agent to keep doing the things it was already doing (pursuing diamonds because it wants diamonds), rather than making radical changes to its policy.