I think even without point #4 you don’t necessarily get an AI that maximizes diamonds. Heuristically, it feels to me like you’re bulldozing open problems without understanding them (e.g., handling ontology identification by training with multiple models of physics, or preventing reward-hacking by explicit training), all of which are vulnerable to a deceptively aligned model (which can simply wait until it’s out of training to reward-hack). Also, every time you say “train it by X so it learns Y,” you’re assuming the very alignment you’re trying to establish (e.g., “digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion”).

IMO shard theory provides a great frame for thinking about this; it’s a must-read for improving alignment intuitions.