The idea here isn’t to train an AI with the goals we want from scratch. It’s to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, and then use it as a foundation to train or build a different agent that would care about those concepts.
Goals are functions over the concepts in one’s internal ontology, yes. But having a concept for something doesn’t mean caring about it — your knowing what a “paperclip” is doesn’t make you a paperclip-maximizer.
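To make that split concrete, here’s a deliberately toy sketch of the pipeline in PyTorch. Everything specific in it (the synthetic data, the four-dimensional latent, a linear probe standing in for real interpretability work) is an illustrative assumption, not something the proposal commits to: a model trained purely on prediction ends up representing the concept instrumentally, and “caring” only appears when we separately define a value function over the representation the interpretability step located.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: observations generated from a hidden "concept" plus noise.
concept = torch.randn(1000, 1)   # the quantity we (hypothetically) care about
noise = torch.randn(1000, 7)
mixing = torch.randn(8, 16)
observations = torch.cat([concept, noise], dim=1) @ mixing

# "World-model": an autoencoder trained purely on reconstruction. It has no
# goals; it just ends up representing the concept because that helps prediction.
encoder = nn.Sequential(nn.Linear(16, 4), nn.Tanh())
decoder = nn.Linear(4, 16)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-2)
for _ in range(500):
    recon = decoder(encoder(observations))
    loss = ((recon - observations) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Interpretation" step: a linear probe locating how the concept is encoded in
# the world-model's latents. (In the actual proposal this is the hard part;
# least-squares regression is just the simplest possible stand-in.)
with torch.no_grad():
    latents = encoder(observations)
    probe = torch.linalg.lstsq(latents, concept).solution   # shape (4, 1)

# The world-model now *represents* the concept, but nothing here cares about
# it. Caring gets added separately: a value function defined over that
# representation, which a new agent could then be built/trained to pursue.
def concept_value(obs: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return encoder(obs) @ probe
```

The last step is exactly the paperclip point above: `encoder` knows about the concept either way; only `concept_value` (and whatever agent gets trained against it) turns that knowledge into something being optimized for.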