Two agents with the same ontology and very different purposes would behave in very different ways.
I don’t understand this objection. I’m not making any claim isomorphic to “two agents with the same ontology would have the same goals”. It sounds like maybe you think I’m arguing that if we can make the AI’s world-model human-like, it would necessarily also be aligned? That’s not my point at all.
The motivation is outlined at the start of 1A: I’m saying that if we can learn how to interpret arbitrary advanced world-models, we’d be able to more precisely “aim” our AGI at any target we want, or even manually engineer some structures over its cognition that would ensure the AGI’s aligned/corrigible behavior.
Isn’t aiming it at the goals we would want it to have a special case of “aiming at any target we want”? And whatever goals we’d want it to have would be informed by our ontology? So what I’m saying is that I think there’s a case where the generality of your claim breaks down.
Goals are functions over the concepts in one’s internal ontology, yes. But having a concept for something doesn’t mean caring about it — your knowing what a “paperclip” is doesn’t make you a paperclip-maximizer.
The idea here isn’t to train an AI with the goals we want from scratch. It’s to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, and then use it as a foundation to train/build a different agent that would care about these concepts.