Reading this post, I think it insufficiently addresses motivations, purposes, reward functions, etc. to make the bold claim that perfect world-model interpretability is sufficient for alignment. I think this because ontology is not the whole of action. Two agents with the same ontology and very different purposes would behave in very different ways.
Perhaps I’m being unfair, but I’m not convinced that you’re not making the same mistake people make when they claim that any sufficiently intelligent AI would be naturally good.
Two agents with the same ontology and very different purposes would behave in very different ways.
I don’t understand this objection. I’m not making any claim isomorphic to “two agents with the same ontology would have the same goals”. It sounds like maybe you think I’m arguing that if we can make the AI’s world-model human-like, it would necessarily also be aligned? That’s not my point at all.
The motivation is outlined at the start of 1A: I’m saying that if we can learn how to interpret arbitrary advanced world-models, we’d be able to more precisely “aim” our AGI at any target we want, or even manually engineer some structures over its cognition that would ensure the AGI’s aligned/corrigible behavior.
Isn’t aiming it at the goals we would want it to have a special case of aiming it at any target we want? And wouldn’t whatever goals we’d want it to have be informed by our ontology? So what I’m saying is that I think there’s a case where the generality of your claim breaks down.
Goals are functions over the concepts in one’s internal ontology, yes. But having a concept for something doesn’t mean caring about it — your knowing what a “paperclip” is doesn’t make you a paperclip-maximizer.
The idea here isn’t to train an AI with the goals we want from scratch, it’s to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.
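To make that distinction concrete, here is a minimal toy sketch. It is purely illustrative, and everything in it (the two concept names, the random candidate states, the goal functions) is made up for illustration rather than taken from the post: two agents query the same world-model, meaning the same mapping from states to concepts, but rank states with different goal functions over those concepts, and so behave differently.

```python
# Toy illustration: shared ontology, different purposes.
import numpy as np

rng = np.random.default_rng(0)

def world_model(state):
    """Shared 'ontology': map a raw state vector to named concepts.
    (The features are hypothetical, chosen only for illustration.)"""
    return {
        "paperclips": state[0],     # stand-in for "number of paperclips in this state"
        "human_welfare": state[1],  # stand-in for "how well humans are doing"
    }

def act(goal, candidate_states):
    """Pick the candidate state that maximises this agent's goal,
    where the goal is a function over world-model concepts."""
    return max(candidate_states, key=lambda s: goal(world_model(s)))

# Same ontology, different purposes:
def paperclip_maximiser(concepts):
    return concepts["paperclips"]

def welfare_maximiser(concepts):
    return concepts["human_welfare"]

candidates = [rng.uniform(0, 10, size=2) for _ in range(5)]
print("paperclip agent picks:", act(paperclip_maximiser, candidates))
print("welfare agent picks:  ", act(welfare_maximiser, candidates))
# Both agents "know" what a paperclip is; only one of them cares about it.
```

The point of the sketch is only that the world-model (`world_model`) is common to both agents; the divergence in behavior comes entirely from the goal function defined over its concepts.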
I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it’s relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.
The robust values hypothesis from DragonGod is worth looking at, too.
From the link below, I’ll quote:
Consider the following hypothesis:
1. There exists a “broad basin of attraction” around a privileged subset of human values[1] (henceforth “ideal values”)
   - The larger the basin, the more robust values are
   - Example operationalisations[2] of “privileged subset” that gesture in the right direction:
     - Minimal set that encompasses most of the informational content of “benevolent”/“universal”[3] human values
     - The “minimal latents” of “benevolent”/“universal” human values
   - Example operationalisations of “broad basin of attraction” that gesture in the right direction:
     - A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in #3)
     - Larger neighbourhood → larger basin
2. Said subset is a “naturalish” abstraction
   - The more natural the abstraction, the more robust values are
   - Example operationalisations of “naturalish abstraction”:
     - The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
       - More privileged → more natural
     - Most efficient representations of our universe contain a simple embedding of the subset
       - Simpler embeddings → more natural
3. Points within this basin are suitable targets for optimisation
   - The stronger the optimisation pressure applied for which the target is still suitable, the more robust values are
   - Example operationalisations of “suitable targets for optimisation”:
     - Optimisation of this target is existentially safe[4]
     - More strongly, we would be “happy” (were we fully informed) for the system to optimise for these points
This is an important hypothesis, since if it has a non-trivial chance of being correct, then AI Alignment gets quite a bit easier. And given the shortening timelines, I think this is an important hypothesis to test.
Here’s the link to the robust values hypothesis: https://www.lesswrong.com/posts/YoFLKyTJ7o4ApcKXR/disc-are-values-robust
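To make the “broad basin of attraction” operationalisation quoted above concrete, here is a minimal toy sketch. Everything in it is an assumption of mine for illustration (the small vector outcome space, the random-search stand-in for “strong optimisation”, the 0.5 acceptability threshold), not anything from DragonGod’s post: it perturbs an “ideal values” direction within a radius, optimises hard against the perturbed proxy, and reports the largest radius at which the optimised outcomes still look acceptable under the ideal values.

```python
# Toy sketch of a "broad basin of attraction" measurement, under made-up assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
ideal = rng.normal(size=DIM)
ideal /= np.linalg.norm(ideal)  # "ideal values" as a unit direction in outcome space

def optimise(values, budget=10_000):
    """Crude stand-in for strong optimisation: pick the best of many
    random candidate outcomes according to the (possibly proxy) values."""
    candidates = rng.normal(size=(budget, DIM))
    candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates[np.argmax(candidates @ values)]

def acceptable(outcome, threshold=0.5):
    """Stand-in for 'suitable target for optimisation': the optimised outcome
    still scores reasonably well under the ideal values."""
    return outcome @ ideal >= threshold

def basin_radius(radii, trials=20):
    """Largest tested perturbation radius at which every optimised proxy
    outcome remains acceptable under the ideal values."""
    largest = 0.0
    for r in radii:
        proxies = ideal + r * rng.normal(size=(trials, DIM))
        proxies /= np.linalg.norm(proxies, axis=1, keepdims=True)
        if all(acceptable(optimise(p)) for p in proxies):
            largest = r
    return largest

print("estimated basin radius:", basin_radius(radii=np.linspace(0.05, 1.0, 20)))
```

On this toy operationalisation, a larger returned radius would correspond to more robust values: more of the imperfect proxies near the ideal target remain safe to optimise hard for.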
Now this is admittedly very different from the thesis that value is complex and fragile.
I disagree. The fact that some concept is very complicated doesn’t mean it won’t be represented in any advanced AGI’s ontology. Humans’ psychology, the specific tools necessary to build nanomachines, and the agent-foundations theory necessary to design aligned successor agents are all also “complex and fragile” concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts to be convergently learned.
Not that I necessarily expect “human values” specifically to actually be a natural abstraction — an indirect pointer at “moral philosophy”/DWIM/corrigibility seems much more plausible and much less complex.
Sorry for misrepresenting your views.
I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it’s relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.
If this is the case, my concern seems yet more warranted, as this is hoping we won’t suffer a false positive: an alignment scheme that looks like it could work but won’t. Given the high cost of getting things wrong, we should minimize false-positive risks, which means not pursuing some ideas because the risk if they are wrong is too high.