I think alignment might well be very hard and the stakes are certainly very high, but I must say that I find this post only partly convincing. Here are some thoughts I had while reading the post:
In primates, making them smarter probably mostly required scaling up the cortex, whereas making the smarter version more aligned with IGF would have required rewiring the older brain parts. One of these is much harder to do than the other.
But in machines, both alignment and capabilities will likely be learned within the same architecture, making it less likely than in primates that capabilities outstrip alignment for architectural reasons.
That there is a capabilities well is certainly true. But shouldn’t there be value wells as well? One might think “I do what is best for me” is a deeper well than “I do what is best for all” and more aligned with instrumental power seeking.
But “I do what is best for me” isn’t really a value well, because it begs the question: if I do what is best for me, I still have to decide what is good. The second well actually provides values, because other entities already have preferences. Helping them to fulfill those preferences is an actual value, i.e. it directs actions, in a way that “I do what is best for me” on its own does not.
Is the structure underlying “do what is best for all” less simple and logical than arithmetic?
I often see it assumed that the ultimate values of an AGI will be kind of random: whatever some mesa-optimiser stumbled into during training. If this is true, then there is little optimisation pressure towards those particular values, and it seems possible to train for “do what is best for all” instead.
That there is a capabilities well is certainly true. But shouldn’t there be value wells as well? One might think “I do what is best for me” is a deeper well than “I do what is best for all” and more aligned with instrumental power seeking.
Do you mean a wider well? Width (“how hard is it to hit this basin at all, such that you start to fall the rest of the way down the right slope?”) seems like the main property of interest.
But “I do what is best for me” isn’t really a value well, because it begs the question: if I do what is best for me, I still have to decide what is good. The second well actually provides values, because other entities already have preferences. Helping them to fulfill those preferences is an actual value, i.e. it directs actions, in a way that “I do what is best for me” on its own does not.
“I do what is best for me” and “I do what is best for all” are both too underspecified to say much about; in particular, it’s very unclear to me in each case what is meant by “best”, and the vague English phrasing seems liable to mislead us, since e.g. it lends itself to equivocating between different meanings of “best”, it doesn’t provide an obvious path toward making the meaning more precise or unambiguous, and it suggests the relevant goals are simpler than they are (since the English words are short).
If you had a fully specified, unambiguous list of goals — the sorts of goals that could actually function as lines of code in a program selecting between actions — then they would end up looking like a giant space of things like:
Maximize confidence that 673 is a prime number
Maximize the number of approximately spherical configurations of granite
Maximize the quantity: (the number of carbon atoms in the universe arranged in diamond lattices) plus (44.3 times the number of times an atomic replica of my reward button is pushed)
There would then be a relatively small subspace of goals that maximizes some function of nervous system states you’re currently confident are currently physically instantiated. But “I do what is best for all” makes it sound like “best” and “all” are simple concepts, and like there isn’t a huge space of different conceptions of “best” and different conceptions of “all”.
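To make that concrete, here is a minimal sketch (my own illustration, not anything from the post) of what a goal that can “actually function as lines of code in a program selecting between actions” looks like: just some function from a predicted world state to a number, with every such function being a different goal. The field names and the choose_action helper are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WorldState:
    # Toy world-state features; a real goal specification would need far richer inputs.
    diamond_carbon_atoms: int
    granite_spheres: int
    reward_button_presses: int

# A "fully specified" goal is just some function from a (predicted) world state to a score.
Goal = Callable[[WorldState], float]

# Two arbitrary members of the giant space of such goals:
goal_diamonds_plus_button: Goal = lambda s: s.diamond_carbon_atoms + 44.3 * s.reward_button_presses
goal_granite_spheres: Goal = lambda s: float(s.granite_spheres)

def choose_action(actions: List[str], predict: Callable[[str], WorldState], goal: Goal) -> str:
    """Pick whichever action leads to the predicted world state the goal scores highest."""
    return max(actions, key=lambda action: goal(predict(action)))
```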
There exists a different tractable goal for every tractable function from ‘the configuration of all nervous systems in the universe’ to outcomes.
Indeed, there further exists a different tractable goal for every tractable function from physical systems to outcomes. Most obviously, different goals will focus on different notions of ‘agent’ or ‘person’ or ‘mind’—anything works fine here, from ‘all nervous systems’ to ‘all instantiated algorithms that mentally represent future scenes and decide which direction to walk based on the scenes they represent’. But also, a large number of goals will optimize some function of all aardvark elbows currently instantiated in the universe.
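A small follow-on sketch (again my own, with made-up feature names) of why this space is so large: every choice of “what to count” and “how to score it” is a separate, equally well-defined goal, whether it tracks nervous systems, scene-representing walkers, or aardvark elbows.

```python
from typing import Callable, Dict

# Toy world description: counts of various physical configurations.
WorldState = Dict[str, float]

def make_goal(feature: str, score: Callable[[float], float]) -> Callable[[WorldState], float]:
    """Build a goal from a choice of what to measure and how to value the measurement."""
    return lambda state: score(state.get(feature, 0.0))

# Swapping either component yields a different fully specified goal:
goal_nervous_systems = make_goal("satisfied_nervous_systems", lambda x: x)
goal_walking_algorithms = make_goal("scene_representing_walkers", lambda x: x)
goal_aardvark_elbows = make_goal("aardvark_elbows", lambda x: x ** 2)
```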
What I am talking about is the dimension from cooperation to conflict, i.e. jointly optimising the preferences of all interacting agents versus optimising one set of preferences at the expense of the preferences of the other involved agents.
This is a dimension that any sufficiently intelligent agent trained on situations involving multi-agent interaction will learn, even if it is only trained to maximize its own set of preferences. It’s a concept that is independent of the preferences of the agents in the individual interactions, so the definition of “best” is really not relevant at this level of abstraction.
It’s probably one of the most basic and most general dimensions of action in multi-agent settings, and one of only very few that are always salient.
That’s why I say that the two poles of that dimension in agent behavior are wells. (They are both very wide wells I would think. When I said “deep” I meant something more like “hard to dislodge from”.)
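For what it’s worth, here is a toy formalisation of that dimension (my own sketch, with invented utilities), where a single weight w interpolates between the “best for me” pole and the “best for all” pole while saying nothing about what any agent’s preferences actually are:

```python
from typing import Callable, List, Sequence

Outcome = str
Utility = Callable[[Outcome], float]

def social_score(outcome: Outcome, utilities: Sequence[Utility], actor: int, w: float) -> float:
    """Weight the actor's own utility by w and spread the remaining 1 - w over the others."""
    others = [u(outcome) for i, u in enumerate(utilities) if i != actor]
    other_term = (1 - w) * sum(others) / max(len(others), 1)
    return w * utilities[actor](outcome) + other_term

def choose(outcomes: List[Outcome], utilities: Sequence[Utility], actor: int, w: float) -> Outcome:
    return max(outcomes, key=lambda o: social_score(o, utilities, actor, w))

# Example: two agents with opposed preferences over splitting a resource.
u_a: Utility = lambda o: {"keep": 1.0, "split": 0.6, "give": 0.0}[o]
u_b: Utility = lambda o: {"keep": 0.0, "split": 0.6, "give": 1.0}[o]
print(choose(["keep", "split", "give"], [u_a, u_b], actor=0, w=1.0))  # 'keep'  (conflict pole)
print(choose(["keep", "split", "give"], [u_a, u_b], actor=0, w=0.5))  # 'split' (cooperation pole)
```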