I would love to see more people write this sort of thing; it seems very high-value for newcomers trying to orient, for existing researchers to see how people understand/misunderstand their arguments, for the writers to accelerate their own process of orienting, and generally for people to understand the generators behind each other’s work and communicate better.
Some notes...
The best way to solve this problem is to specify a utility function that, for the most part, avoids instrumentally convergent goals (power seeking, preventing being turned off).
I’m not sure that a good formulation of corrigibility would look like a utility function at all.
Same with the “crux” later on:
The best way to make progress on alignment is to write down a utility function for an AI that:
Generalizes
Is robust to large optimization pressure
Specifies precisely what we want
Outer alignment need not necessarily look like a utility function. (There are good arguments that it will behave like a utility function, but that should probably be a derived fact rather than an assumed fact, at the very least.) And even if it is, there’s a classic failure mode in which someone says “well, it should maximize utility, so we want a function of the form u(X)...” and they don’t notice that most of the interesting work is in figuring out what the utility function is over (i.e. what “X” is), not the actual utility function.
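As a toy illustration of that failure mode (the setup and names below are invented purely for illustration): the same u gives opposite verdicts depending on whether its argument X is the world state we care about or the agent’s sensor readings, and all the real work is in that choice.

```python
from dataclasses import dataclass

@dataclass
class World:
    apples_grown: int     # the thing we actually care about
    camera_reading: int   # what the agent's sensors report

def u(x: int) -> float:
    """The 'easy' part: a monotone utility function over its argument."""
    return float(x)

# The hard part is choosing X, i.e. what u is a function *of*.
def utility_over_world_state(w: World) -> float:
    return u(w.apples_grown)

def utility_over_observations(w: World) -> float:
    return u(w.camera_reading)

tampered = World(apples_grown=0, camera_reading=10**6)  # spoofed the camera
honest = World(apples_grown=50, camera_reading=50)

# Same u, opposite conclusions, purely because of what "X" points at.
print(utility_over_world_state(tampered), utility_over_world_state(honest))    # 0.0 50.0
print(utility_over_observations(tampered), utility_over_observations(honest))  # 1000000.0 50.0
```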
Also, while we’re talking about a “crux”...
In each section, we’ve laid out some cruxes, which are statements that support that frame on the core of the alignment problem. These cruxes are not necessary or sufficient conditions for a problem to be central.
Terminological nitpick: the term “crux” was introduced to mean something such that, if you changed your mind about that thing, it would also probably change your mind about whatever thing we’re talking about (in this case, centrality of a problem). A crux is not just a statement which supports a frame.
We can get a soft-optimization proposal that works to solve this problem (instead of having the AGI hard-optimize something safe).
Not sure if this is already something you know, but “soft-optimization” is a thing we know how to do. The catch, of course, is that mild optimization can only do things which are “not very hard” in the sense that they don’t require very much optimization pressure.
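For concreteness, here is a minimal sketch of what I mean by soft optimization, in the quantilizer style (the toy setup is illustrative, not taken from any particular codebase): instead of taking the argmax, sample from the top-q fraction of actions under a base distribution, so the selection pressure applied is only about log(1/q) bits.

```python
import random

def argmax_action(actions, utility):
    """Hard optimization: pick the single highest-utility action."""
    return max(actions, key=utility)

def quantilize(actions, utility, q, rng=random):
    """Soft optimization: sample uniformly from the top-q fraction of
    actions, ranked by utility, under a uniform base distribution."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))  # keep at least one action
    return rng.choice(ranked[:cutoff])

# Toy example: 1000 candidate plans scored by a (possibly mis-specified) proxy.
plans = list(range(1000))
proxy_score = lambda plan: plan  # stand-in for a learned proxy utility

print(argmax_action(plans, proxy_score))    # always 999: full optimization pressure
print(quantilize(plans, proxy_score, 0.1))  # some plan in the top 100: ~3.3 bits of pressure
```

Shrinking q buys more proxy score, but it also concentrates more selection pressure on whatever the proxy gets wrong, which is exactly the “can only do not-very-hard things” catch.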
This problem being tractable relies on some form of the Natural Abstractions Hypothesis.
There is, ultimately, going to end up being a thing like “Human Values,” that can be pointed to and holds up under strong optimization pressure.
Note that whether human values or corrigibility or “what I mean” or some other direct alignment target is a natural abstraction is not strictly part of the pointers problem. Pointers problem is about pointing to latent concepts in general; whether a given system has an internal latent variable corresponding to “human values” specifically is a separate question.
Also, while tractability of the pointers problem does depend heavily on NAH, note that it’s still a problem (and probably an even more core and urgent one!) even if NAH turns out to be relatively weak.
Overall, the problem-summaries were quite good IMO.
Yeah, I agree in retrospect that a utility function isn’t a good formulation of corrigibility; we phrased it that way because we spent some time thinking about the MIRI corrigibility paper, which uses a framing like this to make it concrete.
On outer alignment: I think that if we have a utility function over universe histories/destinies, then this is a sufficiently general framing that any outer alignment solution should be expressible in it, although it might not end up being the most natural framing.
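(To spell the framing out a bit: a utility function over whole histories, with per-step reward as one special case.)

$$U : \mathcal{H} \to \mathbb{R}, \qquad h = (s_0, a_0, s_1, a_1, \ldots) \in \mathcal{H}$$
$$U(h) = \sum_{t \ge 0} \gamma^{t}\, r(s_t) \quad \text{(per-step reward as one special case)}$$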
On cruxes: Good point; we started off using the term pretty much correctly but ended up abusing it in those sections. Oops.
On soft optimization: We talked a lot about quantilizers; the quantilizers paper is among my favorite papers. I’m not yet convinced that the problems we would want an AGI to solve (in the near term) are in the “requires very high optimization pressure” category. But we did discuss how to improve the capabilities of quantilizers by adjusting the level of quantilization based on some upper bound on the uncertainty about the goal for the local task.
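To gesture at the shape of that idea (a rough sketch with assumed names and interface, not a worked-out proposal): the standard quantilizer guarantee bounds expected cost at 1/q times the base distribution’s expected cost, so an upper bound on goal uncertainty for the local task translates into a lower bound on q, i.e. a cap on how hard the quantilizer is allowed to optimize.

```python
def choose_q(goal_uncertainty_bound: float, acceptable_expected_cost: float,
             q_min: float = 1e-3) -> float:
    """Pick a quantilization level q for a local task.

    Relies on the standard quantilizer bound: relative to the base
    distribution, quantilizing at level q can amplify expected cost by at
    most a factor of 1/q.  So if `goal_uncertainty_bound` upper-bounds the
    base distribution's expected cost under the (unknown) true goal, then
    q >= goal_uncertainty_bound / acceptable_expected_cost keeps the
    quantilizer's expected cost within the acceptable level.
    """
    if goal_uncertainty_bound <= 0:
        return q_min  # proxy fully trusted on this task: optimize hard
    q = goal_uncertainty_bound / acceptable_expected_cost
    return min(1.0, max(q_min, q))  # clamp to (q_min, 1]

# Low uncertainty about the goal -> small q (close to hard optimization);
# high uncertainty -> q near 1 (barely optimize beyond the base policy).
print(choose_q(0.01, 0.5))  # 0.02
print(choose_q(0.40, 0.5))  # 0.8
```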
On the pointers problem: Yeah, we kind of mushed extra stuff into the pointers problem section because that was how our discussion went. Someone did also argue that the problem would be even bigger if the NAH were weak, but overall I thought that framing it this way would probably be a bad idea in that case.
I would be curious if you have any good illustrations of alternatives to utility functions.