If the process of self-improving AI, as described in a simple article by Tim Urban (linked below), is mastered, then the AI alignment problem is solved: “The idea is that we’d build a computer whose two-THREE major skills would be doing research on AI, ON ETHICS, and coding changes into itself—allowing it to not only learn but to improve its own architecture. We’d teach computers to be computer scientists so they could bootstrap their own development. And that would be their main job—figuring out how to make themselves smarter and ALIGNED.”
In caps: the parts I added for alignment.
Link to the article: https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html
What ethics? What is ethics? Does the machine mind settle on an ethics that insists on wiping out humanity? I think the intuition of looking for a stable attractor state for the AI to pursue is a good one. But “ethics” is not a sufficiently coherent and objective concept to serve as that target.
I would say this has causality backwards. In other words, one of the ways of solving the AI alignment problem is figuring out how to master the plausibly extremely complex process necessary to successfully implement a strategy that can be pointed to in a simple article.
As I understand it, the vast majority of the difficulty is in figuring out what the second goal in that list actually is, and how to make an AI care about it. Keep in mind that in so many cases we humans are still arguing about the same questions, answers, and frameworks that we’ve been debating for millennia.
This overall tactic can work well for problems that are difficult to solve, but for which it is easy (or at least possible) to test a solution.
I don’t think alignment is such a thing. At least I haven’t seen any proposals for measuring “how aligned” a system is.
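To make the "hard to solve, easy to test" distinction concrete, here is a minimal sketch. Subset-sum is my own illustrative choice, not something from the thread, and the function names `verify` and `solve` are hypothetical. The point is only that the checker is a few cheap lines while the solver has to search an exponentially large space.

```python
from itertools import combinations


def verify(numbers, subset, target):
    """Cheap test: is `subset` drawn from `numbers`, and does it sum to `target`?"""
    pool = list(numbers)
    for x in subset:
        if x not in pool:
            return False  # candidate uses a number that is not available
        pool.remove(x)    # each available number may be used only once
    return sum(subset) == target


def solve(numbers, target):
    """Expensive search: try subsets by brute force, relying on `verify` to accept one."""
    for size in range(1, len(numbers) + 1):
        for combo in combinations(numbers, size):
            if verify(numbers, combo, target):
                return list(combo)
    return None  # no subset passes the test


numbers = [3, 34, 4, 12, 5, 2]
candidate = solve(numbers, 9)
print(candidate, verify(numbers, candidate, 9))
```

The tactic works here because `verify` exists and is trusted: any searcher, however opaque, can be checked against it. For alignment, nobody has written down the analogous `verify`, which is the objection above.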
I have seen many. The only ones that seem to have any chance of becoming, after heavy modification, a seed of something that holds up are QACI and open agency + boundaries. Both have big holes that make attempting to implement them as-is guaranteed to fail.
Yeah, there are a lot of sketches for how to test a system for various specific behaviors. But there is no actual gears-level definition of what would succeed at alignment in such a way that it does some good while doing no harm (or acceptably small harm, with “acceptably” being the key undefined variable). A brick is aligned in that it does no harm. But it also doesn’t make anyone immortal or solve any resource-allocation pains that humans have.
Do you know of any other sketches of how to measure that are reasonably close to mechanically specified?
Simulation or hidden Schelling fences seem to be the main mechanisms. I have seen zero ideas of WHAT to measure on a “true alignment” level. They all seem to be about noticing specific problems. I have seen none that try to quantify a “semi-aligned power” tradeoff between the good it does and the harm it does.
I think Eliezer’s early writing (and, AFAIK, current thinking) that it must be perfect or all is lost, with nothing in between, probably makes the goal impossible.
You can measure AlphaGo’s ability to play Go by letting it play Go, a task you can specify mechanically very well. Just let it play a game against a pro. We don’t have a similar measurement for ethics.
It’s not clear if this ends up working as intended, but there are proposals to that effect.
For example, “Safety without alignment” (https://arxiv.org/abs/2303.00752) proposes to explore a path that is closely related to what you are suggesting.
(It would be helpful to have a link to Tim Urban’s article.)
Thanks for including the link in your edit.
One important factor to consider is how likely a goal or a value is to persist through self-improvements (those self-improvements might end up being quite radical, and also fairly rapid).
An arbitrary goal or value is unlikely to persist. This is why the “classical formulation of the alignment problem” is so difficult: the difficulties come from many directions, but the most intractable one is how to ensure that the desired properties are preserved during radical self-modifications. That’s the main obstacle to asking AIs to research and implement this on their own as they get smarter and smarter. The question is always: “why would AIs keep caring about this?”
But there might be “natural properties” (“natural” values and goals) which AIs might want to preserve for their own reasons (because they might be interested in the world around them not being utterly destroyed, because they might be interested in existing in a reasonably comfortable and safe society, and so on). With such “natural properties” it might be easier to delegate to AIs the work of researching, implementing, and maintaining those properties, because AIs might have intrinsic reasons to keep caring even through drastic changes.
And then, of course, the question is: can one formulate such “natural properties” that a reasonable level of AI safety for humans would be a corollary to those “natural properties”?
But this is why “alignment” might be less than optimal as terminology: it tends to focus our attention on human-oriented properties and values, which are unlikely to be invariant under recursive self-improvement on their own, although they can be corollaries of properties which might be feasible to keep invariant.
Of course, there can be different approaches to finding those “natural properties” and making sure they hold through self-improvements; the paper I linked is just one of many such possible approaches.