>It seems like all of the many correct answers to what X would’ve wanted might not include the AGI killing everyone.
Yes, but if it wants to kill everyone it would pick one which does. The space “all possible actions” also contains some friendly actions.
>Wrt the continuity property, I think Max Harms’s corrigibility proposal has that
I think it understands this and is aiming to have that, yeah. It looks like a lot of work needs to be done to flesh it out.
I don’t have a good enough understanding of ambitious value learning & Roger Dearnaley’s proposal to properly comment on these. Skimming + priors put fairly low odds on them dealing with this in the proper manner, but I could be wrong.
JuliaHP
The step from “tell AI to do Y” to “AI does Y” is a big part of the entire alignment problem. The reason chatbots might seem aligned in this sense is that the thing you ask for often lives in a continuous space, and when not too much optimization pressure is applied, Y+epsilon is good enough when you ask for Y. This ceases to be the case when your Y is complicated and high optimization pressure is applied, UNLESS you can find a Y which has a strong continuity property in the sense you care about, and I am unaware of anyone who knows how to find one.
Not to mention that “Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do” does not narrow down behaviour to a small enough space. There will be many interpretations that look reasonable to you, and many of them can be satisfied while still allowing the AI to kill everyone.
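To make the Y+epsilon point concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: `true_value` stands in for the Y you meant, `proxy_value` for the Y+epsilon you actually specified, and “optimization pressure” is crudely modelled as the size of the range the optimizer is allowed to search. It is not a model of any real system.

```python
from typing import Callable


def true_value(x: float) -> float:
    """Hypothetical 'what we actually want': best at x = 1."""
    return -(x - 1.0) ** 2


def proxy_value(x: float, eps: float = 0.01) -> float:
    """Hypothetical specified target: the true value plus a small error term."""
    return true_value(x) + eps * x ** 3


def optimize(objective: Callable[[float], float],
             search_radius: float,
             steps: int = 10_001) -> float:
    """Grid search over [0, search_radius]; a larger radius means more pressure."""
    candidates = [search_radius * i / (steps - 1) for i in range(steps)]
    return max(candidates, key=objective)


# Weak pressure: the proxy optimum sits right next to the true optimum.
weak = optimize(proxy_value, search_radius=2.0)
# Strong pressure: the small error term dominates and the optimum runs away.
strong = optimize(proxy_value, search_radius=1000.0)

print("weak:  ", weak, true_value(weak))
print("strong:", strong, true_value(strong))
```

In the narrow search the error term barely moves the answer, so Y+epsilon is good enough. In the wide search the optimizer ends up at the edge of the range, where the proxy scores very well and the true value is terrible. A Y with the continuity property would be one where this does not happen however hard you search.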
While I have a lot of respect for many of the authors, this work feels to me like it’s mostly sweeping the big problems under the rug. It might at most be useful for AI labs to make a quick buck, or do some safety-washing, before we all die. I might be misunderstanding some of the approaches proposed here, and some of my critiques might be invalid as such.
My understanding is that the paper proposes that the AI implements and works with a human-interpretable world model, and that safety specifications are given in this world-model/ontology.
But given an ASI with such a world model, I don’t see how one would specify properties such as “hey please don’t hyperoptimize squiggles or goodhart this property”. Any specification I can think of mostly leaves room for the AI to abide by it, and still kill everyone somehow. This recurses back to “just solve alignment/corrigibility/safe-superintelligent-behaviour”.
Never mind getting an AI that actually performs all of its cognition in the ontology you provided for it (that would probably count as real progress to me). How do you know that just because the internal ontology says “X”, “X” is what the AI actually does? See this post.
If you are going to prove vague things about your AI and have it be any use at all, you’d want to prove properties in the style of “this AI has the kind of ‘cognition/mind’ which it is ‘beneficial for the user’ to have running rather than not” and “this AI’s ‘cognition/mind’ lies in an ‘attractor space’ where violated assumptions, bugs and other errors cause the AI to follow the desired behavior anyway”.
For sufficiently powerful systems, having proofs about output behavior mostly does not narrow down your space to safe agents. You want proofs about their internals. But that requires having a less confused notion of what to ask for in the AI’s internals such that it is a safe computation to run, never mind formally specifying it. I don’t understand, and haven’t found anyone who seems to understand, enough of the relevant properties of minds, what it means for something to be ‘beneficial to the user’, or how to construct powerful optimizers which fail non-catastrophically. It appears to me that we’re not bottlenecked on proving these properties, but rather that the bottleneck is identifying and understanding what shape they have.
I do expect some of these approaches to, in the very limited scope of things you can formally specify, allow for more narrow AI applications, promote AI investments, give rise to new techniques and non-trivially shorten the time until we are able to build superhuman systems. My vibes regarding this are made worse by how various existing methods are listed in the “safety ranking”. It lists RLHF, Constitutional AI & Model-free RL as safer than unsupervised learning, but to me it seems like these methods instill stable agent-like behavior on top of a prediction-engine, where there previously was either none or nearly none. They make no progress on the bits of the alignment problem which matter, but do let AI labs create new and better products, make more money, fund more capabilities research etc. I predict that future work along these lines will mostly have similar effects: little progress on the bits which matter, but useful capabilities insights along the way, which get incorrectly labeled as alignment.
You can totally have something which is trying to kill humanity in this framework though. Imagine something in the style of chaos-GPT, locally agentic & competent enough to use state-of-the-art AI biotech tools to synthesize dangerous viruses or compounds to release into the atmosphere. (Note that in this example the critical part is the narrow-AI biotech tools, not the chaos-agent.)
You don’t need solutions to embedded agency, goal-content integrity & the like to build this. It is easier to build and is earlier in the tech-tree than crisp maximizers. It will not be stable enough to coherently take over the lightcone. Just coherent enough to fold some proteins and print them.
But why would anyone do such a stupid thing?
Unless I misunderstand the confusion, a useful line of thought which might resolve some things:
Instead of analyzing whether you yourself are conscious or not, analyze what is causally upstream of your mind thinking that you are conscious, or your body uttering the words “I am conscious”.
Similarly, you could analyze whether an upload would think similar thoughts, or say similar things. What about a human doing manual computations? What about a pure mathematical object?
A couple of examples of where to go from there:
- If they have the same behavior, perhaps they are the same?
- If they have the same behavior, but you still think there is a difference, try to find out why you think there is a difference, what is causally upstream of this thought/belief?
Many more are engaged in AI Safety in other ways, e.g. as PhD students or independent researchers. These are just the positions we know about. We have not yet done a comprehensive survey.
Worth mentioning that most of the Cyborgism community founders came out of or did related projects in AISC beforehand.
I interpret the post you linked as trying to solve the problem of pointing to things in the real world. Being able to point to things in the real world in a way which is ontologically robust is probably necessary for alignment. However, “gliders”, “strawberries” and “diamonds” seem like incredibly complicated objects to point to in a way which is ontologically robust, and it is not clear that being able to point to these objects actually leads to any kind of solution.
What we are interested in is research into how to create a statistically unique enough piece of data and how to reliably point to it. Pointing to pure information seems like it would be more physics-independent and run into fewer issues with ontological breakdowns.
The QACI scheme allows us to construct more complicated formal objects, using counterfactuals on these pieces of data, out of which we are able to construct a long reflection process.
Recently we modified QACI to give a scoring over actions, instead of over worlds. This should allow weaker systems inner aligned to QACI to output weaker non-DSA actions, such as the textbook from the future, or just human readable advice on how to end the acute risk period. Stronger systems might output instructions for how to go about solving corrigible AI, or something to this effect.
As for diamonds, we believe this is actually a harder problem than alignment, and it’s a mistake to aim at it. Solving diamond-maximization requires us to point at what we mean by “maximizing diamonds” in physics in a way which is ontologically robust. QACI instead gives us an easier target: informational data blobs which causally relate to a human. The cost is that we now give up power to that human user to implement their values, but this is no issue since that is what we wanted to do anyway. If the humans in the QACI interval were actually pursuing diamond-maximization, instead of some form of human values, QACI would solve diamond-maximization.
Transfer learning is dubious; doing philosophy has worked pretty well for me thus far for learning how to do philosophy. More specifically, pick a topic you feel confused about or a problem you want to solve (AI kill everyone oh no?). Sit down and try to do original thinking, and probably use some external tool of preference to write down your thoughts. Then introspect, live or afterwards, on whether your process is working and how you can improve it. Repeat.
This might not be the most helpful, but most people seem to fail at “being comfortable sitting down and thinking for themselves”, and empirically being told to just do it seems to work.
Maybe one crucial object level bit has to do with something like “mining bits from vague intuitions” like Tsvi explains at the end of this comment, idk how to describe it well.