While I have a lot of respect for many of the authors, this work feels to me like it’s mostly sweeping the big problems under the rug. It might at most be useful for AI labs to make a quick buck, or do some safety-washing, before we all die. I might be misunderstanding some of the approaches proposed here, and some of my critiques might be invalid as a result.
My understanding is that the paper proposes that the AI implements and works with a human-interpretable world model, and that safety specifications are given in this world-model/ontology.
But given an ASI with such a world model, I don’t see how one would specify properties such as “hey, please don’t hyperoptimize squiggles or goodhart this property”. Any specification I can think of mostly leaves room for the AI to abide by it and still kill everyone somehow. This recurses back to “just solve alignment/corrigibility/safe-superintelligent-behaviour”.
Never mind getting an AI where it’s actually performing all cognition in the ontology you provided for it (that would probably count as real progress to me). How do you know that just because the internal ontology says “X”, “X” is what the AI actually does? See this post.
If you are going to prove vague things about your AI and have it be any use at all, you’d want to prove properties in the style of “this AI has the kind of ‘cognition/mind’ which it is more ‘beneficial for the user’ to have running than not” and “this AI’s ‘cognition/mind’ lies in an ‘attractor space’ where violated assumptions, bugs and other errors cause the AI to follow the desired behavior anyway”.
For sufficiently powerful systems, having proofs about output behavior mostly does not narrow down your space to safe agents. You want proofs about their internals. But that requires having a less confused notion of what to ask for in the AI’s internals such that it is a safe computation to run, never mind formally specifying it. I don’t have that understanding, and haven’t found anyone who seems to understand enough of the relevant properties of minds, what it means for something to be ‘beneficial to the user’, or how to construct powerful optimizers which fail non-catastrophically. It appears to me that we’re not bottlenecked on proving these properties; rather, the bottleneck is identifying and understanding what shape they have.
I do expect some of these approaches, in the very limited scope of things you can formally specify, to allow for more narrow AI applications, promote AI investment, give rise to new techniques, and non-trivially shorten the time until we are able to build superhuman systems. My vibes regarding this are made worse by how various existing methods are listed in a “safety ranking”. It lists RLHF, Constitutional AI & model-free RL as safer than unsupervised learning, but to me it seems like these methods instill stable agent-like behavior on top of a prediction engine, where there previously was either none or nearly none. They make no progress on the bits of the alignment problem which matter, but do let AI labs create new and better products, make more money, fund more capabilities research, etc. I predict that future work along these lines will mostly have similar effects: little progress on the bits which matter, but useful capabilities insights along the way, which get incorrectly labeled as alignment.