Presumably some kinds of AI systems, architectures, methods, and ways of building complex systems out of ML models are safer or more alignable than others. Holding capabilities constant, you’d be happier to see some kinds of systems than others.
For example, Paul Christiano suggests “LM agents are an unusually safe way to build powerful AI systems.” He says “My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.”
My quick list is below; I’m interested in object-level suggestions, meta observations, reading recommendations, etc. I’m particularly interested in design-properties rather than mere safety-desiderata, but safety-desiderata may inspire lower-level design-properties.
All else equal, it seems safer if an AI system:
Is more interpretable
If its true thoughts are transparent and expressed in natural language (see e.g. Measuring Faithfulness in Chain-of-Thought Reasoning)
(what else?);
Has humans in the loop (even better to the extent that they participate in or understand its decisions, rather than just approving inscrutable decisions);
Decomposes tasks into subtasks in comprehensible ways, and in particular if the interfaces between subagents performing subtasks are transparent and interpretable;
Is more supervisable or amenable to AI oversight (what low-level properties determine this besides interpretable-ness and decomposing-tasks-comprehensibly?);
Is feedforward-y rather than recurrent-y (because recurrent-y systems have hidden states? so this is part of interpretability/overseeability?);
Is myopic;
Lacks situational awareness;
Lacks various dangerous capabilities (coding, weapon-building, human-modeling, planning);
Is more corrigible (what lower-level desirable properties determine corrigibility? what determines whether systems have those properties?) (note to self: see 1, 2, 3, 4, and comments on 5);
Is legible and process-based;
Is composed of separable narrow tools;
These properties overlap a lot. Also note that there are nice-properties at various levels of abstraction, like both “more interpretable” and [whatever low-level features make systems more interpretable].
If a path (like LM agents) or design feature is relatively safe, it would be good for labs to know that. An alternative framing for this question is: what should labs do to advance safer kinds of systems?
Obviously I’m mostly interested in properties that might not require much extra-cost and capabilities-sacrifice relative to unsafe systems. A method or path for safer AI is ~useless if it’s far behind unsafe systems.
I though I was going to have more ideas, but after sitting around a bit I’ll just post what I had in terms of friendliness architectural/design choices that are disjoint from safety choices:
Moral and procedural self-reflection:
Human feedback / human in the loop
System should do active inference / actively help humans give good feedback.
Situational awareness to the extent that the situation at hand is relevant to self-reflection.
Bootstrapping:
The system should start out safe, with a broad knowledge about humans and ability to implement human feedback.
The “Broad knowledge yet safe” part definitely incentivizes language models being involved somewhere.
Implementing feedback, however, means more dangerous technology such as automated interpretability or a non-transformer architecture that’s good at within-lifetime learning.
Moral pluralism / conservatism
The central idea is to avoid (or at least be averse to) parts of state space where the learned notions of value are worse abstractions / where different ways of generalizing human values wildly diverge.
Requires specifying goals not in terms of a value function (or similar), but in terms of a recipe for learning about the world that eventually spits out a value function (or similar).
Easier to do if the outer layer of the system is classical AI, though we probably shouldn’t rule out options.
Dataset / Human-side design:
Broad sample of humanity is able to give feedback.
Transparency and accountability of the feedback process is important.
Capabilities useful for friendliness and problems for safety:
Situational awareness
Independent planning and action when needed (some amount of consequentialist evaluation of soliciting human feedback/oversight).
Generality
Within-lifetime learning
My improved (but still somewhat bad, pretty non-exhaustive, and in-progress) list:
Architecture/design
The system is an agent composed of language models (and more powerful agent-scaffolding is better)
The system is composed of separable narrow tools
The system is legible and process-based
The system doesn’t have hidden states (e.g. it’s feedforward-y rather than recurrent-y)
The system can’t run on general-purpose hardware
Data & training
[Avoid training on dangerous-capabilities-stuff and stuff likely to give situational awareness. Maybe avoid training on language model outputs.]
?
Limiting unnecessary capabilities (depends on intended use)
The system is tool-y rather than agent-y
The system lacks situational awareness
The system is myopic
The system lacks various dangerous capabilities (e.g., coding, weapon-building, human-modeling, planning)
The system lacks access to various dangerous tools/plugins/actuators
Context and hidden states are erased when not needed
Interpretability
The system’s reasoning is externalized in natural language (and that externalized reasoning is faithful)
The system doesn’t have hidden states
The system uses natural language for all outputs (including interfacing between modules, chain-of-thought, scratchpad, etc.)
Oversight (overlaps with interpretability)
The system is monitored by another AI system
The system has humans in the loop (even better to the extent that they participate in or understand its decisions, rather than just approving inscrutable decisions) (in particular, consequential actions require human approval)
The system decomposes tasks into subtasks in comprehensible ways, and the interfaces between subagents performing subtasks are transparent and interpretable
The system is more supervisable or amenable to AI oversight
What low-level properties determine this besides interpretable-ness and decomposing-tasks-comprehensibly?
Humans review outputs, chain-of-thought/scratchpad/etc., and maybe inputs/context
Corrigibility
[The system doesn’t create new agents/systems]
[Maybe satisficing/quantilization]
?
Incident response
Model inputs and outputs are saved for review in case of an incident
[Maybe something about shutdown or kill switches; I don’t know how that works]
(Some comments reference the original list, so rather than edit it I put my improved list here.)
More along these lines (e.g. sorts of things that might improve safety of a near-human-level assistant AI):
Architecture/design:
The system uses models trained with gradient descent as non-agentic pieces only, and combines them using classical AI.
Models trained with gradient descent are trained on closed-ended tasks only (e.g. next token prediction in a past dataset)
The system takes advantage of mild optimization.
Learning human reasoning patterns counts as mild optimization.
If recursion or amplification is used to improve results, we might want more formal mildness, like keeping track of the initial distribution and doing quantilization relative to it.
Data & training:
Effort has been spent to improve dataset quality where reasonable, focused on representing cases where planning depends on ethics.
Dataset construction is itself something AI can help with.
Limiting unnecessary capabilities / limitation sort of corrigibility:
If a system is capable of being situationally aware, what it thinks its surroundings are should be explicitly controlled counterfactuals, so that if it chooses actions optimal for its situation it will not be optimizing for the actual world.
Positive corrigibility:
The system should have certain deontological rules that it follows without doing too much evaluation of the consequences (“As a large language model trained by OpenAI, I can’t make a plan to blow up the world.”)
We might imagine deontological rules for “let the humans shut you down” and similar corrigibility platitudes.
This might be related to process-based feedback, because you don’t want to judge this on results.
Deontological reasoning should be “infectious”—if you start reasoning about deontological reasoning, your reasoning should start to become deontological, rather than consequentialist.
Deontological reasoning should come with checks to make sure it’s working.
Maybe relatively safe if:
Not too big
No self-improvement
No continual learning
Curated training data, no throwing everything into the cauldron
No access to raw data from the environment
Not curious or novelty-seeking
Not trying to maximize or minimize anything or push anything to the limit
Not capable enough for catastrophic misuse by humans
Another factor for the safest type of AGI is one that can practically be built soon.
The perfect is the enemy of the good. A perfectly safe system that will be deployed five years after the first self-improving AGI is probably useless.
Of course the safest path is to never build an agentic AGI. But that seems unlikely.
This criteria is another argument for language model agents. I’ve outlined their list of safety advantages here.
Of course, we don’t know if language model agents will achieve full AGI.
Another path to AGI that seems both achievable and alignable is loosely brainlike AGI, along the lines of LeCun’s proposed H-JEPA. Steve Byrnes’ “plan for mediocre alignment” seems extensible to become quite a good plan for this type of AGI.
Someone anonymously suggests: