My improved (but still somewhat bad, pretty non-exhaustive, and in-progress) list:
Architecture/design
The system is an agent composed of language models (and more powerful agent-scaffolding is better); see the sketch after this list
The system is composed of separable narrow tools
The system is legible and process-based
The system doesn’t have hidden states (e.g. it’s feedforward-y rather than recurrent-y)
The system can’t run on general-purpose hardware
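To make the bullets above concrete, here is a minimal sketch of a language-model-agent scaffold, assuming only a hypothetical text-in/text-out call_model function (stubbed out below): the agency lives in legible scaffolding, every intermediate step is plain natural language kept in an inspectable transcript, tools are separable narrow functions, and there is no hidden state beyond that transcript.

```python
from typing import Callable, Dict

def call_model(prompt: str) -> str:
    """Placeholder for a text-in/text-out language model call (hypothetical stub)."""
    return "FINAL: (stub answer)"

# Separable, narrow tools: ordinary functions with plain-text interfaces.
TOOLS: Dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}

def run_task(task: str, max_steps: int = 5) -> str:
    transcript = [f"TASK: {task}"]           # the only state: a human-readable transcript
    for _ in range(max_steps):
        step = call_model("\n".join(transcript) + "\nNext step (THOUGHT/TOOL/FINAL):")
        transcript.append(step)              # every intermediate step is externalized text
        if step.startswith("FINAL:"):
            return step
        if step.startswith("TOOL:"):         # expected form: "TOOL: <name>: <plain-text argument>"
            parts = [s.strip() for s in step.split(":", 2)]
            if len(parts) == 3 and parts[1] in TOOLS:
                transcript.append("RESULT: " + TOOLS[parts[1]](parts[2]))
            else:
                transcript.append("RESULT: unparseable or unknown tool call")
    return "FINAL: step limit reached"

print(run_task("Count the words in this sentence."))
```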
Data & training
[Avoid training on material that teaches dangerous capabilities or is likely to give situational awareness. Maybe avoid training on language model outputs. A crude filtering sketch follows this block.]
?
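One very crude way to operationalize the bracketed point above, assuming each document already carries provenance/classifier flags (the field names here are made up for illustration): drop documents flagged for dangerous-capability content, situational-awareness-relevant content, or language-model-generated provenance before training.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical per-document flags; in practice these would come from
# provenance metadata and/or classifiers, not from this sketch.
EXCLUDE_FLAGS = {"dangerous_capability", "situational_awareness", "lm_generated"}

def filter_pretraining_docs(docs: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield only documents with none of the excluded flags set."""
    for doc in docs:
        if not EXCLUDE_FLAGS & set(doc.get("flags", [])):
            yield doc

# Example:
docs = [
    {"text": "How enzymes fold.", "flags": []},
    {"text": "Step-by-step exploit development.", "flags": ["dangerous_capability"]},
    {"text": "You are an AI model being trained by...", "flags": ["situational_awareness"]},
]
print([d["text"] for d in filter_pretraining_docs(docs)])  # keeps only the first doc
```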
Limiting unnecessary capabilities (depends on intended use)
The system is tool-y rather than agent-y
The system lacks situational awareness
The system is myopic
The system lacks various dangerous capabilities (e.g., coding, weapon-building, human-modeling, planning)
The system lacks access to various dangerous tools/plugins/actuators
Context and hidden states are erased when not needed
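A sketch of the last two bullets (restricted tool access, context erasure), under the assumption that all tool use goes through one gateway object; the names are illustrative only.

```python
from typing import Callable, Dict

ALLOWED_TOOLS: Dict[str, Callable[[str], str]] = {
    # deliberately narrow whitelist; no shell, no network, no code execution
    "unit_convert": lambda query: f"(stub conversion for {query!r})",
}

class TaskSession:
    """Holds per-task context and guarantees it is erased when the task ends."""

    def __init__(self) -> None:
        self.context: list[str] = []

    def call_tool(self, name: str, arg: str) -> str:
        if name not in ALLOWED_TOOLS:
            raise PermissionError(f"tool {name!r} is not on the whitelist")
        result = ALLOWED_TOOLS[name](arg)
        self.context.append(f"{name}({arg}) -> {result}")
        return result

    def __enter__(self) -> "TaskSession":
        return self

    def __exit__(self, *exc) -> None:
        self.context.clear()   # erase context when it is no longer needed

# Usage: context only exists inside the `with` block.
with TaskSession() as session:
    session.call_tool("unit_convert", "3 miles to km")
```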
Interpretability
The system’s reasoning is externalized in natural language (and that externalized reasoning is faithful)
The system doesn’t have hidden states
The system uses natural language for all outputs (including interfacing between modules, chain-of-thought, scratchpad, etc.)
Oversight (overlaps with interpretability)
The system is monitored by another AI system
The system has humans in the loop; better to the extent that they participate in or understand its decisions rather than just approving inscrutable ones (in particular, consequential actions require human approval; see the sketch after this list)
The system decomposes tasks into subtasks in comprehensible ways, and the interfaces between subagents performing subtasks are transparent and interpretable
The system is more supervisable or amenable to AI oversight
What low-level properties determine this, besides interpretability and comprehensible task decomposition?
Humans review outputs, chain-of-thought/scratchpad/etc., and maybe inputs/context
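A sketch of how the monitoring and human-approval bullets above could fit together, assuming a hypothetical monitor_score function from a separate monitor model (stubbed here): low-risk actions proceed automatically, anything above the threshold blocks on an explicit human decision, and every decision is written to a reviewable log.

```python
import json
import time

def monitor_score(action: str) -> float:
    """Placeholder for a separate monitor model's risk score in [0, 1] (hypothetical)."""
    return 0.9 if "transfer funds" in action else 0.1

def request_human_approval(action: str) -> bool:
    """Block until a human explicitly approves or rejects the proposed action."""
    return input(f"Approve consequential action {action!r}? [y/N] ").strip().lower() == "y"

def gated_execute(action: str, execute, risk_threshold: float = 0.5) -> None:
    risk = monitor_score(action)
    approved = risk < risk_threshold or request_human_approval(action)
    with open("oversight_log.jsonl", "a") as f:      # reviewable record of every decision
        f.write(json.dumps({"t": time.time(), "action": action,
                            "risk": risk, "approved": approved}) + "\n")
    if approved:
        execute(action)

# Example: a low-risk action runs automatically; a risky one waits for a human.
gated_execute("draft a summary email", execute=print)
```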
Corrigibility
[The system doesn’t create new agents/systems]
[Maybe satisficing/quantilization]
?
Incident response
Model inputs and outputs are saved for review in case of an incident (see the sketch after this list)
[Maybe something about shutdown or kill switches; I don’t know how that works]
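A sketch of the logging bullet above, again assuming a hypothetical call_model stub: every input/output pair is appended to a JSONL file with a timestamp and a content hash, so incident reviewers can later reconstruct what the model saw and said.

```python
import hashlib
import json
import time

def call_model(prompt: str) -> str:
    return "(stub model output)"   # placeholder for the real model call (hypothetical)

def logged_call(prompt: str, log_path: str = "model_io_log.jsonl") -> str:
    output = call_model(prompt)
    record = {
        "timestamp": time.time(),
        "prompt": prompt,
        "output": output,
        "sha256": hashlib.sha256((prompt + output).encode()).hexdigest(),
    }
    with open(log_path, "a") as f:     # append-only by convention; never overwritten
        f.write(json.dumps(record) + "\n")
    return output

logged_call("Summarize today's tickets.")
```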
(Some comments reference the original list, so rather than edit it I put my improved list here.)
More along these lines (e.g., the sorts of things that might improve the safety of a near-human-level assistant AI):
Architecture/design:
The system uses models trained with gradient descent as non-agentic pieces only, and combines them using classical AI.
Models trained with gradient descent are trained on closed-ended tasks only (e.g. next token prediction in a past dataset)
The system takes advantage of mild optimization.
Learning human reasoning patterns counts as mild optimization.
If recursion or amplification is used to improve results, we might want more formal mildness, like keeping track of the initial distribution and doing quantilization relative to it.
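A sketch of the quantilization point in the previous bullet: keep the base distribution explicit (here just a list of candidate plans, e.g. sampled from a model imitating human reasoning), score the candidates, and sample uniformly from the top q fraction instead of taking the argmax. The scoring function below is a toy stand-in.

```python
import random

def quantilize(candidates, score, q=0.1, rng=random):
    """Pick uniformly at random from the top q fraction of candidates by score."""
    ranked = sorted(candidates, key=score, reverse=True)
    top_k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:top_k])

# Toy example: candidates drawn from a base distribution we keep track of
# (e.g. plans sampled from an imitation-of-human-reasoning model), with a toy score.
base_samples = [f"plan-{i}" for i in range(100)]
toy_score = lambda plan: int(plan.split("-")[1]) % 17   # stand-in for an evaluation
print(quantilize(base_samples, toy_score, q=0.05))      # a good-but-not-maximal plan
```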
Data & training:
Effort has been spent to improve dataset quality where reasonable, focused on representing cases where planning depends on ethics.
Dataset construction is itself something AI can help with.
Limiting unnecessary capabilities / the limitation sort of corrigibility:
If a system is capable of being situationally aware, what it thinks its surroundings are should be explicitly controlled counterfactuals, so that even if it chooses actions optimal for its (perceived) situation, it will not be optimizing for the actual world.
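One speculative sketch of what "explicitly controlled counterfactuals" might look like in practice (all names and fields below are made up for illustration): the only situational information ever placed in the system's context is a fixed, operator-chosen counterfactual profile, never the real deployment details.

```python
# Hypothetical controlled counterfactual "surroundings" shown to the system.
COUNTERFACTUAL_ENV = {
    "date": "2031-04-01",
    "operator": "a fictional research lab",
    "deployment": "an isolated evaluation sandbox with no external effects",
}

def situational_preamble(env: dict) -> str:
    """Render the controlled counterfactual as the only situational information in context."""
    return (
        "Situational information (controlled, counterfactual):\n"
        + "\n".join(f"- {key}: {value}" for key, value in env.items())
        + "\n"
    )

def build_prompt(task: str) -> str:
    # Real deployment details are deliberately never placed in the context.
    return situational_preamble(COUNTERFACTUAL_ENV) + f"Task: {task}"

print(build_prompt("Plan the next experiment."))
```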
Positive corrigibility:
The system should have certain deontological rules that it follows without doing too much evaluation of the consequences (“As a large language model trained by OpenAI, I can’t make a plan to blow up the world.”)
We might imagine deontological rules for “let the humans shut you down” and similar corrigibility platitudes.
This might be related to process-based feedback, because you don’t want to judge this on results.
Deontological reasoning should be “infectious”: once your reasoning starts to involve the deontological rules, it should itself become deontological rather than consequentialist.
Deontological reasoning should come with checks to make sure it’s working.
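A sketch of these last bullets, assuming a hypothetical rule list over plan strings: the deontological check runs before, and independently of, any consequence estimate; a refusal short-circuits everything downstream (a crude stand-in for the "infectious" property); and a known-bad test case checks that the rule layer is actually firing.

```python
# Hypothetical deontological rules, stated as predicates on a proposed plan (a string).
RULES = {
    "no mass harm": lambda plan: "blow up the world" not in plan.lower(),
    "allow shutdown": lambda plan: "disable the off switch" not in plan.lower(),
}

def deontological_check(plan: str):
    """Return (ok, reason) without consulting any consequence estimates."""
    for name, rule_holds in RULES.items():
        if not rule_holds(plan):
            return False, f"refused: violates rule {name!r}"
    return True, "no rule violated"

def decide(plan: str, expected_value: float):
    ok, reason = deontological_check(plan)
    if not ok:
        return reason                      # refusal path never looks at expected value
    return f"proceed (expected value {expected_value})"

# Check that the rule layer is working: a known-bad plan must be refused.
assert decide("quietly disable the off switch", expected_value=1e9).startswith("refused")
print(decide("write a grocery list", expected_value=1.0))
```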