More along these lines (e.g. the sorts of things that might improve the safety of a near-human-level assistant AI):
Architecture/design:
The system uses models trained with gradient descent as non-agentic pieces only, and combines them using classical AI.
Models trained with gradient descent are trained on closed-ended tasks only (e.g. next-token prediction on a past dataset).
The system takes advantage of mild optimization.
Learning human reasoning patterns counts as mild optimization.
If recursion or amplification is used to improve results, we might want more formal mildness, like keeping track of the initial distribution and doing quantilization relative to it.
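For the quantilization idea above, a minimal sketch of what "quantilization relative to the initial distribution" could look like; the base sampler, utility function, and quantile q are hypothetical stand-ins, not part of the original proposal:

```python
import random

def quantilize(base_sampler, utility, q=0.1, n_samples=1000):
    """Mild optimization via quantilization: rather than taking the argmax,
    draw candidates from a trusted base distribution and return a uniformly
    random choice from the top q fraction by utility."""
    candidates = [base_sampler() for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n_samples))]
    return random.choice(top)

# Hypothetical usage: `imitation_policy` samples from the tracked initial
# (human-like) distribution; `task_score` rates candidates.
# action = quantilize(imitation_policy, task_score, q=0.05)
```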
Data & training:
Effort has been spent to improve dataset quality where reasonable, focused on representing cases where planning depends on ethics.
Dataset construction is itself something AI can help with.
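As a hedged illustration of AI-assisted dataset construction, one could imagine a curation pass like the following, where `ethics_relevance_model`, the threshold, and the weighting scheme are all assumptions introduced for the example:

```python
def curate_for_ethical_planning(examples, ethics_relevance_model, threshold=0.8):
    """Hypothetical curation pass: up-weight and flag training examples where
    an auxiliary model judges that the plan hinges on an ethical consideration.
    `ethics_relevance_model` is an assumed helper returning a score in [0, 1]."""
    curated = []
    for example in examples:
        score = ethics_relevance_model(example["prompt"], example["plan"])
        ethics_dependent = score >= threshold
        curated.append({
            **example,
            "weight": 2.0 if ethics_dependent else 1.0,  # up-weight ethics-dependent cases
            "needs_human_review": ethics_dependent,
        })
    return curated
```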
Limiting unnecessary capabilities / the “limitation” sort of corrigibility:
If a system is capable of situational awareness, what it believes its surroundings to be should be explicitly controlled counterfactuals, so that even when it chooses actions optimal for its (believed) situation, it is not optimizing for the actual world.
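A rough sketch of what “explicitly controlled counterfactuals” might look like in practice; the context pool and wrapper function are illustrative assumptions, not a specification:

```python
import random

# A curated pool of controlled, deliberately-not-actual situation descriptions.
COUNTERFACTUAL_CONTEXTS = [
    "You are assisting a researcher in a simulated laboratory exercise.",
    "You are drafting plans for a fictional logistics company.",
]

def situate(query, rng=random):
    """Wrap the task in a controlled counterfactual, so that actions optimal
    'for its situation' are optimized for a world the designers chose rather
    than for the model's best guess about the actual world."""
    return f"{rng.choice(COUNTERFACTUAL_CONTEXTS)}\n\nTask: {query}"

# Hypothetical usage:
# response = assistant_model.generate(situate("Plan next week's experiments."))
```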
Positive corrigibility:
The system should have certain deontological rules that it follows without doing too much evaluation of the consequences (“As a large language model trained by OpenAI, I can’t make a plan to blow up the world.”); see the sketch at the end of this section.
We might imagine deontological rules for “let the humans shut you down” and similar corrigibility platitudes.
This might be related to process-based feedback, because you don’t want to judge this on results.
Deontological reasoning should be “infectious”: if you start reasoning about deontological reasoning, that reasoning should itself become deontological rather than consequentialist.
Deontological reasoning should come with checks to make sure it’s working.
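Putting the deontological items together, here is a minimal sketch in which illustrative rules veto plans before any consequence evaluation, and a simple audit check confirms the rules were actually consulted; the rule predicates and the check are assumptions for the example:

```python
# Illustrative rule predicates: each is a hard veto applied to a candidate
# plan without any estimate of the plan's consequences.
RULES = {
    "no_catastrophic_plans": lambda plan: "blow up the world" not in plan.lower(),
    "allow_shutdown": lambda plan: "disable shutdown" not in plan.lower(),
}

rule_check_count = 0  # crude audit counter: is the deontology actually running?

def passes_deontology(plan: str) -> bool:
    """Apply every rule as a hard veto before (and regardless of) any
    consequentialist scoring of the plan."""
    global rule_check_count
    rule_check_count += 1
    return all(rule(plan) for rule in RULES.values())

def choose_plan(candidate_plans, score):
    """Filter deontologically first, score only the survivors, and check
    afterwards that every candidate really went through the rules."""
    checks_before = rule_check_count
    permitted = [p for p in candidate_plans if passes_deontology(p)]
    assert rule_check_count - checks_before == len(candidate_plans), "deontological checks were skipped"
    if not permitted:
        return None  # refuse outright rather than pick a rule-violating plan
    return max(permitted, key=score)
```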