New(ish) AI control ideas
The list of posts is getting unwieldy, so I’ll post the up-to-date stuff at the beginning:
Human inconsistencies:
Bias in rationality is much worse than noise
Learning values, or defining them?
Rationality and overriding rewards: combined model
Humans can be assigned any values whatsoever...
Resolving human inconsistency in a simple model
Short term vs long term preferences
Reward function learning:
Reward function learning
Biased reward learning
Counterfactuals on POMDPs
Uninfluenceable agents
Counterfactually uninfluenceable agents
An algorithm with preferences: from zero to one variable
New circumstances, new values?
Understanding humans:
Humans as truth channel
True understanding comes from passing exams
Understanding the important facts
Framework:
Three human problems and one AI issue
Acausal trade:
Introduction
Double decrease
Pre-existence deals
Full decision algorithms
Breaking acausal trade
Trade in different types of utility functions
Being unusual
Conclusion
Oracle designs:
Three Oracle designs
Extracting human values:
Divergent preferences and meta-preferences
Engineered fanatics vs yes-men
Learning doesn’t solve philosophy of ethics
Models of human irrationality
Heroin model: AI “manipulates” “unmanipulatable” reward
Stratified learning and action
(C)IRL is not solely a learning process
Learning (meta-)preferences
What does an imperfect agent want?
Window problem for manipulating human values
Abstract model of human bias
Paul Christiano’s post
Corrigibility:
Corrigibility thoughts I: caring about multiple things
Corrigibility thoughts II: the robot operator
Corrigibility thoughts III: manipulating versus deceiving
Corrigibility through stratification
Cake or Death toy model for corrigibility
Learning values versus indifference
Conservation of expected ethics/evil (isn’t enough)
Guarded learning
Indifference:
Translation “counterfactual”
Indifference and compensatory rewards
All the indifference designs
The “best” value indifference method
Double indifference
Corrigibility for AIXI
Indifference utility functions
AIs in virtual worlds:
AIs in virtual worlds
The alternate hypothesis for AIs in virtual worlds
AIs in virtual worlds: discounted mixed utility/reward
Simpler, cruder, virtual world AIs
True answers from AI:
True answers from AIs
What to do with very low probabilities
Summary of true answers from AIs
AI’s printing the expected utility of the utility it’s maximising
Low impact vs low side effects
Miscellanea:
Confirmed Selective Oracle
One weird trick to turn maximisers into minimisers
The overfitting utility problem for value learning AIs
Change utility, reduce extortion
Agents that don’t become maximisers
Emergency learning
Thoughts on quantilizers
The radioactive burrito and learning from positive examples
Ontology, lost purposes, and instrumental goals
How to judge moral learning failure
Migrating my old post over from Less Wrong.
I recently went on a two-day intense solitary “AI control retreat”, with the aim of generating new ideas for making safe AI. The “retreat” format wasn’t really a success (“focused uninterrupted thought” was the main gain, not “two days of solitude”; it would have been more effective in three-hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying bloodthirsty audience (that’s you, folks) to test them for viability.
A central thread running through these ideas could be: if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach.
To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:
1. The AI is much smarter than us.
2. It’s not well defined.
3. The setup can be hacked:
   - By the agent.
   - By outsiders, including other AI.
   - Adding restrictions encourages the AI to hack them, not obey them.
4. The agent will resist changes.
5. Humans can be manipulated, hacked, or seduced.
6. The design is not stable:
   - Under self-modification.
   - Under subagent creation.
   - Unrestricted search is dangerous.
7. The agent has, or will develop, dangerous goals.
Important background ideas:
Utility Indifference
Safe value change
Corrigibility
Reduced impact AI
I decided to try to attack as many of these objections as I could, head on, and see if there was any way of turning them aside. A key concept is that we should never just expect a system to behave “nicely” by default (see e.g. here). If we want that, we should define what “nicely” means, and put it in by hand.
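To make that point concrete, here is a minimal toy sketch (my illustration, not anything from the posts below): the safety property is defined explicitly and added to the objective by hand, rather than hoped for as a free by-product of optimisation. The names `task_reward`, `niceness_penalty` and the weight `lam` are hypothetical stand-ins for whatever formal definition of “nicely” one actually settles on.

```python
# Purely illustrative sketch: "nicely" is written into the objective
# explicitly, not assumed to emerge for free from some other approach.
# task_reward, niceness_penalty and lam are hypothetical placeholders.

def task_reward(state):
    """Whatever the designer actually wants optimised (placeholder)."""
    return state.get("progress", 0.0)

def niceness_penalty(state):
    """A hand-written formalisation of 'behaving nicely': here, a crude
    penalty on measured side effects (placeholder)."""
    return state.get("side_effects", 0.0)

def objective(state, lam=10.0):
    # The safety term is specified by hand and traded off against the task.
    return task_reward(state) - lam * niceness_penalty(state)

print(objective({"progress": 1.0, "side_effects": 0.2}))  # 1.0 - 10*0.2 = -1.0
```

The point is only that the “nicely” term has to be specified somewhere in the design; the hard work, of course, is in defining it well.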
I came up with sixteen main ideas, of varying usefulness and quality, which I will be posting in comments over the coming weekdays (the following links will go live after each post). The ones I feel are most important (or most developed) are:
Anti-pascaline agent
Anti-restriction-hacking (EDIT: I have big doubts about this approach, currently)
Creating a satisficer (EDIT: I have big doubts about this approach, currently)
Crude measures
False miracles
Intelligence modules
Models as definitions
Added: Utility vs Probability: idea synthesis
While the less important or developed ideas are:
Added: A counterfactual and hypothetical note on AI design
Added: Acausal trade barriers
Anti-seduction
Closest stable alternative
Consistent Plato
Defining a proper satisficer
Detecting subagents
Added: Humans get different counterfactual
Added: Indifferent vs false-friendly AIs
Resource gathering and pre-corrigied agent
Time-symmetric discount rate
Values at compile time
What I mean
Please let me know your impressions on any of these! The ideas are roughly related to each other as follows (where the arrow Y→X can mean “X depends on Y”, “Y is useful for X”, “X complements Y on this problem” or even “Y inspires X”):
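The original relation diagram isn’t reproduced here. Purely as an illustration of the kind of structure it describes (a directed graph with labelled arrows), here is a small sketch; the relation labels come from the sentence above, but the example edges are hypothetical placeholders, not the actual relations from the diagram.

```python
# Illustrative only: a labelled directed graph of idea relations.
# The example edges are placeholders, not the author's actual diagram.

RELATION_TYPES = {"depends on", "useful for", "complements", "inspires"}

# Each edge (Y, X, relation) is an arrow Y -> X.
edges = [
    ("idea Y", "idea X", "inspires"),    # placeholder
    ("idea Y", "idea Z", "useful for"),  # placeholder
]

def arrows_from(idea, edges):
    """All arrows leaving `idea`, with their labelled relation."""
    return [(target, rel) for source, target, rel in edges if source == idea]

print(arrows_from("idea Y", edges))
```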
EDIT: I’ve decided to use this post as a sort of central repository of my new ideas on AI control. So I’m adding the following links:
Short tricks:
Un-optimised vs anti-optimised
Anti-Pascaline satisficer
An Oracle standard trick
High-impact from low impact:
High impact from low impact
High impact from low impact, continued
Help needed: nice AIs and presidential deaths
The president didn’t die: failures at extending AI behaviour
Green Emeralds, Grue Diamonds
Grue, Bleen, and natural categories
Presidents, asteroids, natural categories, and reduced impact
High impact from low impact, best advice:
The AI as “best” human advisor
Using chatbots or set answers
Overall meta-thoughts:
An overall schema for the friendly AI problems: self-referential convergence criteria
The subagent problem is really hard
Tackling the subagent problem: preliminary analysis
Pareto-improvements to corrigible agents:
Predicted corrigibility: pareto improvements
AIs in virtual worlds:
Using the AI’s output in virtual worlds: cure a fake cancer
Having an AI model itself as virtual agent in a virtual world
How the virtual AI controls itself
Low importance AIs:
Counterfactual agents detecting agent’s influence
Wireheading:
Superintelligence and wireheading
AI honesty and testing:
Question an AI to get an honest answer
The Ultimate Testing Grounds
The mathematics of the testing grounds
Utility, probability, and false beliefs
Goal completion:
Extending the stated objectives
Goal completion: the rocket equations
Goal completion: algorithm ideas
Goal completion: noise, errors, bias, prejudice, preference and complexity