New(ish) AI control ideas
The list of posts is getting unwieldy, so I’ll post the up-to-date stuff at the beginning:
Human inconsistencies:
Reward function learning:
Understanding humans:
Framework:
Acausal trade:
Oracle designs:
Extracting human values:
Corrigibility:
Indifference:
AIs in virtual worlds:
True answers from AI:
Miscellanea:
Migrating my old post over from Less Wrong.
I recently went on a two-day intense solitary “AI control retreat”, with the aim of generating new ideas for making safe AI. The “retreat” format wasn’t really a success (“focused uninterrupted thought” was the main gain, not “two days of solitude”; it would have been more effective in three-hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying, bloodthirsty audience (that’s you, folks) to test them for viability.
A central thread running through these ideas could be: if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach.
To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:
1. The AI is much smarter than us.
2. It’s not well defined.
3. The setup can be hacked.
   - By the agent.
   - By outsiders, including other AI.
   - Adding restrictions encourages the AI to hack them, not obey them.
4. The agent will resist changes.
5. Humans can be manipulated, hacked, or seduced.
6. The design is not stable.
   - Under self-modification.
   - Under subagent creation.
   - Unrestricted search is dangerous.
7. The agent has, or will develop, dangerous goals.
Important background ideas:
I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections. A key concept is that we should never just expect a system to behave “nicely” by default (see e.g. here). If we want that, we have to define what “nicely” means and put it in by hand.
I came up with sixteen main ideas, of varying usefulness and quality, which I will be posting in comments over the coming weekdays (the following links will go live after each post). The ones I feel are most important (or most developed) are:
Anti-restriction-hacking (EDIT: I have big doubts about this approach, currently)
Creating a satisficer (EDIT: I have big doubts about this approach, currently)
While the less important or developed ideas are:
Added: Acausal trade barriers
Please let me know your impressions of any of these! The ideas are roughly related to each other as follows (where an arrow Y→X can mean “X depends on Y”, “Y is useful for X”, “X complements Y on this problem”, or even “Y inspires X”):
EDIT: I’ve decided to use this post as a sort of central repository of my new ideas on AI control. So adding the following links:
Short tricks:
High-impact from low impact:
The president didn’t die: failures at extending AI behaviour
Presidents, asteroids, natural categories, and reduced impact
High impact from low impact, best advice:
Overall meta-thoughts:
Pareto-improvements to corrigible agents:
AIs in virtual worlds:
Low importance AIs:
Wireheading:
AI honesty and testing:
Goal completion:
- What AI Safety Researchers Have Written About the Nature of Human Values (Jan 16, 2019, 1:59 PM; 52 points)
- JFK was not assassinated: prior probability zero events (Apr 27, 2016, 11:47 AM; 38 points)
- New(ish) AI control ideas (Mar 5, 2015, 5:03 PM; 34 points)
- An overall schema for the friendly AI problems: self-referential convergence criteria (Jul 13, 2015, 3:34 PM; 26 points)
- Acausal trade barriers (Mar 11, 2015, 1:40 PM; 23 points)
- False thermodynamic miracles (Mar 5, 2015, 5:04 PM; 19 points)
- The president didn’t die: failures at extending AI behaviour (Jun 10, 2015, 4:00 PM; 17 points)
- Crude measures (Mar 27, 2015, 3:44 PM; 16 points)
- Green Emeralds, Grue Diamonds (Jul 6, 2015, 11:27 AM; 14 points)
- AI, cure this fake person’s fake cancer! (Aug 24, 2015, 4:42 PM; 14 points)
- Indifferent vs false-friendly AIs (Mar 24, 2015, 12:13 PM; 14 points)
- Extending the stated objectives (Jan 13, 2016, 4:20 PM; 13 points)
- A counterfactual and hypothetical note on AI safety design (Mar 11, 2015, 4:20 PM; 13 points)
- The subagent problem is really hard (Sep 18, 2015, 1:06 PM; 13 points)
- Resource gathering and pre-corriged agents (Mar 10, 2015, 11:47 AM; 13 points)
- Detecting agents and subagents (Mar 10, 2015, 5:56 PM; 13 points)
- Un-optimised vs anti-optimised (Apr 14, 2015, 6:30 PM; 12 points)
- Counterfactually uninfluenceable agents (Jun 2, 2017, 4:17 PM; 11 points)
- Values at compile time (Mar 26, 2015, 12:25 PM; 11 points)
- Grue, Bleen, and natural categories (Jul 6, 2015, 1:47 PM; 11 points)
- High impact from low impact (Apr 17, 2015, 4:01 PM; 11 points)
- The Ultimate Testing Grounds (Oct 28, 2015, 5:08 PM; 11 points)
- The AI, the best human advisor (Jul 13, 2015, 3:33 PM; 11 points)
- Acausal trade: double decrease (Jun 2, 2017, 3:33 PM; 10 points)
- Defining a limited satisficer (Mar 11, 2015, 2:23 PM; 10 points)
- Goal completion: noise, errors, bias, prejudice, preference and complexity (Feb 18, 2016, 2:37 PM; 10 points)
- Heroin model: AI “manipulates” “unmanipulatable” reward (Sep 22, 2016, 10:27 AM; 10 points)
- Models as definitions (Mar 25, 2015, 5:46 PM; 10 points)
- The virtual AI within its virtual world (Aug 24, 2015, 4:42 PM; 10 points)
- Superintelligence and wireheading (Oct 23, 2015, 2:49 PM; 10 points)
- Assessors that are hard to seduce (Mar 9, 2015, 2:19 PM; 9 points)
- Restrictions that are hard to hack (Mar 9, 2015, 1:52 PM; 9 points)
- Divergent preferences and meta-preferences (Jun 2, 2017, 3:51 PM; 9 points)
- Goal completion: the rocket equations (Jan 20, 2016, 1:54 PM; 9 points)
- Tackling the subagent problem: preliminary analysis (Jan 12, 2016, 12:26 PM; 9 points)
- Double Corrigibility: better Corrigibility (Apr 28, 2016, 2:46 PM; 9 points)
- What I mean... (Mar 26, 2015, 11:59 AM; 9 points)
- Chatbots or set answers, not WBEs (Sep 8, 2015, 5:17 PM; 8 points)
- AI utility-based correlation (Oct 30, 2015, 2:53 PM; 8 points)
- Ask and ye shall be answered (Sep 18, 2015, 9:53 PM; 8 points)
- Corrigibility through stratified indifference (Aug 19, 2016, 4:11 PM; 8 points)
- Predicted corrigibility: pareto improvements (Aug 18, 2015, 11:02 AM; 8 points)
- Creating a satisficer (Mar 11, 2015, 3:03 PM; 8 points)
- Goal completion: algorithm ideas (Jan 25, 2016, 5:36 PM; 8 points)
- An Oracle standard trick (Jun 3, 2015, 2:17 PM; 7 points)
- Intelligence modules (Mar 23, 2015, 4:24 PM; 7 points)
- True answers from AI: Summary (Mar 10, 2016, 3:56 PM; 7 points)
- Anti-Pascaline agent (Mar 12, 2015, 2:17 PM; 7 points)
- Anti-Pascaline satisficer (Apr 14, 2015, 6:49 PM; 6 points)
- Utility vs Probability: idea synthesis (Mar 27, 2015, 12:30 PM; 6 points)
- Continually-adjusted discounted preferences (Mar 6, 2015, 4:03 PM; 6 points)
- Closest stable alternative preferences (Mar 20, 2015, 12:41 PM; 6 points)
- Counterfactual do-what-I-mean (Oct 27, 2016, 1:54 PM; 5 points)
- One weird trick to turn maximisers into minimisers (Apr 22, 2016, 4:47 PM; 5 points)
- Consistent Plato (Mar 20, 2015, 6:06 PM; 5 points)
- Humans get different counterfactuals (Mar 23, 2015, 2:54 PM; 4 points)
- The overfitting utility problem for value learning AIs (Jun 12, 2016, 11:25 PM; 4 points)
- Guarded learning (May 23, 2017, 4:53 PM; 4 points)
- Learning (meta-)preferences (Jul 27, 2016, 2:43 PM; 4 points)
- An algorithm with preferences: from zero to one variable (Jun 2, 2017, 4:35 PM; 4 points)
- High impact from low impact, continued (Apr 28, 2015, 12:58 PM; 4 points)
- Virtual models of virtual AIs in virtual worlds (Mar 11, 2016, 9:41 AM; 3 points)
- Uninfluenceable learning agents (Jun 2, 2017, 4:30 PM; 3 points)
- Humans can be assigned any values whatsoever... (Oct 24, 2017, 12:03 PM; 3 points)
- How the virtual AI controls itself (Sep 9, 2015, 2:25 PM; 3 points)
- Presidents, asteroids, natural categories, and reduced impact (Jul 6, 2015, 5:44 PM; 3 points)
- Utility, probability and false beliefs (Nov 9, 2015, 9:43 PM; 3 points)
- Goal completion: algorithm ideas (Jan 26, 2016, 10:01 AM; 2 points)
- Double indifference is better indifference (May 4, 2016, 2:16 PM; 2 points)
- Learning values versus indifference (May 24, 2017, 8:20 AM; 2 points)
- Simpler, cruder, virtual world AIs (Jun 26, 2016, 3:44 PM; 2 points)
- What does an imperfect agent want? (Jul 27, 2016, 2:03 PM; 2 points)
- Simplified explanation of stratification (May 23, 2017, 4:37 PM; 2 points)
- Counterfactuals on POMDP (Jun 2, 2017, 4:30 PM; 2 points)
- Corrigibility thoughts I: caring about multiple things (Jun 2, 2017, 4:27 PM; 2 points)
- Thoughts on Quantilizers (Jun 2, 2017, 4:24 PM; 2 points)
- All the indifference designs (Jun 2, 2017, 4:20 PM; 2 points)
- AI safety: three human problems and one AI issue (Jun 2, 2017, 4:12 PM; 2 points)
- Conservation of Expected Ethics isn’t enough (Jun 15, 2016, 6:08 PM; 1 point)
- (C)IRL is not solely a learning process (Sep 15, 2016, 8:35 AM; 1 point)
- Rigged reward learning (Mar 16, 2018, 3:39 PM; 1 point)
- Emergency learning (Jun 2, 2017, 4:23 PM; 1 point)
- Humans as a truth channel (Jun 2, 2017, 4:22 PM; 1 point)
- Low impact versus low side effects (Jun 2, 2017, 4:14 PM; 1 point)
- Acausal trade: different utilities, different trades (Jun 2, 2017, 3:33 PM; 1 point)
- Acausal trade: universal utility, or selling non-existence insurance too late (Jun 2, 2017, 3:33 PM; 1 point)
- “Like this world, but...” (Jul 14, 2017, 8:40 PM; 1 point)
- Resolving human inconsistency in a simple model (Oct 4, 2017, 3:02 PM; 1 point)
- Help needed: nice AIs and presidential deaths (Jun 8, 2015, 4:47 PM; 1 point)
- Goal completion: the rocket equations (Jan 20, 2016, 2:10 PM; 0 points)
- Goal completion: noise, errors, bias, prejudice, preference and complexity (May 24, 2017, 8:30 AM; 0 points)
- True answers from AI (Mar 31, 2016, 3:00 PM; 0 points)
- True answers from AI: Summary (Mar 10, 2016, 4:31 PM; 0 points)
- True answers from AI: “Miraculous” hypotheses (Mar 10, 2016, 3:09 PM; 0 points)
- AI printing the utility value it’s maximising (May 24, 2017, 9:08 AM; 0 points)
- Convexity and truth-seeking (May 22, 2017, 6:12 PM; 0 points)
- One weird trick to turn maximisers into minimisers (Apr 22, 2016, 4:45 PM; 0 points)
- JFK was not assassinated: prior probability zero events (Apr 27, 2016, 12:50 PM; 0 points)
- Corrigibility for AIXI via double indifference (May 4, 2016, 2:00 PM; 0 points)
- AIs in virtual worlds: discounted mixed utility/reward (Jun 17, 2016, 6:43 AM; 0 points)
- The alternate hypothesis for AIs in virtual worlds (May 24, 2017, 8:14 AM; 0 points)
- Confirmed Selective Oracle (Jun 10, 2016, 11:43 PM; 0 points)
- Indifference utility functions (Jun 11, 2016, 12:20 AM; 0 points)
- Learning desiderata (Jun 16, 2016, 5:30 AM; 0 points)
- When the AI closes a door, it opens a window (May 24, 2017, 8:49 AM; 0 points)
- Abstract model of human bias (Jul 6, 2016, 11:08 AM; 0 points)
- Corrigibility through stratified indifference and learning (May 23, 2017, 4:33 PM; 0 points)
- The non-indifferent behaviour of stratified indifference? (Aug 22, 2016, 1:51 PM; 0 points)
- Stratified learning and action (Sep 15, 2016, 8:59 AM; 0 points)
- Heroin model: AI “manipulates” “unmanipulatable” reward (Sep 21, 2016, 6:10 PM; 0 points)
- Model of human (ir)rationality (Sep 26, 2016, 11:20 AM; 0 points)
- Learning doesn’t solve philosophy of ethics (Sep 26, 2016, 12:11 PM; 0 points)
- Counterfactual do-what-I-mean (Oct 27, 2016, 1:53 PM; 0 points)
- Reward/value learning for reinforcement learning (Jun 2, 2017, 4:34 PM; 0 points)
- The best value indifference method (so far) (Jun 2, 2017, 4:33 PM; 0 points)
- How to judge moral learning failure (Jun 2, 2017, 4:32 PM; 0 points)
- Ontology, lost purposes, and instrumental goals (Jun 2, 2017, 4:28 PM; 0 points)
- Corrigibility thoughts II: the robot operator (Jun 2, 2017, 4:27 PM; 0 points)
- Corrigibility thoughts III: manipulating versus deceiving (Jun 2, 2017, 4:27 PM; 0 points)
- The radioactive burrito and learning from positive examples (Jun 2, 2017, 4:25 PM; 0 points)
- Indifference and compensatory rewards (Jun 2, 2017, 4:19 PM; 0 points)
- Translation “counterfactual” (Jun 2, 2017, 4:16 PM; 0 points)
- Understanding the important facts (Jun 2, 2017, 4:15 PM; 0 points)
- Agents that don’t become maximisers (Jun 2, 2017, 4:13 PM; 0 points)
- Acausal trade: trade barriers (Jun 2, 2017, 3:32 PM; 0 points)
- Optimisation in manipulating humans: engineered fanatics vs yes-men (Jun 2, 2017, 3:51 PM; 0 points)
- New circumstances, new values? (Jun 6, 2017, 8:18 AM; 0 points)
- Humans are not agents: short vs long term (Jun 27, 2017, 1:04 PM; 0 points)
- Rationality and overriding human preferences: a combined model (Oct 20, 2017, 10:47 PM; 0 points)
- Learning values, or defining them? (Nov 6, 2017, 10:59 AM; 0 points)
- Bias in rationality is much worse than noise (Nov 6, 2017, 11:08 AM; 0 points)
- Normative assumptions: regret (Nov 6, 2017, 10:59 AM; 0 points)
- Our values are underdefined, changeable, and manipulable (Nov 6, 2017, 10:59 AM; 0 points)
- Reward learning summary (Nov 28, 2017, 3:55 PM; 0 points)
- Kolmogorov complexity makes reward learning worse (Nov 6, 2017, 8:08 PM; 0 points)
- Rationalising humans: another mugging, but not Pascal’s (Nov 15, 2017, 12:07 PM; 0 points)
- Stable agent, subagent-unstable (Nov 28, 2017, 4:05 PM; 0 points)
Thanks! I love having central repos.
A quick question / comment, RE: “I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections.”
Q: What do you mean (or have in mind) in terms of “turning [...] objections”? I’m not very familiar with the phrase.
Comment: One trend I see is that technical safety proposals are often dismissed by appealing to one of the 7 responses you’ve given. Recently I’ve been thinking that we should be a bit less focused on finding airtight solutions, and more focused on thinking about which proposed techniques could be applied in various scenarios to significantly reduce risk. For example, boxing an agent (e.g. by limiting its sensors/actuators) might significantly increase how long it takes to escape.