Here’s my attempt. I haven’t read any of the other comments or the tag yet. I probably spent ~60-90m total on this, spread across a few days.
On kill switches
Be low-impact somehow, though I don’t know how to operationalize that
Go slowly enough that people can see what you’re doing
Have a bunch of “safewords” and other kill switches installed in different places, some hopefully hard for the AI to reach. Test them regularly, and consider it a deadly flaw if one stops working.
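To make the “test them regularly” part concrete, here’s a minimal Python sketch. The `KillSwitch` class and its `test()` hook are made up for illustration; the point is just the loop of auditing every switch and treating any failure as fatal.

```python
# Minimal sketch of a kill-switch self-test (hypothetical interface).
# Assumes each switch exposes a test() callable that tries to trip it in a
# sandboxed / dry-run mode and reports whether it still works.

from dataclasses import dataclass
from typing import Callable

@dataclass
class KillSwitch:
    name: str
    test: Callable[[], bool]  # dry-run trigger; True = switch still works

def audit_kill_switches(switches: list[KillSwitch]) -> list[str]:
    """Return the names of any switches that failed their test.

    A non-empty result should be treated as a deadly flaw: halt and
    escalate to humans rather than continuing to operate.
    """
    return [s.name for s in switches if not s.test()]

# Example usage with dummy switches:
switches = [
    KillSwitch("network-cutoff", lambda: True),
    KillSwitch("power-relay", lambda: False),   # this one has silently broken
]
failed = audit_kill_switches(switches)
if failed:
    print(f"DEADLY FLAW: kill switches failed self-test: {failed}")
```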
On the AI accurately knowing what it is doing, and pointing at things in the real world
Watch all the metrics (!)
Predict all the metrics you watch, and ask humans about any metric that comes out anomalous (a toy sketch of this loop follows this list)
Group inputs and outputs separately and treat inputs as sacred, perhaps by having an epistemic module that is incentivized only to produce true predictions about the world. Make the epistemic module the one that gets all the juice; the planning module should not have any way to influence it.
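Here’s a toy version of “predict every metric you watch, and flag anomalies for humans.” The predictor is just a running mean/std per metric (Welford’s method), which is obviously not what a real system would use; the shape of the observe-predict-flag loop is the point.

```python
# Toy sketch: predict every watched metric, surface anomalies to a human.

import math
from collections import defaultdict

class MetricWatcher:
    def __init__(self, z_threshold: float = 4.0):
        self.z_threshold = z_threshold
        self.n = defaultdict(int)
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)   # sum of squared deviations (Welford)

    def observe(self, name: str, value: float) -> bool:
        """Update the predictor; return True if the value looks anomalous
        and should be surfaced to a human before acting further."""
        n, mean, m2 = self.n[name], self.mean[name], self.m2[name]
        anomalous = False
        if n >= 10:  # only judge once we have some history
            std = math.sqrt(m2 / (n - 1))
            if abs(value - mean) > self.z_threshold * max(std, 1e-9):
                anomalous = True
        # Welford's online update of mean and variance
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        self.n[name], self.mean[name], self.m2[name] = n, mean, m2
        return anomalous

watcher = MetricWatcher()
for t in range(100):
    watcher.observe("requests_per_second", 50.0 + (t % 3))  # normal variation
if watcher.observe("requests_per_second", 500.0):
    print("ask a human: requests_per_second looks anomalous")
```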
On responding predictably
Require inputs of some kind in order to produce more outputs (something about control theory?). Like power steering or an ebike: it helps the human by amplifying the motions they make, but it doesn’t actually Do Anything on its own.
Have metrics look smooth on an input/output response curve. No sharp edges. Let the humans be the ones to turn the knobs. (A rough check for sharp edges is sketched after this list.)
Try things. Try things on a small scale before going big. (How to define “big”? Maybe look at the metrics!)
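One crude way to check the “no sharp edges” idea: sweep the input knob, estimate the local gain by finite differences, and flag any point where the gain jumps above a bound. The `response` functions below are invented examples; a real check would also have to worry about noise and multiple inputs.

```python
# Sketch of a "no sharp edges" check on an input -> output response curve.

def sharp_edges(response, lo, hi, steps=1000, max_gain=10.0):
    """Return input values where the finite-difference slope of `response`
    exceeds `max_gain`. `response` is any callable float -> float."""
    dx = (hi - lo) / steps
    edges = []
    prev = response(lo)
    for i in range(1, steps + 1):
        x = lo + i * dx
        cur = response(x)
        if abs(cur - prev) / dx > max_gain:
            edges.append(x)
        prev = cur
    return edges

# A smooth amplifier-style response vs. one with a hidden cliff:
smooth = lambda x: 3.0 * x
cliffy = lambda x: 3.0 * x + (100.0 if x > 0.7 else 0.0)

print(sharp_edges(smooth, 0.0, 1.0))   # [] -- no sharp edges
print(sharp_edges(cliffy, 0.0, 1.0))   # flags the step near x = 0.7
```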
On epistemology and ontology
Ask for feedback both about things you tried and about experiments you want to run
Make the epistemic module have a stable ontology. If it wants to change the ontology, it needs to create a new epistemic module and leave the old one running. The new module has to be able not only to predict physical things about the world, but also to predict what the old module would say about the things pointed to and tagged as important in the old module’s ontology.
Have a concept of goal metrics and side-effect metrics. Try to predict what the human will say about which metrics are goals and which are side effects. (Hmm, this is incoherent, because you can come up with arbitrarily many irrelevant metrics… what even is a relevant metric? Maybe the system should group metrics using some kind of natural unsupervised clustering, then try to predict the prioritization of metric groups, and explain each group by giving samples of the actual thing it’s measuring. Still difficult to make sense of this. A rough sketch of the grouping idea is below.)
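A very rough sketch of “cluster the metrics, then ask the human about groups rather than individual metrics.” Here the “clustering” is just greedy grouping of metrics whose time series correlate strongly; the metric names and data are invented.

```python
# Group correlated metrics so the human labels groups, not raw metrics.

import statistics

def correlation(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / ((vx * vy) ** 0.5) if vx and vy else 0.0

def group_metrics(series: dict[str, list[float]], threshold=0.9):
    """Greedy single-link grouping: a metric joins the first existing group
    containing something it correlates with above `threshold`."""
    groups: list[list[str]] = []
    for name, xs in series.items():
        for group in groups:
            if any(abs(correlation(xs, series[m])) >= threshold for m in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

series = {
    "tasks_done":   [1, 2, 3, 4, 5, 6],
    "tasks_queued": [6, 5, 4, 3, 2, 1],   # anti-correlated with tasks_done
    "power_draw":   [3, 1, 4, 1, 5, 9],   # unrelated
}
for group in group_metrics(series):
    print("Ask the human to prioritize this group:", group)
```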
On planning
Bound or penalize plans in complexity and time.
“Complexity” is hard to define here, and I don’t think it’s Kolmogorov complexity.
Harshly penalize plans which lean on predicting what other agents will do. Ban plans which depend on predicting what your operator will do. (The AI can still predict what the operator will think of the plan, but that prediction should not feed into the plan itself.) A toy scoring rule along these lines is sketched after this list.
Separate planning from the other parts of AI cognition, such that you are never confused about whether you are making a plan, predicting things about the world, or executing a plan.
Don’t execute plans at all
Train the AI to identify the purpose(s) of each step in the plan. Give negative feedback when steps lead to outcomes that were not identified ahead of time.
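A toy plan scorer illustrating the bounds and penalties above. The step tags `predicts_agent` / `predicts_operator` are hypothetical labels the planner would have to supply; only the shape of the scoring rule is meant seriously.

```python
# Toy plan scorer: bound complexity and time, penalize agent-prediction,
# ban operator-prediction.

from dataclasses import dataclass

@dataclass
class Step:
    description: str
    duration: float                  # estimated time for this step
    predicts_agent: bool = False     # leans on predicting some other agent
    predicts_operator: bool = False  # leans on predicting the operator

def score_plan(steps: list[Step], max_steps=20, max_time=100.0,
               agent_penalty=50.0) -> float | None:
    """Lower is better. Returns None if the plan is banned outright."""
    if any(s.predicts_operator for s in steps):
        return None                      # banned: depends on operator prediction
    if len(steps) > max_steps:
        return None                      # over the complexity bound
    total_time = sum(s.duration for s in steps)
    if total_time > max_time:
        return None                      # over the time bound
    score = len(steps) + total_time      # mild complexity + time penalty
    score += agent_penalty * sum(s.predicts_agent for s in steps)
    return score

plan = [
    Step("measure the thing", 2.0),
    Step("nudge the knob a little", 1.0),
    Step("wait for the market to move", 5.0, predicts_agent=True),
]
print(score_plan(plan))   # finite, but heavily penalized for the agent-prediction step
```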
On transparency
A part of the AI should be trained to explicate what the rest of the AI is “doing”/focused on at all times. Find a way to make this into a feedback cycle.
The AI should be arranged so that it is mostly doing only one thing at a time, and so that it doesn’t switch its focus too many times per minute.