The idea was supposed to be more optimized for trying to solve alignment than for being easy to code. My current (vague—this is the half-baked thread after all) mental model involves
a) training a neural net to understand the concepts necessary to make sense of the intended target it should be aiming at (note: it doesn’t necessarily have to understand the full details at first, just the overall concept, which it can then refine)
b) using some kind of legibility tool to identify how to “point at” the concepts in the neural net
c) implementing the actual planning and decision making using conventional (non-nn) software that reads and activates the concepts in the neural net in some way
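Concretely, here is a minimal sketch of how these pieces might fit together. Every name and component here is a purely illustrative stand-in (a fixed random projection in place of a trained net, a single linear probe in place of a real legibility tool), not a claim about how the real thing would be built:

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) Stand-in for a trained neural net: maps plan features to a hidden
#     representation. Here it is just a fixed random projection; in the
#     real proposal this would be a trained network.
W_hidden = rng.normal(size=(32, 8))

def encode(plan_features: np.ndarray) -> np.ndarray:
    """Hypothetical concept encoder: plan features -> hidden activations."""
    return np.tanh(W_hidden @ plan_features)

# (b) Stand-in for the legibility tool: a linear probe direction that
#     "points at" the target concept inside the hidden representation.
concept_direction = rng.normal(size=32)
concept_direction /= np.linalg.norm(concept_direction)

def concept_score(plan_features: np.ndarray) -> float:
    """How strongly the target concept is activated by this plan."""
    return float(concept_direction @ encode(plan_features))

# (c) Conventional (non-nn) planner: generate candidate plans and pick
#     the one that activates the concept most strongly.
def plan(n_candidates: int = 100) -> np.ndarray:
    candidates = rng.normal(size=(n_candidates, 8))
    scores = [concept_score(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

best_plan = plan()
print("chosen plan score:", concept_score(best_plan))
```

The argmax in step (c) is exactly where the failure modes below bite: selecting hard on a noisy or flawed score optimizes for the score’s errors as much as for the concept itself.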
However, in writing this comment reply I realized that any approach along these lines (not just the naive one I had been thinking of, which was something like generating plans and evaluating them according to how well they match the goal implemented by the non-nn software’s connections to the neural net) would
a) be prone to wishful thinking, since only the plans it rates best are relevant, and the best-rated plans tend to be the ones where the evaluation was overoptimistic; extreme optimization pressure on plans could produce correspondingly extreme bias, and that bias shows up in every input and intermediate step of the plan-evaluation calculation, not just at the final step (a toy simulation of this selection effect follows the list), and
b) in the same vein but more worryingly, be potentially vulnerable to the plan generator producing superstimulus-type examples which score highly under the AI’s flawed encoding of the concepts while not being what humans would actually want. This seems likely to be inevitable for any neural net, and maybe for anything that extracts concepts from complex inputs.
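To make the selection effect in (a) concrete, here is a toy model (purely illustrative, with made-up numbers): each plan has a true value, the evaluator sees that value plus noise, and picking the highest-scoring plan systematically overestimates how good the chosen plan really is, with the gap growing as more candidates are searched:

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_bias(n_candidates: int, noise_std: float = 1.0, trials: int = 2000) -> float:
    """Average (estimated - true) value of the plan chosen by argmax over
    noisy evaluations: the 'wishful thinking' gap."""
    true_values = rng.normal(size=(trials, n_candidates))
    estimates = true_values + rng.normal(scale=noise_std, size=(trials, n_candidates))
    chosen = np.argmax(estimates, axis=1)      # plan selected in each trial
    rows = np.arange(trials)
    return float(np.mean(estimates[rows, chosen] - true_values[rows, chosen]))

for n in (1, 10, 100, 1000):
    print(f"{n:>5} candidates: optimism bias ~ {selection_bias(n):.2f}")
```

The same dynamic drives (b): the harder the planner optimizes against a flawed concept score, the more the “best” plan is selected for exploiting the flaws rather than for the intended target.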
No full solutions to these problems as of yet, though if I may be permitted to fall prey to problem (a) myself, maybe standard robustness approaches could help against (b).
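As one example of what such a robustness approach might look like (again purely a sketch, with randomly perturbed probes standing in for independently trained concept extractors): score plans by the worst case over an ensemble of probes, so that a superstimulus has to fool all of them at once rather than just one:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a "true" concept direction and several imperfect
# probes for it, each perturbed differently.
true_direction = rng.normal(size=32)
true_direction /= np.linalg.norm(true_direction)
probes = [true_direction + 0.3 * rng.normal(size=32) for _ in range(5)]

def single_probe_score(h: np.ndarray) -> float:
    """Score using only the first (flawed) probe."""
    return float(probes[0] @ h)

def pessimistic_score(h: np.ndarray) -> float:
    """Worst-case score over the ensemble: a plan rates highly only if
    every probe agrees it matches the concept."""
    return min(float(p @ h) for p in probes)

# A "superstimulus" aimed at the first probe alone: the unit vector that
# maximizes that single flawed score.
superstimulus = probes[0] / np.linalg.norm(probes[0])

print("superstimulus, single probe :", single_probe_score(superstimulus))
print("superstimulus, ensemble min :", pessimistic_score(superstimulus))
print("true concept,  ensemble min :", pessimistic_score(true_direction))
```

The intended effect is that an input tailored to one probe’s idiosyncrasies should not survive the worst-case aggregation, while the genuine concept should; how well that holds up against a smarter optimizer is exactly the open question.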
Note my translation devolves to “identify privileged actions that are generally safe, specific to the task” and “don’t do things that have uncertain outcome”. Both these terms are easily translated to code.
Neither of those things sounds “easily translated to code” to me. What does “safe” mean? What does “specific to the task” mean? How do you classify outcomes as being “uncertain” or not?