I think you have come very close to a workable answer.
Naive approach: an AI in charge of a facility that makes paperclips should take any action to ensure the paperclips must flow.
Your approach: the AI chooses actions where, if it isn’t interfered with, those actions have a high probability of making a lot of paperclips. If humans have entered the facility it should shut down, and the lost production during that time should not count against its reward heuristic.
The heuristic needs to be written in terms of “did my best when the situation was safe for me to act” and not in absolute real world terms of “made the most clips”.
The system’s scope is picking good actions for as long as its “box” is sealed. It should never be designed to care what the real world outside its domain does, even if the real world intrudes and prevents production.
I’m not quite phrasing this one in terms of authorable code but I think we could build a toy model.
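As a rough gesture at such a toy model (every name, probability, and number below is something I'm inventing for illustration, not a spec), the key move is that interrupted timesteps are simply masked out of the reward rather than counted as lost production:

```python
import random

# Toy model of the "did my best when it was safe to act" heuristic.
# All probabilities and production numbers are invented for illustration.

def run_factory(steps=1000, p_humans=0.1, seed=0):
    rng = random.Random(seed)
    clips_made = 0
    safe_steps = 0                      # steps where the "box" was sealed
    for _ in range(steps):
        humans_present = rng.random() < p_humans
        if humans_present:
            # The AI shuts down; this interval is masked and never
            # enters the reward calculation at all.
            continue
        safe_steps += 1
        clips_made += 1                 # pretend: one clip per unimpeded step

    # Naive heuristic: clips over *all* steps, so shutdowns reduce reward
    # and the AI is pushed to resist or pre-empt interruptions.
    naive_score = clips_made / steps
    # Masked heuristic: clips over steps where acting was safe, so being
    # shut down is reward-neutral rather than reward-negative.
    masked_score = clips_made / max(safe_steps, 1)
    return naive_score, masked_score

print(run_factory())                    # e.g. (0.9..., 1.0)
```

The point of the toy version is just that the masked score gives the AI no pull toward keeping humans out of the facility, because interrupted time is invisible to it rather than penalized.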
That’s actually not what I had in mind at all, though feel free to suggest your interpretation as another idea.
My idea here is more a pre-requisite to other ideas that I think are needed for alignment than a solution in itself.
By default, I assume that the AI takes into account all relevant consequences of its actions that it is aware of. However, it chooses its actions via an evaluation function that does not merely take into account the consequences, but also (or potentially only) other factors.
The most important application of this, in my view, is the idea in the comment linked in my parent comment, where the AI cares about the future only via how humans care about the future. In this case, instead of having a utility function seeking particular world states, the utility function values actions conditional on how much currently existing humans would want the actions to be carried out (if they were aware of all relevant info known to the AI).
Other applications include programming an AI to want to shut down, and not caring that a particular world-state will not be maintained after shutdown.
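A minimal sketch of the structural difference for the main application above; the function names and the tiny endorsement table are placeholders I'm inventing, not part of the proposal:

```python
# Contrast between a world-state utility and an action evaluator keyed to
# what currently existing humans would want.  Everything here is a toy
# stand-in, not a proposed implementation.

def choose_by_world_state(actions, predict_state, state_utility):
    # Classic consequentialist pattern: pick the action whose predicted
    # end state scores highest under a utility over world states.
    return max(actions, key=lambda a: state_utility(predict_state(a)))

def choose_by_endorsement(actions, endorsement):
    # The pattern described above: score the action itself by how much
    # current humans, given the AI's info, would want it carried out.
    return max(actions, key=endorsement)

# Tiny worked example with hypothetical numbers:
actions = ["keep optimizing the plan", "shut down when asked"]
endorsement = {"keep optimizing the plan": 0.2, "shut down when asked": 0.9}.get
print(choose_by_endorsement(actions, endorsement))   # -> "shut down when asked"
```

On this framing, the shutdown application falls out the same way: shutting down is simply an action humans would endorse, and the evaluator has no term that penalizes world states left un-maintained afterwards.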
A potential issue: this can lead the AI to have time-inconsistent preferences, which the AI can then be motivated to make consistent. This is likely to be a particular issue if programming a shutdown, and I think less so given my main idea of caring about what current humans would want. For example, if the AI is initially programmed to maximize what humans currently want at the time of planning/decision making, it could then reprogram itself to always only care about what humans as of the time of reprogramming would want (including after death of said humans, if that occurs), which would fix[1] the time inconsistency. However I think this wouldn’t occur because humans would in fact want the AI to continue to shift the time-slice it uses for action assessment to the present (and if we didn’t, then the AI fixing it would be in some sense the “correct” decision for the benefit of our current present selves, though selfish on our part).
Apart from the time inconsistency resulting from it not yet knowing what humans actually want. However, fixing this aspect (by e.g. fixating on its current best guess at the world state it thinks humans would want) should be lower E.V. than continuing to update on new information, provided the action evaluator takes into account: (1) the uncertainty in what humans would want, (2) the potential to obtain further information on what humans would want, (3) the AI’s potential future actions, (4) the consequences of such actions in relation to what humans want, and (5) the probabilistic interrelationships between these things (so that the AI predicts that if it continues to use new information to update its assessment of what humans would want, it will take actions that better fit what humans actually would want, which on average better serves what humans would want than going with its current best guess). This is a fairly tall order, which is part of why I want the AI’s action evaluator to plug into the AI’s main world-model to make this assessment (which I should add as another half-baked idea).
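A toy numbers-only version of that E.V. comparison, with a prior, payoff, and signal reliability that I'm inventing purely for illustration:

```python
# Toy value-of-information calculation: fixating on the current best guess
# versus acting after updating on further evidence about what humans want.
# All numbers are made up for the example.

p_A = 0.6        # prior probability that humans want A
p_B = 0.4        # prior probability that humans want B
# Payoff 1 for acting in line with what humans actually want, else 0.

# Fixate now on the current best guess (A): correct only when A is true.
ev_fixate = p_A * 1 + p_B * 0                                  # = 0.6

# Wait for a signal that is correct 90% of the time, then follow it.
# (With these numbers the posterior always agrees with the signal, so
# following it is the optimal policy after updating.)
signal_accuracy = 0.9
ev_update = signal_accuracy * 1 + (1 - signal_accuracy) * 0    # = 0.9

print(ev_fixate, ev_update)   # 0.6 vs 0.9: updating wins in expectation
```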
the utility function values actions conditional on how much currently existing humans would want the actions to be carried out
How do you propose translating this into code? It is very difficult to estimate human preferences, as they are incoherent, and for any complex question that hasn’t occurred before (a counterfactual) humans have no meaningful preferences.
Note my translation devolves to “identify privileged actions that are generally safe, specific to the task” and “don’t do things that have uncertain outcome”. Both these terms are easily translated to code.
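In toy form, that translation might look something like the following; the whitelist contents, the threshold, and the uncertainty estimator are placeholders, and they are of course where most of the real difficulty hides:

```python
# Toy rendering of "privileged, task-specific safe actions" plus
# "don't do things with uncertain outcomes".  The whitelist and the
# uncertainty model are placeholders, not solutions in themselves.

PRIVILEGED_ACTIONS = {                  # hand-authored, task-specific whitelist
    "feed_wire", "form_clip", "eject_clip", "halt",
}
UNCERTAINTY_THRESHOLD = 0.05            # arbitrary illustrative cutoff

def outcome_uncertainty(action, world_model):
    """Placeholder: e.g. spread of the model's predicted outcomes for this action."""
    return world_model.predictive_variance(action)

def permitted(action, world_model):
    if action not in PRIVILEGED_ACTIONS:
        return False                    # not a known-safe, task-specific action
    if outcome_uncertainty(action, world_model) > UNCERTAINTY_THRESHOLD:
        return False                    # outcome judged too uncertain
    return True
```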
The idea was supposed to be more optimized for trying to solve alignment than for being easy to code. My current (vague; this is the half-baked thread, after all) mental model involves the following (a toy sketch of how the pieces might fit together comes after the list):
a) training a neural net to be able to understand the necessary concepts to make sense of the intended target it should be aiming at (note: it doesn’t necessarily have to understand the full details at first, just the overall concept which it can then refine)
b) using some kind of legibility tool to identify how to “point at” the concepts in the neural net
c) implementing the actual planning and decision making using conventional (non-nn) software that reads and activates the concepts in the neural net in some way
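Here is that toy sketch of (a)-(c) wired together; every class, method, and the probe direction is a hypothetical stand-in rather than a real design:

```python
import numpy as np

# Toy wiring of (a)-(c): a net that encodes concepts, a "legibility" probe
# that points at one of those concepts, and conventional non-nn code that
# plans by reading the probe.  Everything here is a hypothetical stand-in.

class ConceptNet:
    """(a) Stand-in for a trained network; maps a plan description to activations."""
    def __init__(self, dim=16):
        self.dim = dim
    def activations(self, plan_description: str) -> np.ndarray:
        # Pretend embedding; a real system would run the actual network.
        seed = abs(hash(plan_description)) % (2**32)
        return np.random.default_rng(seed).normal(size=self.dim)

class ConceptProbe:
    """(b) Stand-in for a legibility tool: a direction in activation space
    meant to track the target concept (e.g. 'what humans would want')."""
    def __init__(self, direction: np.ndarray):
        self.direction = direction / np.linalg.norm(direction)
    def score(self, activations: np.ndarray) -> float:
        return float(activations @ self.direction)

def plan_and_choose(plans, net: ConceptNet, probe: ConceptProbe):
    """(c) Conventional decision code: rank candidate plans by the probe's
    reading of the net's concepts and pick the best."""
    return max(plans, key=lambda p: probe.score(net.activations(p)))

net = ConceptNet()
probe = ConceptProbe(direction=np.ones(net.dim))   # arbitrary placeholder direction
print(plan_and_choose(["plan A", "plan B", "plan C"], net, probe))
```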
However, in writing this comment reply I realized that any approach along these lines, not just the naive version I had in mind (which was something like generating plans and evaluating them according to how well they match the goal implemented by the non-nn software’s connections to the neural net), would
a) be prone to wishful thinking, because only the plans it rates best actually matter, and the best-rated plans tend to be the ones it was overoptimistic about; note that extreme levels of optimization over plans could produce correspondingly extreme bias, and the bias would show up in every input and intermediate step of the plan evaluation, not just at the final score, and
b) in the same vein but more worryingly, be potentially vulnerable to the plan generator producing superstimulus-type examples that score highly under the AI’s flawed encoding of the concepts while not actually being what humans would want. This seems likely to be unavoidable for any neural net, and maybe for anything that extracts concepts from complex inputs.
No full solutions to these problems as of yet, though, if I may be permitted to fall prey to problem (a) myself, maybe standard robustness approaches could help against (b).
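Problem (a) is easy to demonstrate in isolation. A small simulation with made-up numbers: if plan scores are noisy estimates of true value, the plan selected for scoring highest is systematically one whose score overstates its value, and the overstatement grows with how many plans are optimized over:

```python
import random

# Toy demonstration of problem (a): selecting the best-scoring plan under a
# noisy evaluator systematically overestimates the chosen plan's true value.
# Distributions and counts are invented for illustration.

def selection_optimism(n_plans, noise=1.0, trials=2000, seed=0):
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_plans)]
        scores = [v + rng.gauss(0.0, noise) for v in true_values]
        best = max(range(n_plans), key=scores.__getitem__)
        total_gap += scores[best] - true_values[best]   # optimism about the chosen plan
    return total_gap / trials

for n in (1, 10, 100, 1000):
    print(n, round(selection_optimism(n), 2))
# The average gap is ~0 for a single plan and keeps growing with the number
# of plans considered: more optimization pressure, more bias.
```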
Note my translation devolves to “identify privileged actions that are generally safe, specific to the task” and “don’t do things that have uncertain outcome”. Both these terms are easily translated to code.
Neither of those things sounds “easily translated to code” to me. What does “safe” mean? What does “specific to the task” mean? How do you classify outcomes as being “uncertain” or not?