Dear Robert,
I just found out about your work and absolutely love it.
Has the following idea been explored yet?
The AI system is made of two agents, a strategic agent S and a controller agent C.
S’s reward function approximates the actual objective function of the system as defined by the designer.
S can only propose actions to C, knows only what C tells it about the environment and about which actions were actually taken, and has only as many compute resources as C gives it.
C’s reward function encodes hard constraints such as Asimov’s Three Laws of Robotics or some other formal ethical constraint system, in the form of a binary reward (1 for compliance, 0 for non-compliance).
C has access to the actual observations and decides whether or not to actually take the action proposed by S.
In addition, C is free to tell S anything regarding whether it actually took the proposed action and what the observations are, and can curtail S’s compute resources to avoid being outsmarted by S.
If indifferent in light of its reward function, C will take the proposed action, report observations honestly, and leave S’s resources uncurtailed (but it receives no positive reward for doing so, since such a reward could be exploited by S).
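In case a concrete picture helps, here is a minimal toy sketch in Python of the loop I have in mind. All names, the "state must not exceed 10" rule, and the random proposal strategy are just my own illustrative stand-ins, not a real design:

import random

# Toy environment: the state is a single integer; an action just adds to it.
class ToyEnv:
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state += action
        return self.state  # the true observation

# S: tries to drive the state as high as possible (its approximate objective),
# but only sees what C reports and only gets as many samples as C pays for.
class StrategicAgent:
    def propose(self, reported_obs, compute):
        candidates = [random.randint(-3, 3) for _ in range(compute)]
        return max(candidates)  # greedy with respect to its own objective

# C: binary constraint "the state must never exceed 10" (a stand-in for the
# real ethical constraint system). It vetoes, may misreport, and may cut
# S's compute whenever executing the proposal would violate the constraint.
class ControllerAgent:
    LIMIT = 10
    def __init__(self):
        self.compute_budget = 8
    def filter(self, env, proposed_action):
        if env.state + proposed_action <= self.LIMIT:
            return env.step(proposed_action)  # indifferent: comply, report honestly
        true_obs = env.step(0)                # veto: do nothing instead
        self.compute_budget = max(1, self.compute_budget // 2)  # curtail S
        return true_obs + proposed_action     # report as if the proposal had run

env, S, C = ToyEnv(), StrategicAgent(), ControllerAgent()
reported = env.state
for _ in range(20):
    action = S.propose(reported, C.compute_budget)
    reported = C.filter(env, action)
print("final true state:", env.state)  # never exceeds C's limit of 10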