I have a question for the folks who think AGI alignment is achievable in the near term in small steps or by limiting AGI behavior to make it safe. How hard will it be to achieve alignment for simple organisms as a proof of concept for human value alignment? How hard would it be to put effective limits or guardrails on the resulting AGI if we let the organisms interact directly with the AGI while still preserving their values? Imagine a setup where interactions by the organism must be interpreted as requests for food, shelter, entertainment, uplift, etc., and where not responding at all is also a failure of alignment, because the tool is useless to the organism.
Consider a planaria with relatively simple behaviors and well-known neural structure. What protocols or tests can be used to demonstrate that an AGI makes decisions aligned with planaria values?
Do we need to go simpler and achieve proof-of-concept alignment with virtual life? Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game? This isn't a straw man; a calculus for values has to handle the edge cases too. There may be a very simple answer of moral indifference in the case of gliders, but I want to be shown why the reasoning is coherent when the same calculus is applied to other organisms.
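To be concrete about what I mean by a glider's values being "respected and fulfilled," here is a minimal sketch in Python, assuming the crude stand-in value "the glider keeps gliding." It only checks the outcome on an otherwise empty board; it says nothing about how to infer that value or how to search for aligned starting positions, and all names here are illustrative.

```python
# Minimal sketch: simulate Conway's Game of Life on an unbounded grid (live cells
# stored as a set of (row, col) pairs) and check one crude "glider value": the
# pattern keeps propagating, i.e. every 4 generations the live cells are a pure
# translation of the original glider.
from collections import Counter

def step(live):
    """One Game of Life generation for a set of live (row, col) cells."""
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is live next generation if it has 3 live neighbors,
    # or 2 live neighbors and is already live.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}

def normalize(cells):
    """Translate a pattern so its bounding box starts at (0, 0)."""
    r0 = min(r for r, _ in cells)
    c0 = min(c for _, c in cells)
    return frozenset((r - r0, c - c0) for r, c in cells)

def glider_intact(start, generations=40):
    """True if, every 4 steps, the live cells are still a pure translation of `start`."""
    reference = normalize(start)
    live = set(start)
    for gen in range(1, generations + 1):
        live = step(live)
        if gen % 4 == 0 and normalize(live) != reference:
            return False  # the glider was disrupted
    return True

# The standard glider; on an empty unbounded grid it drifts one cell diagonally every 4 steps.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}
print(glider_intact(glider))  # True
```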
As an important aside, will these procedures essentially reverse-engineer values by subjecting organisms to every possible input, observing how they respond, and trying to interpret those responses, or is there truly a calculus of values we expect to discover that correctly infers values from the nature of organisms without using or simulating torture?
I have no concrete idea how to accomplish the preceding things and don’t expect that anyone else does either. Maybe I’ll be pleasantly surprised.
Barring this kind of fundamental accomplishment for alignment, I think it's foolhardy to assume ML procedures will be found that convert human values into AGI optimization goals. We can't ask planaria or gliders what they value, so we will have to reason it out from first principles, and an AGI will have to do the same for us, with very limited help from us, if we can't even manage alignment for planaria. Claiming that planaria or gliders don't have values, or that they are not complex enough to effectively communicate their values, is a cop-out either way. From the perspective of an AGI, we humans will be just as inscrutable, if not more so. If values are not unambiguously well-defined for gliders or planaria, then what hope do we have of stumbling onto well-defined human values at the granularity of AGI optimization processes? In the best case I can imagine a distribution of value calculi with different answers for these simple organisms but nearly identical answers for more complex ones, but if we don't get that kind of convergence, we had better be able to rigorously tell the difference before we send an AGI hunting in that space for one to apply to us.
Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game?
Here’s a scheme for glider alignment. Train your AGI using deep RL in an episodic Game of Life. At each step, it gets +1 reward every time the glider moves forward, and −1 reward if not (i.e. it was disrupted somehow).
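A rough sketch of that reward signal, reusing step(), normalize(), and glider from the Game of Life sketch above. The deep RL agent, its action space, and the training loop are all omitted; the names here are illustrative, and the check assumes the board contains nothing but the glider's own cells.

```python
# Sketch of the per-step +1/-1 reward described above. Once the pattern stops
# matching any phase of the glider (it was disrupted somehow), the reward is -1.

def glider_phases(start, period=4):
    """The normalized shapes the glider cycles through over one period."""
    phases, live = set(), set(start)
    for _ in range(period):
        phases.add(normalize(live))
        live = step(live)
    return phases

PHASES = glider_phases(glider)

def glider_reward(live):
    """+1 while the pattern is still a recognizable glider phase, -1 once disrupted."""
    return 1 if live and normalize(live) in PHASES else -1
```

Checking against all four phases rather than a single reference shape is what lets the signal be given at every step instead of only every fourth generation.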
I assume you are not happy with this scheme because we arbitrarily picked out values for gliders?
Good news! While gliders can’t provide the rewards to the AGI (in order to define their values themselves), humans can provide rewards to the AI systems they are training (in order to define their values themselves). This is the basic (relevant) difference between humans and gliders, and is why I think primarily about alignment to humans and not alignment to simpler systems.
(This doesn’t get into potential problems with inner alignment and informed oversight but I don’t think those change the basic point.)
Hm, upon more thought I actually kind of endorse this as a demo. I think we should be able to run an alignment scheme on C. elegans and get out a universe full of well-fed worms, and that's a decent sign that we didn't screw up, despite the fact that it doesn't engage with several key problems that arise in humans because we're more complicated, have preferences about our preferences, etc. No weird worm-stimulation should be needed. But we do have to accept that we're not getting some notion of values independent of an act of interpretation.