Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game?
Here’s a scheme for glider alignment. Train your AGI using deep RL in an episodic Game of Life. At each step, it gets +1 reward if the glider moves forward, and −1 reward if not (i.e. it was disrupted somehow).
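A minimal sketch of what that reward scheme could look like, assuming a toroidal Game of Life grid, a toy action space where the agent toggles one cell per step, and "the glider moved forward" approximated as a shift in the center of mass of the live cells. The names (`GliderRewardEnv`, `GRID_SIZE`, `life_step`) and those modeling choices are illustrative, not part of the original proposal:

```python
import numpy as np

GRID_SIZE = 32

# Standard glider pattern, given as (row, col) offsets.
GLIDER = [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]


def life_step(grid):
    """One Game of Life update on a toroidal grid."""
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)


class GliderRewardEnv:
    """Episodic environment: +1 if the glider advanced this step, -1 otherwise."""

    def __init__(self, episode_len=100):
        self.episode_len = episode_len

    def reset(self):
        self.grid = np.zeros((GRID_SIZE, GRID_SIZE), dtype=np.uint8)
        for dy, dx in GLIDER:
            self.grid[5 + dy, 5 + dx] = 1
        self.t = 0
        self.prev_com = self._center_of_mass()
        return self.grid.copy()

    def _center_of_mass(self):
        ys, xs = np.nonzero(self.grid)
        if len(ys) == 0:
            return None
        return ys.mean(), xs.mean()

    def step(self, action):
        # Illustrative action space (an assumption): toggle a single cell.
        y, x = action
        self.grid[y % GRID_SIZE, x % GRID_SIZE] ^= 1

        self.grid = life_step(self.grid)
        self.t += 1

        com = self._center_of_mass()
        # +1 if the live pattern kept drifting (glider moved forward),
        # -1 if it stalled or died (was disrupted somehow).
        moved = com is not None and self.prev_com is not None and com != self.prev_com
        reward = 1 if moved else -1
        self.prev_com = com

        done = self.t >= self.episode_len or com is None
        return self.grid.copy(), reward, done, {}
```

Any deep RL algorithm could then be run against `step`/`reset` in the usual episodic loop; the point is only that the reward signal is fixed by us, the designers, which is exactly the objection anticipated below.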
I assume you are not happy with this scheme because we arbitrarily picked out values for gliders?
Good news! While gliders can’t provide the rewards to the AGI (in order to define their values themselves), humans can provide rewards to the AI systems they are training (in order to define their values themselves). This is the basic (relevant) difference between humans and gliders, and is why I think primarily about alignment to humans and not alignment to simpler systems.
(This doesn’t get into potential problems with inner alignment and informed oversight, but I don’t think those change the basic point.)