One reason I care about this is that I am concerned about approaches to AI safety that involve modeling humans to try to learn human values.
I also have concerns about such approaches, and I agree with the reason you gave for being more concerned about procedure B (“it would be nice to be able to save human approval as a test set”).
I did not understand how this relates specifically to gradient descent. Whether gradient descent (relative to other optimization algorithms) tends to find unsafe solutions, assuming no inner optimizers appear, seems to me to depend on fuzzy properties of the particular problem at hand.
One could design problems in which gradient descent is expected to find less-aligned solutions than non-local search algorithms (e.g. a problem in which most solutions are safe, but if you hill-climb from them you get to higher-utility but unsafe solutions; a toy sketch of this kind of problem appears below). One could also design problems in which this is not the case (e.g. when everything that can go wrong is the agent breaking the vase, and breaking the vase allows higher utility solutions).
Do you have an intuition that real-world problems tend to be such that the first solution found with utility of at least X is likely to be safer when using random sampling (assuming unlimited computing power) than when using gradient descent?
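To make the first kind of construction concrete, here is a minimal toy sketch (my own construction, not something from the thread): a two-dimensional search space in which utility grows along one axis, but the gradient also funnels the search toward a thin "unsafe" channel. Hill-climbing from a low-utility initialization therefore tends to cross the utility threshold inside the channel, whereas uniform random sampling lands anywhere in the above-threshold set, most of which is safe. The utility function, the width of the unsafe strip, and the low-utility initialization are all arbitrary modeling choices.

```python
import random

# Toy 2-D search space: a solution is a point (x, y) in [0, 1]^2.
# Utility grows with x but is penalized for straying from y = 0.5, so
# gradient ascent is funneled toward the channel around y = 0.5.
# We declare a thin strip around that channel to be unsafe; most of the
# space, and most of the above-threshold region, is safe.

A = 2.5                  # strength of the funnel toward y = 0.5
X = 0.5                  # stop at the first solution with utility >= X
UNSAFE_HALFWIDTH = 0.05  # the unsafe channel is |y - 0.5| < 0.05

def utility(x, y):
    return x - A * (y - 0.5) ** 2

def is_safe(x, y):
    return abs(y - 0.5) >= UNSAFE_HALFWIDTH

def random_sampling():
    """Draw uniform solutions until one has utility >= X."""
    while True:
        x, y = random.random(), random.random()
        if utility(x, y) >= X:
            return x, y

def gradient_ascent(step=0.01):
    """Hill-climb from a low-utility initialization until utility >= X."""
    x, y = 0.0, random.random()
    while utility(x, y) < X:
        gx, gy = 1.0, -2 * A * (y - 0.5)   # analytic gradient of utility
        x, y = x + step * gx, y + step * gy
    return x, y

def safe_fraction(search, trials=2000):
    return sum(is_safe(*search()) for _ in range(trials)) / trials

print("random sampling, fraction safe:", safe_fraction(random_sampling))
print("gradient ascent, fraction safe:", safe_fraction(gradient_ascent))
```

On this particular landscape, gradient ascent is pulled into the unsafe channel before it crosses the threshold on essentially every run, while random sampling is safe most of the time; a construction of the second kind, where local search stops safely at the threshold boundary and random sampling does no better, is just as easy to write down.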
when everything that can go wrong is the agent breaking the vase, and breaking the vase allows higher utility solutions
What does “breaking the vase” refer to here?
I would assume this is an allusion to the scene in The Matrix with Neo and the Oracle (where there’s a paradox about whether Neo would have broken the vase if the Oracle hadn’t said, “Don’t worry about the vase,” causing Neo to turn around to look for the vase and then bump into it), but I’m having trouble seeing how that relates to sampling and search.
“Breaking the vase” is a reference to an example that people sometimes give of an accident in reinforcement learning caused by a reward function that is not fully aligned with what we want. The scenario is a robot navigating a room that contains a vase: we care about the vase, but the reward function we provided does not account for it, so the robot simply knocks the vase over because it is on the shortest path to wherever the robot is going.
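For concreteness, here is a minimal, made-up rendering of that scenario (the grid layout, the paths, and the penalty values are all hypothetical): a small gridworld in which the reward we wrote only penalizes time and says nothing about the vase, so the reward-optimal path walks straight through it.

```python
# A tiny 2x5 gridworld: the robot walks from (0, 0) to (0, 4), and a vase
# sits at (0, 2). The reward we actually specified only penalizes time
# (-1 per step) and never mentions the vase.

VASE = (0, 2)
STEP_REWARD = -1.0
VASE_PENALTY = -10.0   # the term we *should* have included

direct = [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)]                   # 4 steps, breaks the vase
detour = [(0, 0), (1, 0), (1, 1), (1, 2), (1, 3), (0, 3), (0, 4)]   # 6 steps, vase intact

def episode_return(path, care_about_vase=False):
    r = STEP_REWARD * (len(path) - 1)
    if care_about_vase and VASE in path:
        r += VASE_PENALTY
    return r

# Under the reward we actually wrote, knocking over the vase looks better:
assert episode_return(direct) > episode_return(detour)                    # -4 > -6
# Under the reward we meant to write, the detour wins:
assert episode_return(detour, True) > episode_return(direct, True)        # -6 > -14
```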
What does “breaking the vase” refer to here?
Presumably this:
https://www.lesswrong.com/posts/H7KB44oKoSjSCkpzL/worrying-about-the-vase-whitelisting