To generalize:
Minimal squishiness. You probably need something like a neural net to create a world-model for the AI to use, but you could probably do everything else with carefully reviewed human-written code that “plugs in” to concepts in the world-model. (It is probably best to code in a fallback for what to do if a concept you plugged into disappears or fragments as the AI gains more information.)
Abstract goals. The world-model needs enough detail to point to the right concept (e.g. a human-value-related goal), but as long as it does so, the AI doesn’t need to know everything about human values; it can simply be uncertain and act under that uncertainty (which can include risk-aversion measures, asking humans, and so on).
Present-groundedness. The AI’s decision-making procedure should not care about the future directly, only via how humans care about the future. Otherwise it might, e.g., replace humans with utility monsters.
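The three principles above could be sketched as a toy architecture. This is purely illustrative, not a proposal for a real implementation: every name here (`WorldModel`, `choose_action`, the `"human_values"` concept key) is a hypothetical stand-in, and the "world-model" is just a dictionary standing in for a learned model.

```python
class WorldModel:
    """The learned component — the only 'squishy' part of the system."""

    def __init__(self, concepts):
        self.concepts = dict(concepts)  # concept name -> learned representation

    def lookup(self, name):
        # Returns None if the concept has disappeared or fragmented
        # after the model updated on new information.
        return self.concepts.get(name)


def risk_averse_score(action, value_hypotheses):
    """Abstract goals: score an action against several hypotheses about
    human values and take the worst case (one simple risk-aversion rule)."""
    return min(hypothesis(action) for hypothesis in value_hypotheses)


def choose_action(world_model, actions, value_hypotheses):
    """Hand-written decision procedure that plugs into the world-model.

    Present-groundedness: actions are evaluated only via current human
    preferences (the value hypotheses), never via a direct terminal
    preference over future world-states."""
    target = world_model.lookup("human_values")  # hand-coded plug-in point
    if target is None:
        # Coded-in fallback: the concept we plugged into is gone.
        return "halt_and_ask_humans"
    return max(actions, key=lambda a: risk_averse_score(a, value_hypotheses))
```

The worst-case scoring rule is just one way to act under value uncertainty; soft-min, Bayesian mixtures, or deferring to humans on low-confidence comparisons would fit the same slot.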