One thing I’d be concerned about is that there are a lot of possible futures that sound really appealing and that a normal human would sign off on, but that are actually terrible (similar concept: siren worlds).
For example, in a world of Christians the AI would score highly on a future where they get to eternally rest and venerate God, which would get really boring after about five minutes. In a world of Rationalists the AI would score highly on a future where they get to live on a volcano island with catgirls, which would also get really boring after about five minutes.
There are potentially lots of futures like this, including ones that might appeal to a much wider range of humans. Because the metric (inferred approval once the future is explained) differs from the goal (whether the future is actually good), and the optimisation pressure grows with the number of futures considered, I would expect the metric to get Goodharted.
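To make the optimisation-pressure point a bit more concrete, here’s a toy simulation (mine, not anything from the proposal; the distributions and numbers are made up): each candidate future has a true goodness, the AI only sees a noisy approval score, and it shows me the highest-scoring candidate. The more candidates it searches over, the better the winner looks relative to how good it actually is, and with heavier-tailed scoring error the winner’s actual goodness roughly stops improving at all.

```python
# Toy model of the optimisation-pressure argument (an illustration only; the
# distributions below are assumptions, not anything from the proposal).
# Each candidate future has a true goodness V ~ N(0, 1). The AI only sees a noisy
# "inferred approval" proxy S = V + error and presents the candidate with the highest S.

import numpy as np

rng = np.random.default_rng(0)

def winner_stats(n_candidates, error_sampler, trials=500):
    """Mean (proxy score, true goodness) of the proxy-maximising candidate."""
    value = rng.normal(size=(trials, n_candidates))        # how good the future really is
    proxy = value + error_sampler((trials, n_candidates))  # how good it looks when explained
    best = proxy.argmax(axis=1)                            # the future the AI shows you
    rows = np.arange(trials)
    return proxy[rows, best].mean(), value[rows, best].mean()

error_models = {
    "gaussian approval error":     lambda shape: rng.normal(size=shape),
    "heavy-tailed approval error": lambda shape: rng.standard_t(df=3, size=shape),
}

for name, sampler in error_models.items():
    print(name)
    for n in (10, 1_000, 10_000):
        looks, actual = winner_stats(n, sampler)
        print(f"  {n:>6} candidate futures: looks like {looks:5.2f}, is actually {actual:5.2f}")
```

Nothing hinges on these particular distributions; the point is just that “show the human the most-approved of very many candidate futures” selects hard on whatever the approval signal gets wrong, so the gap between apparent and actual goodness widens as the search gets bigger.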
Some possible questions this raises:
On futures: I can’t store the entire future in my head, so the AI would have to describe only some of its features. Which features? And how do you avoid the choice of features determining the outcome?
On people: What if the future involves creating new people, whom most people currently alive would want to live in that future? What about animals? What about babies?