Basically, the AI does the following:
Create a list of possible futures that it could cause.
For each of those futures, and for each person alive at the time of the AI’s activation:
1. Simulate convincing that person that the future is going to happen.
2. If the person would try to help the AI, add 1 to the utility of that future; if the person would try to stop the AI, subtract 1 from the utility of that future.
Cause the future with the highest utility.
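Roughly, in Python (enumerate_futures, people_at_activation, and simulate_reaction are stand-in names for the hard parts this doesn’t specify):

```python
# Minimal sketch of the proposed procedure. `enumerate_futures`,
# `people_at_activation`, and `simulate_reaction` are hypothetical
# stand-ins; nothing here says how to actually implement them.

def choose_future(enumerate_futures, people_at_activation, simulate_reaction):
    best_future, best_utility = None, float("-inf")
    for future in enumerate_futures():          # futures the AI could cause
        utility = 0
        for person in people_at_activation():   # people alive at activation time
            # Simulate convincing the person that `future` is going to happen,
            # and observe whether they would help or resist the AI.
            reaction = simulate_reaction(person, future)
            if reaction == "help":
                utility += 1
            elif reaction == "resist":
                utility -= 1
            # indifferent reactions contribute nothing
        if utility > best_utility:
            best_future, best_utility = future, utility
    return best_future                          # cause the highest-utility future
```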
The usual weaknesses:
How would the AI describe the future? Different descriptions of the same future may elicit opposite reactions.
What about things beyond current human understanding? How is the simulated person going to decide whether they are good or bad?
And the new one:
The “this future is going to happen anyway, now I will observe your actions” approach would give a high score e.g. to futures that are horrible, but where everyone who refuses to cooperate with the omnipotent AI suffers an even worse fate (because as long as the threat seems realistic and the AI unstoppable, it makes sense for the simulated person to submit and help).
EDIT: Probably an even higher score for futures that are “meh but kinda okay, except that everyone who refuses to help (after being explicitly told that refusing to help is punished by horrible torture) is tortured horribly”. The fact that the futures are “kinda okay”, and that only people who ignore an explicit warning are tortured, would give an excuse to the simulated person, so fewer of them would choose to become martyrs and thereby provide the −1 vote.
Especially if the simulated person were told that, so far, everyone has chosen to help, so no one is in fact tortured, but the AI still has a strong precommitment to follow the rules if necessary.
I think if you were to spend time fleshing this out, operationalizing it and thinking of how to handle various edge cases (or not-so-edge-cases), you’d probably end up with something closer to Coherent Extrapolated Volition.
The most obvious issue I see here is that “list all possible futures and then simulate talking to each person” is pretty computationally intractable.
That was never fleshed out itself.
One thing I’d be concerned about is that there are a lot of possible futures that sound really appealing, and that a normal human would sign off on, but are actually terrible (similar concept: siren worlds).
For example, in a world of Christians the AI would score highly on a future where they get to eternally rest and venerate God, which would get really boring after about five minutes. In a world of Rationalists the AI would score highly on a future where they get to live on a volcano island with catgirls, which would also get really boring after about five minutes.
There are potentially lots of futures like this (that might work for a wider range of humans), and because the metric (inferred approval after it’s explained) is different from the goal (whether the future is good) and there’s optimisation pressure increasing with the number of futures considered, I would expect it to be Goodharted.
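A toy illustration of that optimisation pressure (my own numbers, just assuming approval is “goodness plus evaluation noise”): the more futures the AI considers, the more the winning future’s approval overstates how good it actually is.

```python
# Toy illustration (assumed model, not part of the proposal): each candidate
# future has a true "goodness", and the approval the AI measures is goodness
# plus noise (humans mis-evaluating after being convinced). Picking the
# approval-maximising future then systematically picks futures whose approval
# exceeds their goodness, and the gap grows with the number of candidates.
import random

def expected_overestimate(n_futures, noise=1.0, trials=500):
    """Average amount by which the chosen future's approval exceeds its goodness."""
    total_gap = 0.0
    for _ in range(trials):
        candidates = [(g + random.gauss(0, noise), g)          # (approval, goodness)
                      for g in (random.gauss(0, 1) for _ in range(n_futures))]
        approval, goodness = max(candidates)   # the AI causes the highest-approval future
        total_gap += approval - goodness
    return total_gap / trials

for n in (10, 100, 10000):
    print(n, round(expected_overestimate(n), 2))   # the gap grows as n grows
```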
Some possible questions this raises:
On futures: I can’t store the entire future in my head, so the AI would have to only describe some features. Which features? How to avoid the selection of features determining the outcome?
On people: What if the future involves creating new people, who would want to live in that future even if most people currently alive would not? What about animals? What about babies?
The “convincing” part here seems underspecified. Even smart people can be persuaded by good enough persuaders to join a cult and commit collective suicide, so I don’t think that the AI being able to convince someone to help it would be major evidence that the AI was actually aligned.
This AI wouldn’t be trying to convince a human to help it, just that it’s going to succeed.
So instead of convincing humans that a hell-world is good, it would convince the humans that it was going to create a hell-world (and they would all disapprove, so it would score low).
I think what this ends up doing is having everyone agree with a world that sounds superficially good but is actually terrible in a way that’s difficult for unaided humans to realize. E.g., the AI convinces everyone that it will create an idyllic natural world where people live forager lifestyles in harmony etc. etc.; everyone approves because they like nature and harmony and stuff; it proceeds to create such an idyllic natural world; and wild animal suffering outweighs human enjoyment forevermore.
A hole big enough that it seems too obvious to point out. “The climate is going to change.” “Well, duh.” The human helped the AI convince a human that climate change is going to happen: +1.
I would assume that the AI would be asking “do you want me to bring this about?”. Whether the person tries to stop it might depend on how they perceive the change as coming about. For example, if the AI convinced them that the human themselves is making climate change happen, they might object to climate change but have psychological difficulty resisting themselves.
There is also the issue that if you are convinced that something is happening, then resistance is futile. For sensible resistance to manifest, it needs to seem not too late to affect the thing, which means the looming outcome can’t appear nearly inevitable. If you are convinced that atom bombs will hit the ground in 5 minutes, you think of cool last words, not of how to object to that (but the function would count this as a plus).
Say there is one person that a lot of other people hate. If you were to gather everybody to vote on whether to exile or murder that person, people might vote one way. Now have everyone approve of the simulated future where he is dead. Aggregating the “uncaused” effects might lead to a death verdict where a self-conscious decision process would not give such a verdict.