This analogy will be better for communicating with some people, but I feel like it was the go-to at some earlier point, and the evolution analogy was invented to fix some problems with this one.
I.e., before “inner alignment” became a big part of the discussion, a common explanation of the alignment problem was essentially what would now be called the outer alignment problem, which is precisely that (seemingly) any goal you write down has smart-alecky misinterpretations which technically do better than the intended interpretation. This is sometimes called the nearest unblocked strategy, or the unforeseen maximum, or probably other jargon I’m forgetting.
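To make the “unforeseen maximum” point concrete, here is a minimal toy sketch (mine, not from the original discussion; the strategy names and numbers are invented): picking the strategy that scores best on the written-down metric need not pick the strategy we actually wanted.

```python
# Toy sketch of a KPI-style specification being gamed.
# The strategies and scores are made up purely for illustration.

strategies = {
    # strategy name: (what we actually wanted, the KPI we wrote down)
    "do the task well":       (10.0,  8.0),
    "do the task adequately": ( 6.0,  6.0),
    "game the metric":        ( 0.0, 12.0),  # smart-alecky misinterpretation
}

intended = {name: scores[0] for name, scores in strategies.items()}
kpi      = {name: scores[1] for name, scores in strategies.items()}

def argmax(scores):
    """Return the strategy with the highest score."""
    return max(scores, key=scores.get)

print(argmax(intended))  # -> "do the task well"
print(argmax(kpi))       # -> "game the metric" (the unforeseen maximum)
```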
The evolution analogy improves on this in some ways. I think one of the most common objections to the KPI analogy is something along the lines of “why is the AI so devoted to malicious compliance?” or “why is the AI so dumb about interpreting what we ask it for?”. Some OK answers to this are...
Gradient descent only optimizes the loss function you give it.
The AI only knows what you tell it.
The current dominant ML paradigm is all about minimizing some formally specified loss. That’s all we know how to do.
… But responses like this are ultimately a bit misleading, since (as the shard theory people emphasize, and as the evolution analogy attempts to explain): what you get out of gradient descent doesn’t treat loss-minimization as its utility function; we don’t know how to make AIs which just intelligently optimize some given utility (except in very well-specified problems where learning isn’t needed); and the AI doesn’t only know what you tell it.
So for some purposes, the evolution analogy is superior.
And yeah, probably neither analogy is great.