> Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions,
I feel pretty uncertain of what assumptions are hiding in your “optimize strongly against X” statements. Historically this just seems hard to tease out, and I wouldn’t be surprised if I were just totally misreading you here.
That said, your writing makes me wonder “where is the heavy optimization [over the value definitions] coming from?”, since I think the preference-shards themselves are the things steering the optimization power. For example, the shards are not optimizing over themselves to find adversarial examples to themselves. Related statements:
- I think that a realistic “respecting preferences of weak agents”-shard doesn’t bid for plans which maximally activate the “respect preferences of weak agents” internal evaluation metric, or even do some tight bounded approximation thereof (see the toy sketch after this list).
- A “respect weak preferences” shard might also guide the AI’s value and ontology reformation process.
- A nice person isn’t being maximally nice, nor do they wish to be; they are nicely being nice.
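To gesture at that first point concretely, here’s a toy sketch (all names and numbers are made up for illustration; I’m not claiming shards are literally little scoring functions) contrasting a planner that searches for whatever maximally activates the shard’s internal metric with a shard that just contributes a bounded, contextual bid alongside the agent’s other decision-influences:

```python
# Toy contrast (illustrative only; names like `respect_metric` are made up):
# a grader-optimizer searches plan-space for whatever maximally activates the
# shard's internal evaluation metric, so metric-gaming plans win; a shard that
# just contributes a bounded bid among other influences doesn't select them.
from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    respect_metric: float    # shard's internal "respects weak agents" evaluation
    other_influences: float  # bids from the agent's other shards

plans = [
    Plan("ordinary cooperative plan", respect_metric=0.7, other_influences=0.8),
    Plan("plan that games the metric (adversarial input)", respect_metric=10.0, other_influences=0.1),
]

def grader_optimized_choice(plans):
    # Optimizes *the metric itself*: adversarial examples to the metric dominate.
    return max(plans, key=lambda p: p.respect_metric)

def shard_influenced_choice(plans):
    # The shard's bid saturates; it nudges the decision rather than being the
    # target of a global search for metric-maximizing inputs.
    return max(plans, key=lambda p: min(p.respect_metric, 1.0) + p.other_influences)

print(grader_optimized_choice(plans).description)   # -> the metric-gaming plan
print(shard_influenced_choice(plans).description)   # -> the ordinary cooperative plan
```

The sketch obviously leaves out the interesting parts (how the shard shapes which plans get generated in the first place, and how it steers reflection), but it gestures at the coarse distinction I’m pointing at: the adversarial, metric-gaming plan only wins when something is globally optimizing the metric itself.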
I do agree (insofar as I understand you enough to agree) that we should worry about some “strong optimization over the AI’s concepts, later in AI developmental timeline.” But I think different kinds of “heavy optimization” lead to different kinds of alignment concerns.