Agree on SGCA, if only because something is likely to self-modify into one; disagree on expected utility maximization necessarily being the most productive way to think about it.
Consider the following two hypothetical agents:
Agent 1 follows the deontological rule of choosing the action that maximizes some expected utility function.
Agent 2 maximizes expected utility, where utility is defined as how well an objective god's-eye-view observer would rate Agent 2's conformance to some deontological rule.
Obviously Agent 1 is more naturally expressed in utilitarian terms and Agent 2 in deontological terms, but each can be described either way, and both can be coherent.
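To make the symmetry concrete, here's a rough Python sketch of the two agents. Everything in it (the toy setting, the names, the lottery example) is mine and purely illustrative:

```python
import random

# Toy setting: an "action" is a callable that returns a (possibly random) outcome.
# All names here are made up for illustration, not from any real framework.

def expected_utility(action, utility, n_samples=1000):
    """Monte Carlo estimate of E[utility(outcome)] for a stochastic action."""
    return sum(utility(action()) for _ in range(n_samples)) / n_samples

# Agent 1: its single deontological rule is "choose the EU-maximizing action".
def agent1_choose(actions, utility):
    return max(actions, key=lambda a: expected_utility(a, utility))

# Agent 2: it maximizes a utility function, but that function only scores how well
# a god's-eye-view observer would rate the action's conformance to some rule.
def agent2_choose(actions, rule):
    conformance = lambda a: 1.0 if rule(a) else 0.0   # the "utility" being maximized
    return max(actions, key=conformance)

# Example: two lotteries over money; Agent 2's rule is simply "take the safe bet".
safe = lambda: random.choice([1, 2])
risky = lambda: random.choice([-10, 20])
print(agent1_choose([safe, risky], utility=lambda x: x) is risky)      # True: higher EU
print(agent2_choose([safe, risky], rule=lambda a: a is safe) is safe)  # True: rule conformance
```

Both agents are literally a `max(..., key=...)` over actions, which is the sense in which each is "both".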
Now, when we try to define what decision procedure an aligned AI could follow, it might turn out that there’s no easy way to express what we want it to do in purely utilitarian terms, but it might be easier in some other terms.
I especially think that’s likely to be the case for corrigibility, but also for alignment generally.
I mean, sure, but I'd describe both as utility maximizers, because maximizing utility is in fact what they consistently do. Dragon God's claim seems to be that we wouldn't get an AI that would be particularly well predicted by utility maximization, and this seems straightforwardly false of Agents 1 and 2.
Yes, but:
If you were trying to design something that acts like Agent 2 and were stuck in the mindset of "it must be maximizing some utility function, let's just think in utility-function terms", you might find it difficult.
(Side point) I’m not sure how much the arguments in Eliezer’s linked post actually apply outside the consequentialist context, so I’m not convinced that coherence necessarily implies a possible utility function for non-consequentialist agents.
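To spell out the consequentialist version of the coherence argument: the standard money-pump says an agent with a preference cycle over outcomes will keep paying to trade in a circle, and avoiding that exploitation is what forces a utility-function representation. Below is a minimal, made-up sketch of that (not taken from the linked post); whether an analogous argument goes through for agents whose choices depend on the act itself rather than the outcome is the part I'm unsure about.

```python
# A toy money pump against cyclic outcome preferences (purely illustrative).
# The agent strictly prefers A > B, B > C, and C > A -- no utility function
# over {A, B, C} can represent that cycle.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means "x preferred to y"

def accepts_trade(current, offered):
    """The agent pays a small fee for anything it strictly prefers to what it holds."""
    return (offered, current) in prefers

def money_pump(start="C", fee=1, rounds=9):
    holding, money = start, 0
    offers = ["B", "A", "C"]                     # each offer beats the previous holding
    for i in range(rounds):
        offer = offers[i % 3]
        if accepts_trade(holding, offer):
            holding, money = offer, money - fee
    return money

print(money_pump())  # -9: the agent ends up holding what it started with, strictly poorer
```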
It might be that the closest thing to what we want that we can actually figure out how to build isn't coherent, in which case we would face a choice between:
(a) making it and hoping that its likely self-modification towards coherence won't ruin its alignment, or
(b) making something else that is coherent to start with but is less aligned.
While (a) is risky, (b) seems worse to me.