That’s an interesting proposal! I think something like it might be able to work, though I worry about details. For instance, suppose there’s a Propagandist who gives resources to agents that brainwash their principals into having certain values. If “teach me about philosophy” comes with an influence budget, it seems critical that the AI doesn’t spend that budget trading with Propagandist, and instead spends it in a more “central” way.
Still, the idea of instructions carrying a degree of approved influence seems promising.
Good clarification: it’s not just the amount of influence, but something about the way influence is exercised being unsurprising given the task. Central not just in terms of “how much influence”, but also along whatever other axes the sort of influence could vary?
I think if the agent’s action space is still so unconstrained that there’s room to consider benefit or harm that flows through modifying the principal’s values, it’s probably still been given too much latitude. Once we have informed consent, because the agent has communicated the benefits and harms as best it understands them, it should have very little room to be influenced by benefits and harms it thought too trivial to mention (by virtue of their triviality).
At the same time, it’s not clear the agent should, absent further direction, reject the offer to brainwash the principal for resources, as opposed to punting to the principal. Maybe the principal thinks those values are an improvement and it’s free money? [e.g. Prince’s insurance company wants to bribe him to stop smoking.]
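To make the punting behavior concrete, here is a minimal sketch of the decision rule we’ve been circling, assuming a toy framing in Python; the names (`Offer`, `influence_budget`, `PUNT_TO_PRINCIPAL`, etc.) are hypothetical illustrations, not part of any actual proposal. The idea: spend the approved influence budget only on trivial, task-central effects; never spend it on trades that modify the principal’s values; and surface anything else to the principal rather than deciding unilaterally.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Decision(Enum):
    PROCEED = auto()            # within the approved influence budget
    PUNT_TO_PRINCIPAL = auto()  # surface the trade; let the principal decide
    DISCLOSE_FIRST = auto()     # explain benefits/harms, then ask for consent


@dataclass
class Offer:
    description: str
    modifies_principal_values: bool  # e.g. the Propagandist's trade
    influence_cost: float            # how much approved influence it would spend
    trivial: bool                    # too minor to have been worth mentioning


def decide(offer: Offer, influence_budget: float) -> Decision:
    """Toy policy sketch: never spend the budget on modifying the
    principal's values; punt those trades to the principal instead."""
    if offer.modifies_principal_values:
        return Decision.PUNT_TO_PRINCIPAL
    if offer.trivial and offer.influence_cost <= influence_budget:
        return Decision.PROCEED
    return Decision.DISCLOSE_FIRST


# Example: the Propagandist's offer gets punted, not accepted or silently refused.
propagandist = Offer("resources in exchange for instilling values X",
                     modifies_principal_values=True,
                     influence_cost=0.0, trivial=False)
assert decide(propagandist, influence_budget=1.0) == Decision.PUNT_TO_PRINCIPAL
```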