With regards to the agent believing that it’s impossible to influence the probability that its plan passes validation
This is a misinterpretation. The agent has entirely true beliefs. It knows it could manipulate the validation step; it just doesn’t want to, because of the conditional shape of its goal. This is common behaviour among humans: for example, you wouldn’t mess up a medical test to make it come out negative, because you need to know the result in order to know what to do afterwards.
Oh I see. In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes? I think it might involve incomplete preferences.
Here’s why I say that. For the agent to be useful, it needs to have some preference between plans conditional on their passing validation: there must be some plan A and some plan A+ such that the agent prefers A+ to A. Then, given Completeness and Transitivity, the agent can’t lack a preference between shutdown and both A and A+. If the agent lacks a preference between shutdown and A, it must prefer A+ to shutdown, and it might then try to increase the probability that A+ passes validation. If the agent lacks a preference between shutdown and A+, it must prefer shutdown to A, and it might then try to decrease the probability that A passes validation. This is basically my Second Theorem and the point that John Wentworth makes here.
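Spelled out (notation mine: S for shutdown, ≻ for strict preference, ∼ for the indifference that lacking a preference amounts to under Completeness):

$$
\begin{aligned}
&\text{Given: } A^{+} \succ A.\\
&\text{If } S \sim A\text{: then } A^{+} \succ A \sim S \;\Rightarrow\; A^{+} \succ S\text{, so the agent may try to raise } \Pr(A^{+}\text{ passes validation}).\\
&\text{If } S \sim A^{+}\text{: then } S \sim A^{+} \succ A \;\Rightarrow\; S \succ A\text{, so the agent may try to lower } \Pr(A\text{ passes validation}).
\end{aligned}
$$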
I’m not sure the medical test is a good analogy. I don’t mess up the medical test because true information is instrumentally useful to me, given my goals. But (it seems to me) true information about whether a plan passes validation is only instrumentally useful to the agent if the agent’s goal is to do what we humans really want. And that’s something we can’t assume, given the difficulty of alignment.
In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes?
We can’t reduce the domain of the utility function without destroying some information. If we tried to change the domain variables from [g, h, shutdown] to [g, shutdown], we wouldn’t get the desired behaviour. Maybe you have a particular translation method in mind?
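For concreteness, here is a minimal sketch of the kind of conditional goal I mean; the payoff values and functional form are purely illustrative, not the actual construction:

```python
# Minimal illustrative sketch (placeholder payoffs, not the actual construction).
# Outcomes range over the domain [g, h, shutdown]:
#   g        -- degree to which the object-level goal is achieved (float in [0, 1])
#   h        -- whether the plan passes validation (bool)
#   shutdown -- whether the agent shuts down (bool)

def u(g: float, h: bool, shutdown: bool) -> float:
    """Conditional goal: pursue g if the plan passes validation, shut down if it doesn't."""
    if h:
        return 0.0 if shutdown else g   # validation passed: utility tracks g
    return 1.0 if shutdown else 0.0     # validation failed: utility rewards shutting down
```

No function over [g, shutdown] alone can reproduce this behaviour: whether shutting down is rewarded depends on h, so dropping h from the domain destroys exactly the information that makes the goal conditional.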
I don’t mess up the medical test because true information is instrumentally useful to me, given my goals.
Yep, that’s what I meant. The goal u is constructed to make information about h instrumentally useful for achieving u, even if g is poorly specified. The agent can prefer h over ~h or vice versa, just as we prefer a particular outcome of a medical test. But because of the instrumental (information) value of the test, we don’t interfere with it.
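A toy expected-utility calculation of that information value, with made-up numbers, reusing the illustrative u above and treating h as the verdict an untampered validation would deliver (so tampering only corrupts what the agent observes):

```python
# Toy illustration (made-up prior and payoffs). h is the verdict an untampered
# validation would deliver; tampering corrupts the agent's observation of h,
# not which outcomes u rewards.

P_PASS = 0.6  # hypothetical probability that the plan would genuinely pass validation

def expected_u(tamper: bool) -> float:
    if not tamper:
        # Untampered test: the agent learns h and conditions on it --
        # execute when h (achieving g = 1.0), shut down when ~h (also worth 1.0).
        return P_PASS * 1.0 + (1 - P_PASS) * 1.0
    # Tampered test: the report always reads "pass", so the agent executes either way,
    # and in the ~h worlds the conditional goal scores the outcome 0.
    return P_PASS * 1.0 + (1 - P_PASS) * 0.0

assert expected_u(False) > expected_u(True)  # leaving the test alone has higher expected u
```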
I think the utility indifference genre of solutions (which try to avoid preferences between shutdown and not-shutdown) is unnatural and creates other problems. My approach allows the agent to shut down even if it would prefer to be in the non-shutdown world.