If your measurement/reward is based on “knows the password” or “gives correct password”, you’re measuring a very poor proxy for what you actually want, more like “only people I’ve authorized to access it”. Harder to encode and measure, but also harder to game.
see https://en.wikipedia.org/wiki/Goodhart%27s_law
If your measurement/reward is based on “knows the password” or “gives correct password”, you’re measuring a very poor proxy for what you actually want, more like “only people I’ve authorized to access it”. Harder to encode and measure, but also harder to game.
All forms of wireheading are variants of Goodhart’s law.