This is tempting, but the problem is that I don’t know what my idealized utility function is (e.g., I don’t have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day on a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I could change my mind about it if I realized it’s a bad idea, but how does that fit into the framework?
My own framework is something like this:
- The evaluation process is some combination of gut, intuition, explicit reasoning (e.g., cost-benefit analysis), doing philosophy, and cached answers.
- I think there are “adversarial inputs” because I’ve previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
- I can try to improve my evaluation process by doing things like:
  - look for patterns in my and other people’s mistakes
  - think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
  - do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
  - talk (selectively) to other people
  - try to improve how I do explicit reasoning or philosophy