This is tempting, but the problem is that I don’t know what my idealized utility function is (e.g., I don’t have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day on a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I could change my mind about it if I realized it’s a bad idea, but how does that fit into the framework?
My own framework is something like this:
- The evaluation process is some combination of gut, intuition, explicit reasoning (e.g., cost-benefit analysis), doing philosophy, and cached answers.
- I think there are “adversarial inputs” because I’ve previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
- I can try to improve my evaluation process by doing things like:
  - look for patterns in my and other people’s mistakes
  - think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
  - do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
  - talk (selectively) to other people
  - try to improve how I do explicit reasoning or philosophy