Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.
Improve it with respect to what?
My attempt at a framework where “improving one’s own evaluator” and “believing in adversarial examples to one’s own evaluator” make sense:
The agent’s allegiance is to some idealized utility function U_ideal (like CEV). The agent’s internal evaluator Eval is “trying” to approximate U_ideal by reasoning heuristically. So now we ask Eval to evaluate the plan “do argmax w.r.t. Eval over a bunch of plans”. Eval reasons that, due to the way that Eval works, there should exist “adversarial examples” that score very highly on Eval but low on U_ideal. Hence, Eval concludes that U_ideal(plan) is low, where plan = “do argmax w.r.t. Eval”. So the agent doesn’t execute the plan “search widely and argmax”.
“Improving Eval” makes sense because Eval will gladly replace itself with Eval2 if it believes that Eval2 is a better approximation for U_ideal (and hence that replacing itself will cause the outcome to score better on U_ideal).
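To make the “adversarial examples under wide search” intuition concrete, here is a toy numerical sketch (my own illustration, not part of the framework above). It assumes Eval behaves like U_ideal plus independent noise, which is only a crude stand-in for real adversarial examples, and looks at the gap between Eval and U_ideal at the plan chosen by argmax w.r.t. Eval. All names and numbers are made up for illustration.

```python
# Toy sketch (illustrative assumption, not from the original comments): treat
# U_ideal(plan) as a latent score and Eval(plan) as U_ideal(plan) plus independent
# noise, then see what happens when the agent argmaxes w.r.t. Eval over wider searches.
import numpy as np

rng = np.random.default_rng(0)

def gap_at_argmax(n_plans: int, eval_noise: float, n_trials: int = 2000) -> float:
    """Mean (Eval - U_ideal) at the Eval-chosen plan, averaged over random trials."""
    u_ideal = rng.normal(size=(n_trials, n_plans))          # "true" value of each plan
    eval_score = u_ideal + rng.normal(scale=eval_noise, size=(n_trials, n_plans))
    chosen = eval_score.argmax(axis=1)                      # plan picked by argmax w.r.t. Eval
    rows = np.arange(n_trials)
    return float((eval_score[rows, chosen] - u_ideal[rows, chosen]).mean())

for n in (10, 100, 1000):
    print(f"search width {n:>4}:  Eval overshoot = {gap_at_argmax(n, eval_noise=1.0):.2f}, "
          f"with a less-noisy Eval2 = {gap_at_argmax(n, eval_noise=0.3):.2f}")
```

In this toy, the overshoot at the argmax grows with search width (the chosen plan is disproportionately one whose Eval score happens to be inflated) and shrinks when Eval is swapped for a less-noisy Eval2, which is the sense in which “don’t search too widely” and “improving Eval” both fall out of the same picture. Real adversarial examples would presumably be worse than i.i.d. noise, so this only illustrates the direction of the effect.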
Are there other distinct frameworks which make sense here? I look forward to seeing what design Alex proposes for “value child”.
This is tempting, but the problem is that I don’t know what my idealized utility function is (e.g., I don’t have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realize that it’s a bad idea, but how does that fit into the framework?
My own framework is something like this:
The evaluation process is some combination of gut, intuition, explicit reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.
I think there are “adversarial inputs” because I’ve previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
I can try to improve my evaluation process by doing things like:
look for patterns in my and other people’s mistakes
think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
talk (selectively) to other people
try to improve how I do explicit reasoning or philosophy
Yeah I think you’re on the right track.
A simple framework (that probably isn’t strictly distinct from the one you mentioned) would be that the agent has a foresight evaluation method that estimates “How good do I think this plan is?” and a hindsight evaluation method that calculates “How good was it, really?”. There can be plans that trick the foresight evaluation method relative to the hindsight one. For example, I can get tricked into thinking some outcome is more likely than it actually is (“The chances of losing my client’s money with this investment strategy were way higher than I thought they were.”) or thinking that some new state will be hindsight-evaluated better than it actually will be (“He convinced me that if I tried coffee, I would like it, but I just drank it and it tastes disgusting.”), etc.
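As a minimal sketch of this foresight/hindsight split (my own illustrative decomposition, not something specified in the comment), the two examples can be read as two different ways foresight gets tricked: mis-estimating how likely the good outcome is, versus mis-predicting how the outcome will be scored in hindsight. The class names and numbers below are made up for illustration.

```python
# Minimal sketch of a foresight vs. hindsight evaluation split. Everything here
# (names, numbers) is illustrative, not from the original comment.
from dataclasses import dataclass

@dataclass
class Forecast:
    p_success: float         # foresight: how likely I think the good outcome is
    value_if_success: float  # foresight: how good I predict it will be, in hindsight terms
    value_if_failure: float

    def foresight_score(self) -> float:
        """'How good do I think this plan is?'"""
        return self.p_success * self.value_if_success + (1 - self.p_success) * self.value_if_failure

@dataclass
class Outcome:
    succeeded: bool
    realized_value: float    # hindsight: 'How good was it, really?'

def how_was_foresight_tricked(f: Forecast, o: Outcome) -> str:
    if not o.succeeded and f.p_success > 0.5:
        return "probability error: the bad outcome was likelier than I thought"
    if o.succeeded and o.realized_value < f.value_if_success:
        return "value error: I mispredicted my own hindsight evaluation"
    return "foresight roughly matched hindsight"

# The investment and coffee examples, in this toy framing:
cases = {
    "investment": (Forecast(p_success=0.9, value_if_success=10.0, value_if_failure=-50.0),
                   Outcome(succeeded=False, realized_value=-50.0)),
    "coffee":     (Forecast(p_success=0.95, value_if_success=5.0, value_if_failure=0.0),
                   Outcome(succeeded=True, realized_value=-2.0)),
}
for name, (f, o) in cases.items():
    print(f"{name}: foresight={f.foresight_score():+.1f}, hindsight={o.realized_value:+.1f} "
          f"-> {how_was_foresight_tricked(f, o)}")
```

In this framing, a plan that tricks the foresight evaluation method relative to the hindsight one is just a plan whose foresight score is high for one of these two reasons while its realized hindsight score is low.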
The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:
Do you think that you are internally trying to approximate your own U_ideal?
Do you think that you have ever made the decision (either implicitly or explicitly) not to eval all or most plans because you don’t trust your ability to do so for adversarial examples (as opposed to tractability issues, for example)?
Can you think of concrete instances where you improved your own Eval?
Can you think of concrete instances where you thought you improved your own Eval but then regretted it later?
Do you think that your own changes to your Eval have been moving in the direction of your U_ideal?
Vivek—I replied to your comment in appendix C of today’s follow-up post, Alignment allows imperfect decision-influences and doesn’t require robust grading.