So, IIUC, you are proposing we:

1. Literally just query GPT-N about whether [input_outcome] is good or bad.
2. Use this as a reward model with which to train an agentic AGI (which is maybe also a fine-tune of GPT-N, so they are hopefully working with the same knowledge/credences/concepts?).
3. Specifically, we are probably doing some sort of RL, so the agentic AGI is doing all sorts of tasks and the reward model is looking at the immediate results of those task-attempts and grading them. (See the sketch after this list for what I'm picturing.)
4. Assume we have some solution to inner alignment, fix the bugs, and maybe also fix value drift and some other stuff, and then boom, success!
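(To check that I'm picturing the same thing you are, here's a minimal Python sketch of steps 1-3. `query_gpt_n`, `agent_policy`, and `rl_update` are hypothetical placeholders I've made up, not real APIs; the point is just the shape of the loop, with GPT-N serving as the grader.)

```python
# Hypothetical sketch of steps 1-3; none of these names are real APIs.

def query_gpt_n(prompt: str) -> str:
    """Stand-in for an actual call to GPT-N; returns a text completion."""
    raise NotImplementedError("placeholder for a real GPT-N query")

def reward_model(outcome_description: str) -> float:
    """Step 1: literally just ask GPT-N whether the outcome is good or bad."""
    answer = query_gpt_n(
        "Is the following outcome good or bad?\n\n"
        f"{outcome_description}\n\nAnswer 'good' or 'bad':"
    )
    return 1.0 if answer.strip().lower().startswith("good") else 0.0

def train_agent(tasks, agent_policy, rl_update, num_steps=1000):
    """Steps 2-3: RL loop in which the agentic AGI attempts tasks and the
    reward model grades the immediate results of each attempt."""
    for step in range(num_steps):
        task = tasks[step % len(tasks)]
        outcome = agent_policy(task)              # the agent's task-attempt
        reward = reward_model(outcome)            # GPT-N-as-RM grades the result
        rl_update(agent_policy, task, outcome, reward)  # e.g. one policy-gradient step
```

Step 4 is then the assumption that an agent trained this way actually ends up optimizing for whatever the RM is pointing at, which is what my question below is about.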
Can you say more about what you mean by a solution to inner alignment? Do you mean: assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective “producing outcomes the RM classifies as good”? Or the objective “producing outcomes the RM would classify as good if it were operating normally”? (The difference reveals itself in cases of tampering with the RM.) Or the objective “producing outcomes that are good for humans, harmless, honest, etc.”?