Nice post! You did an especially good job explaining equations—or at least, good enough for me to get what meant what :P
I also strongly agree with the claim that we should be thinking about aligning model-based reinforcement learning (or at least sorta-reinforcement-learning) agents.
If you've read Reducing Goodhart, you probably already know the rest of my take, but maybe I should write a simple post that just explains this one thing: we should model humans how they want to be modeled. Locating models of human-like objects in the world model that score highly on agentiness and explanatory power is a great place to start your imagination, but it doesn't model humans how they want to be modeled[1].
Modeling humans how they want to be modeled requires feeding information about inferred human preferences back into the model-of-humans selection process itself. It also means that there can be an important distinction between the AI’s most accurate model of the world (best for planning), and the AI’s most human-centric model of the world (best for conforming to human opinions about how our preferences should be modeled)[2].
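To make that feedback loop concrete, here's a minimal, purely illustrative sketch in Python. Nothing in it comes from the post; every function name is a hypothetical placeholder, and the stubs are exactly where the hard problems live:

```python
# A minimal, purely illustrative sketch of the feedback loop described above.
# Every name here is a hypothetical placeholder; the hard problems are hidden
# inside the stub functions.

def agentiness(model) -> float:
    """Stub: how agent-shaped this candidate model of the human is."""
    return model.get("agentiness", 0.0)

def explanatory_power(model, world_model) -> float:
    """Stub: how well this candidate explains human-shaped data in the world model."""
    return model.get("explanatory_power", 0.0)

def extract_preferences(model) -> dict:
    """Stub: read inferred preferences off the chosen model of the human."""
    return model.get("preferences", {})

def endorsement(preferences, model) -> float:
    """Stub: how much the inferred-so-far preferences endorse being modeled this way."""
    return preferences.get("endorsed_models", {}).get(model.get("name"), 0.0)

def select_human_model(world_model, candidates, preferences=None):
    """Pick a model of the human, optionally letting inferred preferences vote."""
    def score(model):
        s = agentiness(model) + explanatory_power(model, world_model)
        if preferences is not None:
            # The feedback step: inferred preferences flow back into model selection.
            s += endorsement(preferences, model)
        return s
    return max(candidates, key=score)

def model_humans_as_they_want(world_model, candidates, n_rounds=10):
    """Alternate model selection and preference extraction, hoping they stabilize."""
    preferences = None
    human_model = None
    for _ in range(n_rounds):
        human_model = select_human_model(world_model, candidates, preferences)
        preferences = extract_preferences(human_model)
    # Note: world_model (most accurate, best for planning) and human_model
    # (most human-centric) are deliberately kept separate here.
    return human_model, preferences
```

The only point the sketch tries to show is the separation: `world_model` stays the most accurate model (for planning), while `human_model` is whatever the loop settles on once inferred preferences get a say in how the human is modeled.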
The criteria that pick out models in this post (and its relatives, including PreDCA or the example in post 2 of Reducing Goodhart) are simple and tractable, but they’re not what I would pick if I had lots of time to interact with this AI, look at how it ends up modeling me, and build tools to help me target it at something I think really “gets me.”
Or you could frame this a different way and figure out how to have the human-preferred structure “live inside” the most accurate model of the world.
I love the idea of modeling humans how they want to be modeled. I think of it as something like a fuzzy pointer to human values that sharpens itself? But I'm confused about how to implement or formalize this process.
I hadn't seen your sequence; I'm a couple of posts in and it's great so far. Does it go into formalizing the process you describe?
Nope, sorry! I'm still at the stage of understanding where formalizing it would mean leaving in a bunch of parameters that hide hard problems (e.g., "a measure of how agent-shaped a model augmented with a rule for extracting preferences is," or "a function that compares plans of action in different ontologies"), so I didn't really bother.
But if you’re around Lightcone, hit me up and we can chat and write things on whiteboards.