Great summary! Some quick notes:

While this is the agenda that Stuart talks most about, other work also happens at CHAI, especially on multiagent scenarios (whether multiple humans or multiple AI systems). See also the ARCHES agenda.
The reason I’m excited about CIRL is because it provides a formalization of assistance games in the sequential decision-making setting. According to me, the specific algorithm and technical results about pedagogy in the paper should be taken as examples of what the formalism allows you to do. They are interesting results, but certainly aren’t striking at the core of AI alignment. The Benefits of Assistance paper is a bit more clear on the more general benefits of assistance. I think most of CHAI has a similar view to me on this.
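(For concreteness, here is roughly the shape of the formalism as I remember it from the CIRL paper; the details may be slightly off. A CIRL / assistance game is a two-player Markov game with identical payoffs,

\[
M = \langle S,\ \{A^H, A^R\},\ T(s' \mid s, a^H, a^R),\ \{\Theta,\ R(s, a^H, a^R; \theta)\},\ P_0(s_0, \theta),\ \gamma \rangle,
\]

where the human observes the reward parameter \(\theta\) at the start of the game, the robot does not, and both players act to maximize the same expected discounted sum of rewards, so the robot has to learn about \(\theta\) from the human’s behaviour while acting.)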
All models are wrong; some are useful. I expect that there will always be misspecification in any kind of system that we build. So when I hear “X is misspecified, so it might misbehave”, I want to hear more about how exactly it will misbehave before I’m convinced I should care.
Nonetheless, I do agree that a strict agent assumption seems bad; most notably it seems hard to model the fact that human preferences change (unless you adopt a very expressive model of “preferences”, in which case the agent learns complicated conditionals like “Alice prefers sweet things in the decade 2000-2010 and healthy things in the decade 2010-2020” that may not generalize very well).
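(A toy sketch of that failure mode, with invented names, purely for illustration: a preference model that conditions on the decade reproduces the observed conditionals exactly, but has nothing principled to say about a decade it never saw.)

```python
# Toy illustration (all names invented): a "very expressive" preference model
# that conditions on the decade. It fits the observed data exactly, but has
# no principled way to extrapolate to a decade outside its experience.
from typing import Dict, Tuple

# (food category, decade) -> learned preference weight
learned_prefs: Dict[Tuple[str, int], float] = {
    ("sweet", 2000): 1.0,
    ("healthy", 2000): 0.2,
    ("sweet", 2010): 0.3,
    ("healthy", 2010): 1.0,
}

def preference(category: str, year: int) -> float:
    decade = (year // 10) * 10
    if (category, decade) not in learned_prefs:
        # Nothing constrains the model outside the decades it was fit on.
        raise KeyError(f"no learned preference for {category!r} in the {decade}s")
    return learned_prefs[(category, decade)]

print(preference("healthy", 2015))  # 1.0 -- matches the observed decade
print(preference("healthy", 2025))  # raises KeyError -- fails to generalize
```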
I find the rainforest example not very compelling—it seems to me that to the extent “help the rainforest” means anything to me, it’s because I can model the rainforest as an agent and figure out what it is “trying to do”, and then help it with that. More generally, it seems like “help X” or “assist X” only means something when you view X as pursuing some goal. You could interpret “help the rainforest” as “do the things the environmentalists want you to do”, but that seems to be about human desires, and not an “objective fact” about what it means to help a rainforest. (It does seem plausible to me that the specific mathematical formalism of optimizing a reward function would not be a good fit for the rainforest; that’s different from saying that you shouldn’t view the rainforest as an agent altogether.)
While this is the agenda that Stuart talks most about, other work also happens at CHAI
Yes good point—I’ll clarify and link to ARCHES.
The reason I’m excited about CIRL is because it provides a formalization of assistance games in the sequential decision-making setting … There should soon be a paper that more directly explains the case for the formalism
Yeah this is a helpful perspective, and great to hear re upcoming paper. I have definitely spoken to some folks that think of CHAI as the “cooperative inverse reinforcement learning lab” so I wanted to make the point that CIRL != CHAI.
All models are wrong; some are useful
Well, keep in mind that we’re using the agent model twice: once in our own understanding of the AI systems we build, and then a second time in the AI system’s understanding of what a human is. We can update the former as needed, but if we want the AI system to be able to update its understanding of what a human is, then we need to work out how to make that assumption updateable in the algorithms we deploy.
So when I hear “X is misspecified, so it might misbehave”, I want to hear more about how exactly it will misbehave before I’m convinced I should care.
Very fair request. I will hopefully be writing more on this topic in the specific case of the agent assumption soon.
More generally, it seems like “help X” or “assist X” only means something when you view X as pursuing some goal
Well, would you agree that it’s possible to help a country? A country seems pretty far away from being an agent, although perhaps it could be said to have goals. Yet it does seem possible to provide e.g. economic advice or military assistance to a country in a way that helps the country without simply helping each of the separate individuals.
How about helping some primitive organism, such as a jellyfish or amoeba? I guess you could impute goals onto such organisms...
How about helping a tree? It actually seems pretty straightforward to me how to help a tree (bring water and nutrients to it, clean off parasites from the bark, cut away any dead branches), but does an individual tree really have goals?
Now that I’ve read your post on optimization, I’d restate
More generally, it seems like “help X” or “assist X” only means something when you view X as pursuing some goal.
as
More generally, it seems like “help X” or “assist X” only means something when you view X as an optimizing system.
Which I guess was your point in the first place, that we should view things as optimizing systems and not agents. (Whereas when I hear “agent” I usually think of something like what you call an “optimizing system”.)
I think my main point is that “CHAI’s agenda depends strongly on an agent assumption” seems only true of the specific mathematical formalization that currently exists; I would not be surprised if the work could then be generalized to optimizing systems instead of agents / EU maximizers in particular.
I think my main point is that “CHAI’s agenda depends strongly on an agent assumption” seems only true of the specific mathematical formalization that currently exists; I would not be surprised if the work could then be generalized to optimizing systems instead of agents / EU maximizers in particular.
Ah, very interesting, yeah I agree this seems plausible, and also this is very encouraging to me!
In all of the “help X” examples you give, I do feel like it’s reasonable to do it via taking an intentional stance towards X, e.g. a tree by default takes in water + nutrients through its roots and produces fruit and seeds, in a way that wouldn’t happen “randomly”, and so “helping a tree” means “causing the tree to succeed more at taking in water + nutrients and producing fruit + seeds”.
In the case of a country, I think I would more say “whatever the goal of a country, since the country knows how to use money / military power, that will likely help with its goal, since money + power are instrumental subgoals”. This is mostly a shortcut; ideally I’d figure out what the country’s “goal” is and then assist with that, but that’s very difficult to do because a country is very complex.
I’m wondering if the Rainforest thing is somehow tied to some other disagreements (between you/me or you/MIRI-cluster).
Where, something like “the fact that it requires some interpretive labor to model the Rainforest as an agent in the first place” is related to why it seems hard to be helpful to humans, i.e. humans aren’t actually agents. You get an easier starting ground since we have the ability to write down goals and notice inconsistencies in them, but that’s not actually that reliable. We are not in fact agents, and we need to somehow build AIs that reliably seem good to us anyway.
(Curious if this feels relevant either to Rohin, or other “MIRI cluster” folk)
Well, yes, one way to help some living entity is to (1) interpret it as an agent, and then (2) act in service of the terminal goals of that agent. But that’s not the only way to be helpful. It may also be possible to directly be helpful to a living entity that is not an agent, without getting any agent concepts involved at all.
I definitely don’t know how to do this, but the route that avoids agent models entirely seems more plausible to me compared to working hard to interpret everything using some agent model that is often a really poor fit, and then helping on the basis of that poorly-fitting agent model.
I’m excited about inquiring deeply into what the heck “help” means. (All please reach out to me if you’d like to join a study seminar on this topic)
I share Alex’s intuition in a sibling comment:

How about helping a tree? It actually seems pretty straightforward to me how to help a tree
Yes, there is interpretive labor, and yes, things become fuzzy as situations become more and more extreme, but if you want to help an agent-ish thing it shouldn’t be too hard to add some value and not cause massive harm.
I expect MIRI-cluster to agree with this point—think of the sentiment “the AI knows what you want it to do, it just doesn’t care”. The difficulty isn’t in being competent enough to help humans, it’s in being motivated to help humans. (If you thought that we had to formally define everything and prove theorems w.r.t. the formal definitions or else we’re doomed, then you might think that the fact that humans aren’t clear agents poses a problem; that might be one way that MIRI-cluster and I disagree.)
I could imagine that for some specific designs for AI systems you could say that they would fail to help humans because they make a false assumption of too-much-agentiness. If the plan was “literally run an optimal strategy pair for an assistance game (CIRL)”, I think that would be a correct critique—most egregiously, CIRL assumes a fixed reward function, but humans change over time. But I don’t see why it would be true for the “default” intelligent AI system.
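(To make that critique a little more concrete: as I understand the CIRL setup, the reward parameter is drawn once and then held fixed,

\[
\theta \sim P_0, \qquad \theta_t = \theta \ \text{ for all } t,
\]

whereas a preference-change variant would need something like a drifting parameter, e.g.

\[
\theta_0 \sim P_0, \qquad \theta_{t+1} \sim D(\cdot \mid \theta_t, s_t, a^H_t),
\]

where \(D\) is a hypothetical drift distribution I’m inventing for illustration; once \(\theta\) drifts, it is no longer obvious what “assist the human” should even mean. This is just a sketch of the shape of the issue, not a worked-out formalism.)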