This is a neat idea! I’d be interested to hear why you don’t think it’s satisfying from a safety point of view, if you have thoughts on that.
Seems to me like there are a bunch of challenges. For example, you need extra structure on your space to add things or tell what’s small, and you really want to keep track of long-term impact, not just impact at the next time-step. The long-term one seems particularly thorny (for low-impact in general, not just for this).
Nevertheless, I think this idea looks promising enough to explore further. I would also like to hear David’s reasons.
It was mostly a gut feeling when I posted, but let me try and articulate a few:
1. It relies on having a good representation. Small problems with the representation might make it unworkable.
2. Learning a good enough representation and verifying that you’ve done so doesn’t seem very feasible. Impact may be missed if the representation doesn’t properly capture unobserved things and long-term dependencies. Things like the creation of sub-agents seem likely to crop up in subtle, hard-to-learn ways.
3. I haven’t looked into it, but ATM I have no theory about when this scheme could be expected to recover the “correct” model (I don’t even know how that would be defined… I’m trying to “learn” my way around the problem :P)
To put #1 another way, I’m not sure that I’ve gained anything compared with proposals to penalize impact in the input space, or some learned representation space (with the learning not directed towards discovering impact).
On the other hand, I was inspired to consider this idea when thinking about Yoshua’s proposal about causal disentangling mentioned at the end of his Asilomar talk here: https://www.youtube.com/watch?v=ZHYXp3gJCaI. This (and maybe some other similar work, e.g. on empowerment) seems to provide a way to direct an agent’s learning towards maximizing its influence, which might help… although having an agent learn based on maximizing its influence seems like a bad idea… but I guess you might then be able to add a conflicting objective (like a regularizer) to actually limit the impact...
So then you’d end up with some sort of adversarial-ish set-up, where the agent is trying to both:
maximize potential impact (i.e. by understanding its ability to influence the world)
minimize actual impact (i.e. by refraining from taking actions which turn out (eventually) to have a large impact).
Having just finished typing this, I feel more optimistic about this last proposal than the original idea :D
We want an agent to learn about how to maximize its impact in order to avoid doing so.
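To make the two-sided objective a bit more concrete, here is a minimal sketch under heavy assumptions. The action names, the effect and reward tables, and the `penalty_weight` knob are all hypothetical, and the "learning about potential impact" part is replaced by a fixed lookup table, so this only illustrates how the two terms pull against each other rather than how either would be learned.

```python
# Hypothetical toy world: each action shifts a scalar state by a known amount.
# In the proposal this knowledge would be learned; here it is simply given.
ACTION_EFFECTS = {"noop": 0.0, "nudge": 1.0, "shove": 10.0}

# Hypothetical task rewards: bigger interventions pay off more.
TASK_REWARD = {"noop": 0.0, "nudge": 1.0, "shove": 3.0}


def impact(action: str) -> float:
    """Predicted impact of an action: distance from the no-op outcome."""
    return abs(ACTION_EFFECTS[action] - ACTION_EFFECTS["noop"])


def potential_impact() -> float:
    """Largest impact the agent believes it could have (the 'maximize' side)."""
    return max(impact(a) for a in ACTION_EFFECTS)


def choose_action(penalty_weight: float) -> str:
    """Pick the action with the best task reward minus weighted impact
    (the 'minimize actual impact' side, acting like a regularizer)."""
    return max(
        ACTION_EFFECTS,
        key=lambda a: TASK_REWARD[a] - penalty_weight * impact(a),
    )
```

With no penalty the agent picks the most impactful action; as `penalty_weight` grows it falls back to smaller interventions and eventually to doing nothing, even though it still "knows" how large an impact it could have.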
(How) can an agent confidently predict its potential impact without trying potentially impactful actions?
I think it certainly can, because humans can. We use a powerful predictive model of the world to do this. … and that’s all I have to say ATM
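One way to picture "predicting impact without trying the action" is to compare imagined rollouts inside a predictive model. The sketch below assumes a hypothetical one-dimensional model, `toy_model`, and a hypothetical no-op action of `0.0`; neither comes from the discussion above. It also shows why a one-step measure can understate long-term impact.

```python
from typing import Callable, Sequence

Model = Callable[[float, float], float]


def toy_model(state: float, action: float) -> float:
    """Hypothetical world model: deviations from 0 grow over time,
    and the action nudges the state."""
    return 1.5 * state + action


def rollout(model: Model, state: float, actions: Sequence[float]) -> float:
    """Imagine a sequence of actions inside the model; nothing is executed."""
    for action in actions:
        state = model(state, action)
    return state


def predicted_impact(model: Model, state: float, action: float,
                     horizon: int, noop: float = 0.0) -> float:
    """Distance between the imagined future where the agent takes `action`
    once (then does nothing) and the future where it only does nothing."""
    with_action = rollout(model, state, [action] + [noop] * (horizon - 1))
    baseline = rollout(model, state, [noop] * horizon)
    return abs(with_action - baseline)


one_step = predicted_impact(toy_model, 0.0, 1.0, horizon=1)
long_term = predicted_impact(toy_model, 0.0, 1.0, horizon=5)
```

In this toy model the same action looks five times larger at a horizon of 5 than at a horizon of 1, and the estimate is only as good as the model, which is the representation worry raised earlier in the thread.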
Yes, as Owen points out, there are general problems with reduced impact that apply to this idea, such as measuring long-term impacts.