Paul—how widely do you want this shared?
IAFF-User-111
The “benign induction problem” link is broken.
I agree it’s not a complete solution, but it might be a good path towards creating a task-AI, which is a potentially important unsolved sub-problem.
I spoke with Huw about this idea. I was thinking along similar lines at some point, but only for “safe-shutdown”, e.g. if you had a self-driving car that anticipated encountering a dangerous situation and wanted to either:
pull over immediately
cede control to a human operator
It seems intuitive to give it a shutdown policy that triggers in such cases and aims to minimize a combined objective of time-to-shutdown and risk-of-shutdown. (Of course, this doesn’t deal with interrupting the agent, à la Armstrong and Orseau.)
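To make the combined objective concrete, here is a minimal sketch. It assumes the agent already has estimates of time-to-shutdown and risk for each option. The `ShutdownOption` class, the `risk_weight` trade-off parameter, and all the numbers are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownOption:
    name: str
    time_to_shutdown: float  # hypothetical estimate: seconds until fully stopped
    risk: float              # hypothetical estimate: probability of harm while stopping


def shutdown_cost(option: ShutdownOption, risk_weight: float) -> float:
    """Combined objective: time-to-shutdown plus weighted risk-of-shutdown."""
    return option.time_to_shutdown + risk_weight * option.risk


def choose_shutdown(options, risk_weight: float = 1000.0) -> ShutdownOption:
    """Pick the option with the lowest combined cost."""
    return min(options, key=lambda o: shutdown_cost(o, risk_weight))


# Illustrative numbers only; nothing here comes from a real system.
options = [
    ShutdownOption("pull_over_immediately", time_to_shutdown=5.0, risk=0.02),
    ShutdownOption("cede_control_to_human", time_to_shutdown=12.0, risk=0.005),
]
best = choose_shutdown(options)
```

Which option wins depends entirely on `risk_weight`, so choosing that trade-off is part of specifying the policy.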
Huw pointed out that a similar strategy can be used for any “genie”-style goal (i.e. you want an agent to do one thing as efficiently as possible, and then shut down until you give it another command), which made me substantially more interested in it.
This seems similar in spirit to giving your agent a short horizon, but now you also have regular terminations, by default, which has some extra pros and cons.
I reason as follows:
Omega inspires belief only after the agent encounters Omega.
According to UDT, the agent should not update its policy based on this encounter; it should simply follow that policy.
Thus the agent should act according to whatever the best policy is, according to its original (e.g. universal) prior from before it encountered Omega (or indeed learned anything about the world).
I think either:
the agent does update, in which case, why not update on the result of the coin-flip? or
the agent doesn’t update, in which case, what matters is simply the optimal policy given the original prior.
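As a toy illustration of the "doesn't update" branch, the sketch below scores two fixed policies against a prior over worlds, before any encounter. Everything in it is an assumption made for illustration: the three worlds, the mirror-image "anti-Omega" that rewards refusers, the payoffs, and the prior weights.

```python
P_HEADS = 0.5
REWARD = 10_000.0  # hypothetical payout on heads
COST = 100.0       # hypothetical payment requested on tails


def expected_utility(policy: str, prior: dict) -> float:
    """Prior-expected utility of a fixed policy ("pay" or "refuse")."""
    pays = policy == "pay"
    # Omega world: heads rewards agents whose policy is to pay; tails charges them.
    omega_value = P_HEADS * (REWARD if pays else 0.0) - (1 - P_HEADS) * (COST if pays else 0.0)
    # Assumed mirror-image world: heads rewards agents whose policy is to refuse.
    anti_value = P_HEADS * (0.0 if pays else REWARD)
    # In the "neither" world the policy is never triggered, so it contributes 0.
    return prior["omega"] * omega_value + prior["anti_omega"] * anti_value


def best_policy(prior: dict) -> str:
    """Policy selection under the original prior, with no updating on the encounter."""
    return max(["pay", "refuse"], key=lambda p: expected_utility(p, prior))


symmetric_prior = {"omega": 0.01, "anti_omega": 0.01, "neither": 0.98}
omega_heavy_prior = {"omega": 0.02, "anti_omega": 0.01, "neither": 0.97}
```

In this toy, paying is optimal only when the prior weights Omega-like worlds sufficiently more than their mirror images. The answer is fixed by the prior, not by the encounter.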
I agree… if there are specific things you don’t want to be able to do / predict, then you can do something very similar to the cited “Censoring Representations” paper.
But if you want to censor all “out-of-domain” knowledge, I don’t see a good way of doing it.
This seems only loosely related to my OP.
But it is quite interesting… so you’re proposing that we can make safe AIs by, e.g. giving them a prior which puts 0 probability mass on worlds where dangerous instrumental goals are valuable. The simplest way would be to make the agent believe that there is no past / future (thus giving us a more “rational” contextual bandit algorithm than we would get by just setting a horizon of 0). However, Mathieu Roy suggested to me that acausal trade might still emerge, and I think I agree based on open-source prisoner’s dilemma.
Anyways, I think that’s a promising avenue to investigate.
Having a good model of the world seems like a necessary condition for an AI to pose a significant Xrisk.
OK that makes sense, thanks. This is what I suspected, but I was surprised that so many people are saying that UDT gets mugged without stipulating this; it made it seem like I was missing something.
Playing devil’s advocate:
P(mugger) and P(anti-mugger) aren’t the only relevant quantities IRL
I don’t think we know nearly enough to have a good idea of what policy UDT would choose for a given prior. This leads me to doubt the usefulness of UDT.
It’s not the same (but similar), because my proposal is just about learning a model of impact, and has nothing to do with the agent’s utility function.
You could use the learned impact function to help measure (and penalize) impact, however.
Yes, as Owen points out, there are general problems with reduced impact that apply to this idea, i.e. measuring long-term impacts.
It was mostly a gut feeling when I posted, but let me try to articulate a few:
- It relies on having a good representation. Small problems with the representation might make it unworkable. Learning a good enough representation and verifying that you’ve done so doesn’t seem very feasible. Impact may be missed if the representation doesn’t properly capture unobserved things and long-term dependencies. Things like the creation of sub-agents seem likely to crop up in subtle, hard-to-learn ways.
- I haven’t looked into it, but ATM I have no theory about when this scheme could be expected to recover the “correct” model (I don’t even know how that would be defined… I’m trying to “learn” my way around the problem :P)
To put #1 another way, I’m not sure that I’ve gained anything compared with proposals to penalize impact in the input space, or some learned representation space (with the learning not directed towards discovering impact).
On the other hand, I was inspired to consider this idea when thinking about Yoshua’s proposal about causal disentangling mentioned at the end of his Asilomar talk here: https://www.youtube.com/watch?v=ZHYXp3gJCaI. This (and maybe some other similar work, e.g. on empowerment) seems to provide a way to direct an agent’s learning towards maximizing its influence, which might help… although having an agent learn based on maximizing its influence seems like a bad idea… but I guess you might be able to then add a conflicting objective (like a regularizer) to actually limit the impact...
So then you’d end up with some sort of adversarial-ish set-up, where the agent is trying to both:
maximize potential impact (i.e. by understanding its ability to influence the world)
minimize actual impact (i.e. by refraining from taking actions which turn out (eventually) to have a large impact).
Having just finished typing this, I feel more optimistic about this last proposal than the original idea :D We want an agent to learn about how to maximize its impact in order to avoid doing so.
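A very rough sketch of the "minimize actual impact" half is below. It assumes the agent has already learned a per-action impact estimate, which is the hard part and is not shown. The action names, the numbers, and the `impact_penalty` regularizer weight are all hypothetical.

```python
def select_action(task_reward: dict, predicted_impact: dict, impact_penalty: float = 1.0) -> str:
    """Choose the action maximizing task reward minus a penalty on predicted impact."""
    return max(
        task_reward,
        key=lambda a: task_reward[a] - impact_penalty * predicted_impact[a],
    )


# Hypothetical estimates; in the proposal, predicted_impact would come from a
# model the agent learned by studying how much it *could* influence the world.
task_reward = {"noop": 0.0, "fetch_coffee": 1.0, "build_subagent": 1.5}
predicted_impact = {"noop": 0.0, "fetch_coffee": 0.2, "build_subagent": 5.0}

chosen = select_action(task_reward, predicted_impact, impact_penalty=1.0)
```

With no penalty, the highest-impact action wins; with a very large penalty, the agent does nothing. The regularizer weight is what sets the tension between the two objectives.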
(How) can an agent confidently predict its potential impact without trying potentially impactful actions?
I think it certainly can, because humans can. We use a powerful predictive model of the world to do this. … and that’s all I have to say ATM.
Thanks, I think I understand that part of the argument now. But I don’t understand how it relates to:
“10. We should expect simple reasoning rules to correctly generalize even for non-learning problems.”
^Is that supposed to be a good thing or a bad thing? “Should expect” as in we want to find rules that do this, or as in rules will probably do this?
I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.
These properties also seem sufficient for a treacherous turn (in an unaligned AI).
Points 5-9 seem to basically be saying: “We should work on understanding principles of intelligence so that we can make sure that AIs are thinking the same way as humans do; currently we lack this level of understanding”.
I don’t really understand point 10, especially this part:
“They would most likely generalize in an unaligned way, since the reasoning rules would likely be contained in some sub-agent (e.g. consider how Earth interpreted as an “agent” only got to the moon by going through reasoning rules implemented by humans, who have random-ish values; Paul’s post on the universal prior also demonstrates this).”
I really agree with #2 (and I think with #1, as well, but I’m not as sure I understand your point there).
I’ve been trying to convince people that there will be strong trade-offs between safety and performance, and have been surprised that this doesn’t seem obvious to most… but I haven’t really considered that “efficient aligned AIs almost certainly exist as points in mindspace”. In fact I’m not sure I agree 100% (basically because of “Moloch” (http://slatestarcodex.com/2014/07/30/meditations-on-moloch/)).
I think “trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed” remains perhaps the most important thing to do; do you have anything in particular in mind? Personally, I tend to think that we ought to address the coordination problem head-on and attempt a solution before AGI really “takes off”.
I don’t see this as being the case. As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing (until it’s too late and we have a treacherous turn).
It looks to me like Wei Dai shares my views on “safety-performance trade-offs” (grep it here: http://graphitepublications.com/the-beginning-of-the-end-or-the-end-of-beginning-what-happens-when-ai-takes-over/).
I’d paraphrase what he’s said as:
“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”
Which I emphatically agree with.
So after talking w/Stuart, I guess what he means by “humans learning from the AI’s actions” is that what humans’ beliefs about U converge to actually changes (for the better). I’m not sure if that’s really desirable, ATM.
On a separate note, my proposal has the practical issue that this agent only views its own potential influence on u* as undesirable (and not other agents’). So I think ultimately we want a richer set of counterfactuals, including, e.g. that humans continue to exist indefinitely (otherwise P_H_t becomes undefined when humanity is extinct).
Abstractly, I think of this as adding a utility node, U, with no parents, and having the agent try to maximize the expected value of U.
I think there are some implicit assumptions (which seem reasonable for many situations, prima facie) about the agent’s ability to learn about U via some observations when taking null actions (i.e. A and U share some descendant(s), D, and A knows something about P(D | U, A=null)).
RE: the last bit, it seems like you can define learning from manipulating in a straightforward way similar to what is proposed here. The intuition is that the humans’ belief about U should be collapsing around a point, u* (in the absence of interference by the AI), and the AI helps learning if it accelerates this process. If this is literally true, then we can just say that learning is accelerated (at time step t) if the probability H assigns to u* is higher given the agent’s action a than it would be given the null action, i.e.
P_H_t(u* | A_0 = a) > P_H_t(u* | A_0 = A_1 = … = null).
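Here is a one-step toy version of that criterion, as a sketch under strong assumptions. H is modeled as a Bayesian over two candidate utility functions, observations are drawn as if u* were true, and only the first action differs from null. The likelihood table and the action name `demo` are hypothetical.

```python
def posterior(prior, likelihood, action, obs, target):
    """H's posterior probability of `target` after seeing `obs` (Bayes' rule)."""
    joint = {u: prior[u] * likelihood[(u, action)][obs] for u in prior}
    return joint[target] / sum(joint.values())


def expected_belief(prior, likelihood, action, target):
    """Expected posterior mass on `target`, with observations drawn as if `target` is true."""
    obs_dist = likelihood[(target, action)]
    return sum(
        p * posterior(prior, likelihood, action, obs, target)
        for obs, p in obs_dist.items()
        if p > 0
    )


def accelerates_learning(prior, likelihood, action, target, null_action="null"):
    """True if H ends up assigning more probability to `target` than under the null action."""
    return expected_belief(prior, likelihood, action, target) > expected_belief(
        prior, likelihood, null_action, target
    )


# Hypothetical model: P(observation | U, A) for two candidate utility functions.
prior = {"u_star": 0.5, "u_other": 0.5}
likelihood = {
    ("u_star", "null"): {"good": 0.6, "bad": 0.4},
    ("u_other", "null"): {"good": 0.4, "bad": 0.6},
    ("u_star", "demo"): {"good": 0.9, "bad": 0.1},
    ("u_other", "demo"): {"good": 0.1, "bad": 0.9},
}
```

Here `demo` makes the observation more informative about U, so H's expected belief in u* rises faster than under the null action. The toy does not distinguish this from an action that merely pushes H towards u* without providing evidence.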
What do you mean by “in full generality instead of the partial version attained by policy selection”?