First question: how on Earth would we go about conducting a search through possible future universes, anyway? This thought experiment still feels too abstract to make my intuitions go click, in much the same way that Christiano’s original write-up of Indirect Normativity did. You simply can’t actually simulate or “acausally peek at” whole universes at a time, or even Earth-volumes in such. We don’t have the compute-power, and I don’t understand how I’m supposed to be seduced by a siren that can’t sing to me.
It seems to me that the greater danger is that a UFAI would simply market itself as an FAI as an instrumental goal and use various “siren and marketing” tactics to manipulate us into cleanly, quietly accepting our own extinction—because it could just be cheaper to manipulate people than to fight them, when you’re not yet capable of making grey goo but still want to kill all humans.
And if we want to talk about complex, nasty dangers, the likeliest one is probably just people jumping for the first thing that looks eutopian and, in the process, chucking out part of their value-set. People do that a lot; see every single so-called “utopian” movement ever invented.
EDIT: Also, I think it makes a good bit of sense to talk about “IC-maximizing” or “marketing worlds” using the plainer machine-learning terminology: overfitting. Overfitting is also a model of what happens when an attempted reinforcement learner or value-learner over non-solipsistic utility functions wireheads itself: the learner has come up with a hypothesis that matches the current data-set exactly (for instance, “pushing my own reward button is Good”) while diverging completely from the target function (human eutopia).
Avoiding overfitting is one very good reason why it’s better to build an FAI by knowing an epistemic procedure that leads to the target function rather than just filtering a large hypothesis space for what looks good.
Two main reasons for this: first, there is Christiano’s original write-up, which has this problem. Second, we may be in a situation where we ask an AI to simulate the consequences of its choice, glance at the result, and then approve or disapprove. That’s less a search problem and more the original siren-world problem, and we should be aware of it.
This sounds extremely counterintuitive. If I have an Oracle AI that I can trust to answer more-or-less verbal requests (defined as: any request or “program specification” too vague for me to actually formalize), why have I not simply asked it to learn, from a large corpus of cultural artifacts, the Idea of the Good, and then explain to me what it has learned (again, verbally)? If I cannot trust the Oracle AI, dear God, why am I having it explore potential eutopian future worlds for me?
Because I haven’t read Less Wrong? ^_^
This is another argument against using constrained but non-friendly AI to do stuff for us...
Colloquially, this concept is indeed very close to overfitting. But it’s not technically overfitting (“overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.”), and using the term brings in other connotations. For instance, it may be that the AI needs to use less data to seduce us than it would to produce a genuine eutopia. It’s more that it fits the wrong target function (having us approve its choice vs a “good” choice) rather than fitting it in an overfitted way.
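To make that distinction concrete, here is a toy sketch (Python; `true_goodness`, `approval`, and all the numbers are invented purely for illustration). The first fit chases noise in the right target, which is overfitting; the second closely fits a smooth but wrong target, which is the siren/marketing failure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: how good a candidate outcome actually is.
def true_goodness(x):
    return np.sin(x)

# Hypothetical proxy the optimizer is graded on: approval after a brief
# inspection, which diverges from goodness on "siren" inputs (x > 4).
def approval(x):
    return np.sin(x) + 2.0 * (x > 4.0)

xs = np.linspace(0, 6, 200)

# Overfitting: fit noisy samples of the TRUE target with too many degrees of
# freedom, so the model describes the noise rather than the relationship.
noisy = true_goodness(xs) + rng.normal(0.0, 0.3, xs.shape)
overfit = np.poly1d(np.polyfit(xs, noisy, 25))

# Wrong-target fitting: fit the proxy smoothly and well -- no noise-chasing,
# just optimization of the wrong thing.
proxy_fit = np.poly1d(np.polyfit(xs, approval(xs), 10))

rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
print("overfit   vs goodness:", rmse(overfit(xs), true_goodness(xs)))
print("proxy fit vs approval:", rmse(proxy_fit(xs), approval(xs)))
print("proxy fit vs goodness:", rmse(proxy_fit(xs), true_goodness(xs)))
```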
Thanks. My machine-learning course last semester didn’t properly emphasize the formal definition of overfitting, or perhaps I just didn’t study it hard enough.
What I do want to think about here is: is there a mathematical way to talk about what happens when a learning algorithm finds the wrong correlative or causative link among several different possible links between the data set and the target function? Such maths would be extremely helpful for advancing the probabilistic value-learning approach to FAI, as they would give us a way to talk about how we can interact with an agent’s beliefs about utility functions while also minimizing the chance/degree of wireheading.
That would be useful! A short search gives “bias” as the closest term, which isn’t very helpful.
Unfortunately “bias” in statistics is completely unrelated to what we’re aiming for here.
In ugly, muddy words, what we’re thinking is that we give the value-learning algorithm some sample of observations or world-states labelled as “good”, and possibly some as “bad”, where “good versus bad” might be any kind of indicator value (boolean, reinforcement score, whatever). The physical correlates of having given the algorithm a sample are guaranteed to apply to every single sample, but we want the algorithm to learn the underlying causal structure of why those correlates occurred (that is, to model our intentions as a VNM utility function) rather than learn the physical correlates themselves (because the latter leads to the agent wireheading itself).
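A minimal sketch of that identifiability problem (Python; the `world_good` and `button_pushed` features and the noise model are invented for illustration): the “model our intentions” hypothesis and the “model the reward channel” hypothesis explain the labelled samples equally well, so the data alone never moves the posterior between them.

```python
import numpy as np

# Toy model of the identifiability problem described above. Each labelled
# sample records two (hypothetical) features of the episode:
#   world_good    -- whether the world-state actually matched the operator's intent
#   button_pushed -- whether the physical reward channel fired
# During ordinary training these are perfectly correlated, which is the problem.
samples = [dict(world_good=True,  button_pushed=True,  label=1),
           dict(world_good=False, button_pushed=False, label=0)] * 20

# Two candidate explanations of where the labels come from.
hypotheses = {
    "intended-values":  lambda s: s["world_good"],     # labels track our intent
    "reward-correlate": lambda s: s["button_pushed"],  # labels track the button (wirehead hypothesis)
}

def log_likelihood(h, data, eps=1e-3):
    # Each label matches the hypothesis' prediction with probability 1 - eps.
    return sum(np.log(1 - eps) if h(s) == bool(s["label"]) else np.log(eps)
               for s in data)

log_ps = {name: log_likelihood(h, samples) for name, h in hypotheses.items()}
z = np.logaddexp(*log_ps.values())            # normalizer over a uniform prior
for name, lp in log_ps.items():
    print(f"P({name} | data) = {np.exp(lp - z):.3f}")

# Both hypotheses fit the labelled data exactly, so the posterior never moves
# off the prior: nothing in the samples distinguishes "model the operator's
# intent" from "model the reward channel".
```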
Here’s a thought: how would we build a learning algorithm that treats its samples/input as evidence of an optimization process occurring and attempts to learn the goal of that optimization process? Since physical correlates like reward buttons don’t actually behave as optimization processes themselves, this would ferret out the intentionality exhibited by the value-learner’s operator from the mere physical effects of that intentionality (provided we first conjecture that human intentions behave detectably like optimization).
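One very rough way to cash that out is Bayesian inverse planning (a toy sketch only; the goals, actions, and utility table are all invented): model the operator as a noisily rational optimizer over an unknown goal and infer a posterior over goals from the observed choices.

```python
import numpy as np

# Toy Bayesian inverse planning: three possible goals, four possible actions,
# and a (hypothetical) table of how well each action serves each goal.
goals = ["clean_room", "make_tea", "press_button"]
actions = ["fetch_broom", "boil_water", "walk_to_button", "sit_still"]
utility = np.array([            # utility[goal][action], invented numbers
    [1.0, 0.0, 0.0, 0.1],       # clean_room
    [0.0, 1.0, 0.0, 0.1],       # make_tea
    [0.0, 0.0, 1.0, 0.1],       # press_button
])

def action_probs(goal_idx, beta=4.0):
    # Noisy-rationality (softmax) model: the operator mostly picks actions
    # that serve their goal, with rationality parameter beta.
    u = beta * utility[goal_idx]
    p = np.exp(u - u.max())
    return p / p.sum()

def posterior_over_goals(observed_actions):
    log_post = np.log(np.ones(len(goals)) / len(goals))   # uniform prior over goals
    for a in observed_actions:
        ai = actions.index(a)
        log_post += np.log([action_probs(g)[ai] for g in range(len(goals))])
    log_post -= np.logaddexp.reduce(log_post)              # normalize
    return dict(zip(goals, np.exp(log_post)))

# Behaviour that looks like optimization for tea-making shifts the posterior
# toward that goal; an inert reward button generates no such behaviour at all.
print(posterior_over_goals(["boil_water", "boil_water", "fetch_broom"]))
```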
Has that whole “optimization process” and “intentional stance” bit from the LW Sequences been formalized enough for a learning treatment?
http://www.fungible.com/respect/index.html This looks to be very related to the idea of “Observe someone’s actions. Assume they are trying to accomplish something. Work out what they are trying to accomplish,” which seems to be what you are talking about.
That looks very similar to what I was writing about, though I’ve tried to be rather more formal/mathematical about it instead of coming up with ad-hoc notions of “human”, “behavior”, “perception”, “belief”, etc. I would want the learning algorithm to have uncertain/probabilistic beliefs about the learned utility function, and if I was going to reason about individual human minds I would rather just model those minds directly (as done in Indirect Normativity).
I will think about this idea...
The most obvious weakness is that such an algorithm could easily detect optimization processes that are acting on us, rather than us ourselves (or, if you believe such processes exist, you should believe this algorithm might locate them by mistake).
I’ve been thinking about this, and I haven’t found an immediately useful application of your idea, but I’ll keep it in the back of my mind… We haven’t found a good way of identifying agency in the abstract sense (“was cosmic phenomenon X caused by an agent, and if so, which one?” kind of stuff), so this might be a useful simpler problem...
Upon further research, it turns out that preference learning is a field within machine learning, so we can actually try to address this at a much more formal level. That would also get us another benefit: supervised learning algorithms don’t wirehead.
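As a concrete, entirely toy illustration of that supervised framing, here is a minimal pairwise preference learner in the Bradley-Terry / logistic style; the hidden `true_w` exists only to generate synthetic comparison labels:

```python
import numpy as np

# Minimal pairwise preference learner (Bradley-Terry / logistic model).
# Each training example says "outcome A was preferred to outcome B"; we fit a
# linear score function w . x so that preferred outcomes get higher scores.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])           # hidden "utility" used only to generate labels

def make_pair():
    a, b = rng.normal(size=3), rng.normal(size=3)
    return (a, b) if true_w @ a > true_w @ b else (b, a)   # stored as (winner, loser)

pairs = [make_pair() for _ in range(500)]

w = np.zeros(3)
learning_rate = 0.05
for _ in range(100):                           # logistic regression on score differences
    for winner, loser in pairs:
        d = winner - loser
        p = 1.0 / (1.0 + np.exp(-(w @ d)))     # modelled P(winner preferred | w)
        w += learning_rate * (1.0 - p) * d     # gradient of the log-likelihood

print("recovered direction:", w / np.linalg.norm(w))
print("true direction:     ", true_w / np.linalg.norm(true_w))
```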
Notably, this fits with our intuition that morality must be “taught” (i.e., via labelled data) to actual human children, lest they simply decide that the Good and the Right consist of eating a whole lot of marshmallows.
And if we put that together with a conservation heuristic for acting under moral uncertainty (say: optimize for expectedly moral expected utility, thus requiring higher moral certainty before taking more extreme actions), we might just start to make some headway on constructing utility functions that mathematically reflect what their operators actually intend for them to do.
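A toy illustration of that heuristic (all credences and utility numbers are invented): weight each candidate utility function by the agent’s credence in it and rank actions by the weighted sum, so an extreme action only wins when credence in the moral view that favours it is correspondingly high.

```python
# Toy "expectedly moral expected utility": candidate utility functions with
# credences attached (all numbers invented), evaluated over two candidate actions.
credences = {"u_hedonic": 0.5, "u_preference": 0.4, "u_weird": 0.1}

# Each action's utility under each candidate moral theory.
action_utilities = {
    "mild_improvement": {"u_hedonic": 1.0,   "u_preference": 1.0,   "u_weird": 1.0},
    "extreme_gamble":   {"u_hedonic": -50.0, "u_preference": -50.0, "u_weird": 100.0},
}

def expectedly_moral_eu(action):
    # Expected utility, weighted by credence that each utility function is the right one.
    return sum(credences[u] * action_utilities[action][u] for u in credences)

for action in action_utilities:
    print(action, expectedly_moral_eu(action))
# The extreme gamble only beats the mild improvement if credence in "u_weird"
# climbs much higher, i.e. extreme moves demand correspondingly high moral certainty.
```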
I also have an idea written down in my notebook, which I’ve been refining, that sort of extends from what Luke had written down here. Would it be worth a post?
Hi, there appears to be a lot of work on learning causal structure from data.
Keywords? I’ve looked through Wikipedia and the table of contents from my ML textbook, but I haven’t found the right term to research yet. “Learn a causal structure from the data and model the part of it that appears to narrow the future” would in fact be how to build a value-learner, but… yeah.
EDIT: One of my profs from undergrad published a paper last year about learning causal structure. The question is how useful it is for universal AI applications. Joshua Tenenbaum tackled it from the cog-sci angle in 2011, but again, I’m not sure how to transfer it over to the UAI angle. I was searching for “learning causal structure from data”, herp derp.
Who was this prof?
I was referring to David Jensen, who taught “Research Methods in Empirical Computer Science” my senior year.
Thanks.