I’m thinking of cases where the programmers tried to write a FAI but they did something slightly wrong.
I’m having trouble coming up with a realistic model of what that would look like. I’m also wondering why aspiring FAI designers didn’t bother to test-run their utility function before actually “running” it in a real optimization process.
Have you read Failed Utopia #4-2?
I have, but it’s running with the dramatic-but-unrealistic “genie model” of AI, in which you could simply command the machine, “Be a Friendly AI!” or “Be the CEV of humanity!”, and it would do it. In real life, verbal descriptions are mere shorthand for actual mental structures, and porting the mental structures necessary for even the slightest act of direct normativity from one mind-architecture to another is (I believe) actually harder than just using some form of indirect normativity.
(That doesn’t mean any form of indirect normativity will work rightly, but it does mean that Evil Genie AI is a generalization from fictional evidence.)
Hence my saying I have trouble coming up with a realistic model.
Because if you don’t construct a FAI but only construct a seed out of which a FAI will build itself, it’s not obvious that you’ll have the ability to do test runs.
Well, that sounds like a new area of AI safety engineering to explore, no? How to check your work before doing something potentially dangerous?
I believe that is MIRI’s stated purpose.
Quite so, which is why I support MIRI, even though their marketing leans much too heavily on fearmongering, in my opinion.
Even though I do understand why it does: Eliezer believes that back in the SIAI days he was dangerously close to actually building an AI before he realized it would destroy the human race. Fair enough that he fears what all the other People Like Eliezer might do, but without being able to see his AI designs from that period, the rest of us have no way to judge whether they would have destroyed the human race or just gone kaput like so many other supposed AGI designs. Private experience, however, does not serve as persuasive marketing material.
Perhaps it had implications that only became clear to a superintelligence?
Hmmm… Upon thinking it over in my spare brain-cycles for a few hours, I’d say the most likely failure mode of an attempted FAI is extrapolating from the wrong valuation machinery in humans. For instance, you could end up with a world full of things people want and like but don’t approve of. You would thus end up having a lot of fun while simultaneously knowing that everything about it is all wrong and that it’s never, ever going to stop.
Of course, that’s just one cell in a 2^3-cell grid, and that assumes Yvain’s model of human motivations is accurate enough that FAI designers actually tried to use it, and then landed in one of the badly wrong cells out of the 8 possibilities.
Within that model, “approving” is what we’re calling the motivational system that imposes moral limits on our behavior, so if you manage to combine wanting and/or liking with a definite +approving, you’ve got a solid shot at something people would consider moral. Ideally, I’d say Friendliness should shoot for +liking/+approving while letting wanting vary: an AI should do things people both like and approve of, without regard to whether those people would actually feel motivated enough to do them themselves.
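To make that grid concrete, here is a minimal, purely illustrative Python sketch (my own toy encoding, not anything from Yvain’s post or any actual FAI design) that enumerates the 2^3 = 8 wanting/liking/approving cells and marks which ones the +liking/+approving rule above would aim for; the “lots of fun but all wrong” failure mode corresponds to the +wanting/+liking/-approving cell, which the rule rejects:

    from itertools import product

    # Toy encoding only: treat Yvain's wanting/liking/approving as three booleans,
    # giving 2^3 = 8 cells. The hypothetical targeting rule discussed above accepts
    # any cell with +liking and +approving, and lets wanting vary freely.
    def target(wanting: bool, liking: bool, approving: bool) -> bool:
        return liking and approving

    for wanting, liking, approving in product([True, False], repeat=3):
        cell = "/".join(
            ("+" if flag else "-") + name
            for flag, name in [(wanting, "wanting"),
                               (liking, "liking"),
                               (approving, "approving")]
        )
        print(cell, "-> target" if target(wanting, liking, approving) else "-> avoid")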
Are we totally sure this is not what utopia initially feels like from the inside? Because I have to say, that sentence sounded kinda attractive for a second.
What kinds of weirdtopias are you imagining that would fulfill those criteria?
Because the ones that first sprang to mind for me (this might make an interesting exercise for people, actually) were all emphatically, well, wrong. Bad. Unethical. Evil… Could you give some examples?
I of course don’t speak for EY, but what I would mean if I made a similar comment is that I expect my experience of “I know that everything about this is all wrong” to be triggered by anything that’s radically different from what I was expecting and am accustomed to, whether or not it is bad, unethical, or evil, and even if I would endorse it (on sufficient reflection) more than any alternative.
Given that I expect my ideal utopia to be radically different from what I was expecting and am accustomed to (because, really, how likely is the opposite?), I should therefore expect to react that way to it initially.
Although I don’t usually include a description of the various models of the other speaker I’m juggling during conversation, that’s my current best guess. However, principle of charity and so forth.
(Plus Eliezer is very good at coming up with weirdtopias, probably better than I am.)
It’s what an ill-designed “utopia” might feel like. Note the link to Yvain’s posting: I’m referring to a “utopia” that basically consists of enforced heroin usage, or its equivalent. Surely you can come up with better things to do than that in five minutes’ thinking.