Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.
With my current best formalization, the “objects” in the world are not necessarily localized in 3D space. Indeed, one of the main things which makes an abstraction “natural” is that the relevant information is redundantly represented in many places in the physical world.
“Daytime” is a good example: I can measure light intensity at lots of different places in my general area, at lots of different times, and find that they all strongly correlate. The information about light intensity is redundant across all those locations: if I measure high light intensity outside my house, then I’m pretty confident that a measurement taken outside the office at the same time will also have high intensity. The latent variable representing that redundant information (as a function of time) is what we call “daytime”.
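To make the redundancy picture concrete, here is a toy simulation. It is only a sketch under made-up assumptions, not the actual formalization: the sinusoidal daylight curve, the per-sensor gains, the Gaussian noise level, and every function name are hypothetical, and a plain average across sensors stands in for whatever a real latent-variable estimator would do.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(num_sensors=5, hours=24 * 7, noise=0.05, seed=0):
    """Hourly light readings at several locations driven by one shared signal."""
    rng = random.Random(seed)
    # Assumed ground truth: zero at night, a sine bump between 06:00 and 18:00.
    daylight = [max(0.0, math.sin(2 * math.pi * (h % 24 - 6) / 24))
                for h in range(hours)]
    # Each location sees the same signal, scaled differently and with local noise.
    gains = [rng.uniform(0.5, 1.5) for _ in range(num_sensors)]
    readings = [[g * d + rng.gauss(0.0, noise) for d in daylight] for g in gains]
    return daylight, readings


def estimate_latent(readings):
    """Crude stand-in for the latent variable: average the sensors at each hour."""
    hours = len(readings[0])
    return [sum(sensor[h] for sensor in readings) / len(readings)
            for h in range(hours)]


daylight, readings = simulate()
house, office = readings[0], readings[1]
# Estimate the latent without looking at the house or the office at all.
latent = estimate_latent(readings[2:])

print("house vs office:", round(pearson(house, office), 3))
print("latent vs house:", round(pearson(latent, house), 3))
print("latent vs true daylight:", round(pearson(latent, daylight), 3))
```

The point of leaving the house and office sensors out of the estimate is that the summary still predicts both of them well: the information is redundant across locations, so any sufficiently large subset of measurements recovers roughly the same variable.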
Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?
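This part of Proof Strategy 1 is a basically-accurate description of what I’m working towards:
“We try to come up with an unambiguous definition of what [things] are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.”
… it’s just not necessarily about objects localized in 3D space.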
Also, there are several possible paths, and they don’t all require unambiguous definitions of all the “things” in a human’s ontology. For instance, if corrigibility turns out to be a natural “thing”, that could short-circuit the need for a bunch of other rigorous concepts.
Thanks! One of my current sources of mild skepticism (which again you might talk me out of) is:
For capabilities reasons, the AGI will probably need to be able to add things to its world-model / ontology, including human-illegible things, and including things that don’t exist in the world but which the AGI imagines (and could potentially create).
If the AGI is entertaining a plan of changing the world in important ways (e.g. inventing and deploying mind-upload technology, editing its own code, etc.), it seems likely that the only good way of evaluating whether it’s a good plan would involve having opinions about features of the future world that the plan would bring about—as opposed to basing the evaluation purely on current-world-features of the plan, like the process by which it was made.
…And in that case, it’s not sufficient to have rigorous concepts / things that apply in our world, but rather we need to be able to pick those concepts / things out of any possible future world that the AGI might bring about.
I’m mildly skeptical that we can find / define such concepts / things, especially for things that we care about like “corrigibility”.
…And thus the story needs something along the lines of out-of-distribution edge-case detection and handling systems like Section 14.4.
I expect that there will be concepts the AI finds useful which humans don’t already understand. But these concepts should still be of the same type as human concepts—they’re still the same kind of natural abstraction. Analogy: a human who grew up in a desert tribe with little contact with the rest of the world may not have any concept of “snow”, but snow is still the kind-of-thing they’re capable of understanding if they’re ever exposed to it. When the AI uses concepts humans don’t already have, I expect them to be like that.
As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.
Let’s say that the concept of an Em had never occurred to me before, and now you knock on my door and tell me that there’s a thing called Ems, and you know how to make them but you need my permission, and now I have to decide whether or not I care about the well-being of Ems. What do I do? I dunno, I would think about the question in different ways, I would try to draw analogies to things I already knew about, maybe I would read some philosophy papers, and most of all I would be implicitly probing my own innate “caring” reaction(s) and seeing exactly what kinds of thoughts do or don’t trigger it.
Can we make an AGI that does all that? I say yes: we can build an AGI with human-like “innate drives” such that it has human-like moral intuitions, and then it applies those human-like intuitions in a human-like way when faced with new out-of-distribution situations. That’s what I call the “Social-Instinct AGI” research path, see Post #12.
But if we can do that, we’ve already arguably solved the whole AGI safety problem. I suspect you have something different in mind?
We don’t necessarily need the AGI itself to have human-like drives, intuitions, etc. It just needs to be able to model the human reasoning algorithm well enough to figure out what values humans assign to e.g. an em.
(I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)
I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.
So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word ‘subroutine’ on that code, and then make an ‘outer AI’ with the power to ‘query’ that ‘subroutine’.” From that perspective, I would be concerned that if the (so-called) subroutine never wanted to do anything bad or stupid, then the outer AI is redundant, and if the (so-called) subroutine did want to do something bad or stupid, then the outer AI may not be able to recognize and stop it.
Separately, shouldn’t “doing something catastrophically stupid” become progressively less of an issue as the AGI gets “smarter”? And insofar as caution / risk-aversion / etc. is a personality type, presumably we could put a healthy dose of it into our AGIs.
An example might be helpful here: consider the fusion power generator scenario. In that scenario, a human thinking about what they want arrives at the wrong answer, not because of uncertainty about their own values, but because they don’t think to ask the right questions about how the world works. That’s the sort of thing I have in mind.
In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation.
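I think I disagree with the claim that “the human is deciding whether or not she cares about Ems” is not a different set of mental activities from “the human is trying to make sense of a confusing topic” or “the human is trying to prove a theorem”. Maybe not exactly as worded (sure, maybe the “set of mental activities” involved in the reasoning overlap heavily). But I do expect (weakly, not confidently) that there’s a natural notion of human-value-generator which factors from the rest of human reasoning, and has a non-human-specific API (e.g. it interfaces with natural abstractions).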
As for the “subroutine” framing: it sounds to me like you’re imagining something which emulates human reasoning to a much greater extent than I’m imagining.
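A very rough sketch of the kind of factoring being gestured at, using the fusion generator scenario: the value function is held fixed, and only the component that predicts consequences is swapped out. Everything here is a hypothetical toy (the feature labels, the scores, the two predictor functions, and the idea of summing scores over outcomes are all assumptions for illustration), not a proposal for how such a value-generator would actually be specified.

```python
from typing import Callable, FrozenSet, List

# An outcome is described by a set of abstraction-level features (toy labels).
Features = FrozenSet[str]
Predictor = Callable[[str], List[Features]]
ValueFn = Callable[[Features], float]


def human_values(outcome: Features) -> float:
    """Toy 'value generator': scores an outcome given only its features."""
    score = 0.0
    if "cheap_clean_power" in outcome:
        score += 1.0
    if "garage_built_bombs" in outcome:
        score -= 100.0
    return score


def human_foresight(plan: str) -> List[Features]:
    """Only the consequence the human thought to ask about."""
    return [frozenset({"cheap_clean_power"})]


def stronger_world_model(plan: str) -> List[Features]:
    """A predictor that also surfaces a consequence nobody asked about."""
    return [frozenset({"cheap_clean_power"}), frozenset({"garage_built_bombs"})]


def evaluate(plan: str, predict: Predictor, values: ValueFn) -> float:
    """Score a plan: predict its outcomes, then apply the same value function."""
    return sum(values(outcome) for outcome in predict(plan))


plan = "publish garage fusion generator design"
naive_score = evaluate(plan, human_foresight, human_values)
informed_score = evaluate(plan, stronger_world_model, human_values)
print(naive_score, informed_score)
```

The values are identical in both calls; the verdict flips only because the second predictor asks a question about the world that the first did not. Whether real human values can be separated out this cleanly is exactly the point under dispute.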
It’s possible that I misunderstood what you were getting at in that post. I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn’t care to tell the human, because the human didn’t ask. But from what you’re saying now, I guess GPT-N has nothing to do with the story? You could have equally well written the post as “Suppose, a few years from now, I set about trying to design a cheap, simple fusion power generator—something I could build in my garage and use to power my house. After years of effort, I succeed….” Is that correct?
If so, I think that’s a problem that can be mitigated in mundane ways (e.g. mandatory inventor training courses spreading best practices for brainstorming unanticipated consequences, including red-teams, structured interviews, etc.), but can’t be completely solved by humans. But it also can’t be completely solved by any possible AI, because AIs aren’t and will never be omniscient, and hence may make mistakes or overlook things, just as humans can.
Maybe you’re thinking that we can make AIs that are less prone to human foibles like wishful thinking and intellectual laziness etc.? But I’m optimistic that we can make “social instinct” brain-like AGIs that are also unusually good at avoiding those things (after all, some humans are significantly better than others at avoiding those things, while still having normal-ish social instincts and moral intuitions).
Basically, yeah: GPT-N isn’t essential to the story.
The important point (for current purposes) is that, as the things-the-system-is-capable-of-doing-or-building scale up, we want the system’s ability to notice subtle problems to scale up with it. If the system is capable of designing complex machines way outside what humans know how to reason about, then we need similarly-superhuman reasoning about whether those machines will actually do what a human intends. “With great power comes great responsibility”—cheesy, but it fits.