2. Will such systems still be useful?
2.1. Can the system be both safe and useful?
2.1.1: A lot of my values and knowledge are implicit. Why should I trust my assistant to be able to learn my values well enough to assist me?
Imagine a question-answering system trained on all the data on Wikipedia that ends up with comprehensive, gears-level world-models, which it can use to synthesize existing information to answer novel questions about social interactions or about what our physical world is like. (Think Wolfram|Alpha, but much better.)
This system is something like a proto-AGI. We can easily restrict it (for example, by limiting how long it gets to reflect when it answers questions) so that we can train it to be corrigible while trusting that it's too limited to do anything dangerous that the overseer couldn't recognize as dangerous. We use such a restricted system to start off the iterated amplification and distillation process, and bootstrap it to get systems of arbitrarily high capabilities.
(See: Automated assistants)
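As a loose illustration of the kind of restriction described above (my own sketch, not anything from Paul's writeups), here is a question answerer whose "reflection" is capped by a recursion-depth limit and a wall-clock budget. The `base_model` interface (`answer_directly`, `propose_subquestions`, `synthesize`) is hypothetical:

```python
import time

class RestrictedAnswerer:
    """Sketch of a deliberately limited question answerer: it may reflect by
    spawning sub-questions, but only up to a fixed depth and time budget,
    after which it falls back to a single direct answer."""

    def __init__(self, base_model, max_depth=2, time_budget_s=5.0):
        self.base_model = base_model        # hypothetical Wikipedia-trained QA model
        self.max_depth = max_depth          # how many levels of sub-questions it may spawn
        self.time_budget_s = time_budget_s  # total wall-clock budget per top-level question

    def answer(self, question, depth=0, deadline=None):
        if deadline is None:
            deadline = time.monotonic() + self.time_budget_s
        # Out of depth or time: answer directly, with no further reflection.
        if depth >= self.max_depth or time.monotonic() > deadline:
            return self.base_model.answer_directly(question)
        # Otherwise, reflect once: pose sub-questions, answer each under the
        # same deadline, then synthesize the results into an answer.
        subquestions = self.base_model.propose_subquestions(question)
        subanswers = [self.answer(q, depth + 1, deadline) for q in subquestions]
        return self.base_model.synthesize(question, subanswers)
```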
2.1.2: OK, sure, but it’ll essentially still be an alien and get lots of minutiae about our values wrong.
How bad is it really if it gets minutiae wrong, as long as it doesn’t cause major catastrophes? Major catastrophes (like nuclear wars) are pretty obvious, and we would obviously disapprove of actions that lead us to catastrophe. So long as it learns to avoid those (which it will, if we give it the right training data), we’re fine.
Also keep in mind that we’re training it to be corrigible, which means it’ll be very cautious about what sorts of things we’d consider catastrophic, and try very hard to avoid them.
2.1.3: But it might make lots of subtle mistakes that add up to something catastrophic!
And so might we. Maybe there are some classes of subtle mistakes the AI will be more prone to than we are, but there are probably also classes of subtle mistakes we’ll be more prone to than the AI. We’re only shooting for our assistant to avoid trying to lead us to a catastrophic outcome.
(See: Techniques for optimizing worst-case performance.)
2.1.4: I’m really not sold that training it to avoid catastrophes and training it to be corrigible will be good enough.
This is actually more a capabilities question (is our system capable enough that, when it tries very hard to avoid catastrophes, it actually avoids them?) than an alignment question. A major open question in Paul's agenda is how we can formalize performance well enough to state actual worst-case guarantees.
(See: Two guarantees and Techniques for optimizing worst-case performance)
2.2. Universality
2.2.1: What sorts of cognition will our assistants be able to perform?
We should roughly expect them to think in ways that would be approved by an HCH (short for "humans consulting HCH"). To describe HCHs, let me start by describing a weak HCH:
Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine.
That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh…
I sometimes picture this as an infinite tree of humans-in-boxes, who can break down questions and pass them to other humans-in-boxes (who can break down those questions and pass them along to other humans-in-boxes, etc.) and get back answers instantaneously. A few remarks:
This formalism tries to capture some notion of "what would H think about some topic if H thought about it for an arbitrarily long time?" For example, H might make partial progress on some question, and then share this progress with some other H and ask it to make more progress, who might do the same.
A weak HCH could simulate the cognitive labor of an economy the size of the US economy. After all, a weak HCH can emulate a single human thinking for a long time, so it can emulate teams of humans thinking for a long time, and thus teams of teams of humans thinking for a long time, and so on. If you imagine a corporation as teams of teams of teams of humans performing cognitive labor, it follows that a weak HCH can emulate the output of an arbitrary corporation, and thus of arbitrary collections of corporations communicating with one another.
Many tasks that don’t intuitively seem like they can be broken down can in fact be fairly substantially broken down. For example, making progress on difficult math problems seems hard to break down. But you could break down progress on a math problem into something like (think for a while about possible angles of attack) + (try each angle of attack, and recurse on the new math problem). And (think for a while about possible angles of attack) can be reduced to (look at features of this problem and see if you’ve solved anything similar), which can be reduced to focusing on specific features, and so on.
Strong HCH, or just HCH, is a variant of weak HCH in which the agents-in-boxes are able to communicate with each other directly, and to read from and write to some shared external memory, in addition to being able to ask, answer, and break down questions. Note that they would be able to implement arbitrary Turing machines this way, and thus avoid any limits on cognition imposed by the structure of weak HCH. (A minimal code sketch of the weak-HCH recursion appears at the end of this answer.)
(Note: most people think “HCH” refers to “weak HCH”, but whenever Paul mentions HCHs, he now refers to strong HCHs.)
The exact relationship between HCH and the agents produced through iterated amplification and distillation is confusing and very commonly misunderstood:
HCHs should not be visualized as having humans in the box. They should be thought of as having some corrigible assistant inside the box, much like the question-answering system described in 2.1.1.
Throughout the iterated amplification and distillation process, there is never any agent whose cognition resembles an HCH of the corrigible assistant. In particular, agents produced via distillation are general RL agents with no HCH-like constraints on their cognition. The closest resemblance to HCH appears during amplification, when a superagent (formed out of copies of the agent being amplified) performs tasks by breaking them down and distributing them among the agent copies.
(As of the time of this writing, I am still confused about the sense in which the agent’s cognition is approved by an HCH, and what that means about the agent’s capabilities.)
(See: Humans consulting HCH and Strong HCH.)
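To make the weak-HCH recursion above concrete, here is a minimal sketch (my own illustration, not code from Paul's posts); `human_decompose` and `human_combine` are hypothetical stand-ins for the human's judgment:

```python
def weak_hch(question, human_decompose, human_combine, max_depth=10):
    """Sketch of weak HCH as a recursive procedure: a 'human' answers a
    question either directly or by breaking it into sub-questions, each of
    which is delegated to another copy of the same procedure.

    human_decompose(question) -> list of sub-questions ([] to answer directly)
    human_combine(question, subanswers) -> an answer (subanswers may be [])
    """
    if max_depth == 0:
        # The idealized tree is unbounded; a depth cap stands in for the
        # finite approximations used in practice.
        return human_combine(question, [])
    subquestions = human_decompose(question)
    subanswers = [
        weak_hch(q, human_decompose, human_combine, max_depth - 1)
        for q in subquestions
    ]
    return human_combine(question, subanswers)
```

Strong HCH would additionally let the boxes message one another and share an external read/write memory, rather than only passing sub-questions down and answers up the tree.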
2.2.2: Why should I think the HCH of some simple question-answering AI assistant can perform arbitrarily complex cognition?
All difficult and creative insights stem from chains of smaller and easier insights. So long as our first AI assistant is a universal reasoner (i.e., it can implement arbitrary Turing machines via reflection), it should be able to realize arbitrarily complex things if it reflects for long enough. For illustration, Paul thinks that chimps aren’t universal reasoners, and that most humans past some intelligence threshold are universal.
If this seems counterintuitive, I’d claim it’s because we have poor intuitions around what’s achievable with 2,000,000,000 years of reflection. For example, it might seem that an IQ 120 person, knowing no math beyond arithmetic, would simply be unable to prove Fermat’s last theorem given arbitrary amounts of time. But if you buy that:
An IQ 180 person could, in 2,000 years, prove Fermat’s last theorem knowing nothing but arithmetic (which seems feasible, given that most mathematical progress was made by people with IQs under 180)
An IQ 160 person could, in 100 years, make the intellectual progress an IQ 180 person could in 1 year
An IQ 140 person could, in 100 years, make the intellectual progress an IQ 160 person could in 1 year
An IQ 120 person could, in 100 years, make the intellectual progress an IQ 140 person could in 1 year
Then it follows that an IQ 120 person could prove Fermat’s last theorem in 2,000 × 100 × 100 × 100 = 2,000,000,000 years’ worth of reflection.
(See: Of humans and universality thresholds.)
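The conclusion is just the composition of the four slowdown factors above; a trivial check of the multiplication (illustration only):

```python
iq180_years = 2_000        # years of IQ-180 reflection to prove the theorem
slowdown_per_step = 100    # 100 years of the weaker reasoner per 1 year of the stronger
steps = 3                  # 180 -> 160 -> 140 -> 120

iq120_years = iq180_years * slowdown_per_step ** steps
print(iq120_years)         # 2000000000, i.e. 2,000,000,000 years
```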
2.2.3: Different reasoners can reason in very different ways and reach very different conclusions. Why should I expect my amplified assistant to reason anything like me, or reach conclusions that I’d have reached?
You shouldn’t expect it to reason anything like you, you shouldn’t expect it to reach the conclusions you’d reach, and you shouldn’t expect it to realize everything you’d consider obvious (just like you wouldn’t realize everything it would consider obvious). You should expect it to reason in ways you approve of, which should constrain its reasoning to be sensible and competent, as far as you can tell.
The goal isn’t to have an assistant that can think like you or realize everything you’d realize. The goal is to have an assistant who can think in ways that you consider safe and substantially helpful.
2.2.4: HCH seems to depend critically on being able to break down arbitrary tasks into subtasks. I don’t understand how you can break down tasks that are largely intuitive or perceptual, like playing Go very well, or recognizing images.
Go is actually fairly straightforward: an HCH can just perform an exponential tree search (see the sketch after these examples). Iterated amplification and distillation applied to Go is not actually that different from how AlphaZero trains to play Go.
Image recognition is harder, but to the extent that humans have clear concepts of visual features they can reference within images, the HCH should be able to focus on those features. The cat vs. dog debate in Geoffrey Irving’s approach to AI safety via debate gives some illustration of this.
Things get particularly tricky when humans are faced with a task they have little explicit knowledge about, like translating sentences between languages. Paul did mention something like “at some point, you’ll probably just have to stick with relying on some brute statistical regularity, and just use the heuristic that X commonly leads to Y, without being able to break it down further”.
(See: Wei Dai’s comment on Can Corrigibility be Learned Safely, and Paul’s response to a different comment by Wei Dai on the topic.)
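As a loose illustration of the Go point above (my sketch; the `game` interface with `is_terminal`, `value`, `legal_moves`, and `play` is hypothetical, not a real Go engine API), an exponential tree search decomposes "how good is this position?" into one sub-question per legal move:

```python
def hch_game_value(game, state, depth):
    """Sketch of exhaustive (exponential) game-tree search phrased as
    HCH-style decomposition: the value of a position for the player to move
    is the best, over legal moves, of minus the opponent's value of the
    resulting position (negamax)."""
    if depth == 0 or game.is_terminal(state):
        return game.value(state)  # leaf sub-question: evaluate the position directly
    return max(
        -hch_game_value(game, game.play(state, move), depth - 1)
        for move in game.legal_moves(state)
    )
```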
2.2.5: What about tasks that require significant accumulation of knowledge? For example, how would the HCH of a human who doesn’t know calculus figure out how to build a rocket?
This sounds difficult for weak HCHs on their own to overcome, but possible for strong HCHs to overcome. The accumulated knowledge would be represented in the strong HCH’s shared external memory, and the humans would essentially act as “workers” implementing a higher-level cognitive system, much like ants in an ant colony. (I’m still somewhat confused about what the details of this would entail, and am interested in seeing a more fleshed-out implementation.)
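A very rough sketch of the "workers plus shared external memory" picture (my illustration; the `worker` interface and the use of a dict and queue are assumptions): accumulated knowledge, such as derived formulas, lives in a shared store that later workers read, so no individual worker needs to hold it all.

```python
from collections import deque

def run_strong_hch(root_task, worker, max_steps=10_000):
    """Sketch of the 'ant colony' picture of strong HCH: many short-lived
    workers pull tasks from a shared queue, read and write a shared memory
    of accumulated results, and may post new sub-tasks.

    worker(task, memory) -> (new_tasks, facts_to_record), one bounded unit
    of human-like work performed against the current shared store.
    """
    memory = {}                  # shared external memory (accumulated knowledge)
    tasks = deque([root_task])   # shared task queue
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()
        new_tasks, facts = worker(task, memory)
        memory.update(facts)     # e.g. {"rocket equation": "...derivation..."}
        tasks.extend(new_tasks)
    return memory
```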
2.2.6: It seems like this capacity to break tasks into subtasks is pretty subtle. How does the AI learn to do this? And how do we find human operators (besides Paul) who are capable of doing this?
Ought is gathering empirical data about task decomposition. If that proves successful, Ought will have numerous publicly available examples of humans breaking down tasks.