Thanks for writing this up!
Something a little unclear to me: do you have a hypothesis that factored cognition is likely to be useful for humans, or is it more like a tool that is intended mainly for AIs? (I have a vague recollection of “the reason this is important is so that you can have very powerful AIs that are only active for a short time, able to contribute useful work but deactivated before they have a chance of evolving in unfriendly directions”)
I think that’s somewhat on the right track but I wouldn’t put it quite that way (based on my understanding).
I think it’s not that you have an AI that you don’t want to run for a long time. It’s that you’re trying to start with something simple that is aligned, and then scale it up via IDA. And the simple thing that you start with is a human. (Or an AI trained to mimic a human.) And it’s easier to get training data on what a human will do over the course of 10 minutes than over 10 years. (You can get a lot of 10-minute samples… 10-year samples not so much.)
So the questions are: 1) can you scale up something that is simple and aligned, in a way that preserves the alignment? And 2) will your scaled-up system be competitive with whatever else is out there—with systems that were trained some other way than by scaling up a weak-but-aligned system (and which may be unaligned)?
The Factored Cognition Hypothesis is about question #2, about the potential capabilities of a scaled up system (rather than about the question of whether it would stay aligned). The way IDA scales up agents is analogous to having a tree of exponentially many copies of the agent, each doing a little bit of work. So you want to make sure you can split up work and still be competitive with what else is out there. If you can’t, then there will be strong incentives to do things some other way, and that may result in AI systems that aren’t aligned.
Again, the point is not that you have an ambiguously-aligned AI system that you want to avoid running for a long time. It’s that you have a weak-but-aligned AI system. And you want to make it more powerful while preserving its alignment. So you have it cooperate with several copies of itself. (That’s amplification.) And then you train your next system to mimic that group of agents working together. (That’s distillation.) And then you repeat. (That’s iteration.)
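In case it helps to see the shape of that loop, here’s a toy sketch in Python. To be clear, this is my own illustration, not code from any of the actual proposals: the list-summing task, the two-item “work budget”, and the pass-through distill step are all stand-ins.

```python
# Toy sketch of the amplify/distill/iterate loop described above (my own
# illustration). The task is summing a list of numbers; the base agent
# stands in for "a human working on a short task" and can only handle
# two numbers at a time.

def base_agent(xs):
    """A[0]: imitates what a human can do in one short work session."""
    assert len(xs) <= 2, "too big for one short work session"
    return sum(xs)

def amplify(agent, xs):
    """Amplification: split the task once, hand each half to a copy of the
    agent, then have another copy combine the two partial answers."""
    if len(xs) <= 2:
        return agent(xs)
    mid = len(xs) // 2
    return agent([agent(xs[:mid]), agent(xs[mid:])])

def distill(amplified):
    """Distillation: train the next agent to imitate the amplified system.
    Real IDA would train a model here; this toy just passes it through."""
    return amplified

# Iteration: each round, the distilled agent becomes the unit that gets
# amplified, so the task size it can handle roughly doubles each time.
agent = base_agent
for _ in range(3):
    agent = distill(lambda xs, a=agent: amplify(a, xs))

print(agent(list(range(16))))  # 120 -- far beyond what A[0] could do alone
```

Obviously the interesting questions (does distillation preserve alignment, and how much capability survives the decomposition) are exactly the parts this toy elides.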
The hope is that by constructing your AI out of this virtual tree of human-mimicking AIs, it’s a little more grounded and constrained. And less likely to do something antithetical to human values than an AI you got by just giving your neural network some objective and telling it to go forth and optimize.
Note that all this is relevant in the context of prosaic AI alignment—aligning AI systems built out of something not-too-different from current ML. In particular, I believe Factored Cognition working well would be a prerequisite for most of the proposals here. (The version of amplification I described was imitative amplification, but I believe the FCH is relevant for the other forms of amplification too.)
That’s my understanding anyway. Would appreciate any corrections from others.
EDIT: On second thought, what you describe sounds a bit like myopia, which is a property that some people think these systems need to have, so your vague recollection may have been more on the right track than I was giving it credit for.
I don’t think this is the hope, see here for more. I think the hope is that the base unit is ‘cheap’ and ‘honest’.
I think the ‘honesty’ criterion there is quite similar to ‘myopia’, in that both of them are about “just doing what’s in front of you, instead of optimizing for side effects.” This has some desirable properties, in that a myopic system won’t be ‘out to get you’, but it also rules out some other desirable properties. As an aside, I think it might work out that no myopic system can be corrigible (which doesn’t restrict systems built out of myopic parts, as those systems are not necessarily myopic).
I agree with the rest of your comment that Factored Cognition is about question #2, of how much capability is left on the table by using assemblages of myopic parts.
Hmm, cheap and honest makes some sense, but I’m surprised to hear that the hope is not that the base unit is aligned, because that seems to clash with how I’ve seen this discussed before. For example, from Ajeya’s post summarizing IDA:
The motivating problem that IDA attempts to solve: if we are only able to align agents that narrowly replicate human behavior, how can we build an AGI that is both aligned and ultimately much more capable than the best humans?
Which suggests to me that the base units (which narrowly replicate human behavior) are expected to be aligned.
More:
Moreover, because in each of its individual decisions each copy of A[0] continues to act just as a human personal assistant would act, we can hope that Amplify(H, A[0]) preserves alignment.
...
Because we assumed Amplify(H, A[0]) was aligned, we can hope that A[1] is also aligned if it is trained using sufficiently narrow techniques which introduce no new behaviors.
Which comes out and explicitly says that we want the amplify step to preserve alignment. (Which only makes sense if the agent at the previous step was aligned.)
Is it possible that this is just a terminological issue, where aligned is actually being used to mean what you would call honest (and not whatever Vaniver_2018 thought aligned meant)?
Some evidence in favor of this, from Andreas’s Factored Cognition post:
As many people have pointed out, it could be difficult to become confident that a system produced through this sort of process is aligned—that is, that all its cognitive work is actually directed towards solving the tasks it is intended to help with.
That definition of alignment seems to be pretty much the same thing as your honesty criterion:
Now it seems that the real goal is closer to an ‘honesty criterion’; if you ask a question, all the computation in that unit will be devoted to answering the question, and all messages between units are passed where the operator can see them, in plain English.
If so, then I’m curious what the difference is. What did Vaniver_2018 think that being aligned meant, and how is that different from just being honest?
Vaniver_2018 thought ‘aligned’ meant something closer to “I was glad I ran the program” instead of “the program did what I told it to do” or “the program wasn’t deliberately out to get me.”
I… actually don’t know what myopia is supposed to mean in the AI context (I had previously commented that the post Defining Myopia doesn’t define myopia and am still kinda waiting on a more succinct definition)
Heh. I actually struggled to figure out which post to link there because I was looking for one that would provide a clear, canonical definition, and ended up just picking the tag page. Here are a couple definitions buried in those posts though:
We can think of a myopic agent as one that only considers how best to answer the single question that you give to it rather than considering any sort of long-term consequences
(from: Towards a mechanistic understanding of corrigibility)
I’ll define a myopic reinforcement learner as a reinforcement learning agent trained to maximise the reward received in the next timestep, i.e. with a discount rate of 0.
...
I should note that so far I’ve been talking about myopia as a property of a training process. This is in contrast to the cognitive property that an agent might possess, of not making decisions directly on the basis of their long-term consequences; an example of the latter is approval-directed agents.
(from: Arguments against myopic training)
So, a myopic agent is one that only considers the short-term consequences when deciding how to act. And a myopic learner is one that is only trained based on short-term feedback.
(And perhaps worth noting, in case it’s not obvious, I assume the name was chosen because myopia means short-sightedness, and these potential AIs are deliberately made to be short-sighted, s.t. they’re not making long-term, consequentialist plans.)
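If it helps, the “discount rate of 0” framing is easy to see in the return computation itself. A toy illustration of my own (not from either post):

```python
# Toy illustration (mine): the only difference between the myopic-learner
# objective and the usual RL objective is the discount rate on future rewards.

def discounted_return(rewards, gamma):
    """The quantity the learner is trained to maximise."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 10.0]  # small reward now, big reward two steps later

print(discounted_return(rewards, gamma=0.99))  # ~10.8: the later reward dominates
print(discounted_return(rewards, gamma=0.0))   # 1.0: only the next timestep counts
```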
My take on myopia is that it’s “shortsightedness” in the sense of only trying to do “local work”. If I ask you what two times two is, you say “four” because it’s locally true, rather than because you anticipate the consequences of different numbers, and say “five” because that will lead to a consequence you prefer. [Or you’re running a heuristic that approximates that anticipation.]
If you knew that everyone in a bureaucracy were just “doing their jobs”, that would give you a sort of transparency guarantee: you just need to follow the official flow of information to see what’s happening. No one will design a shadow bureaucracy or take over unless they’re explicitly asked to.
However, training doesn’t give you this by default; people in the bureaucracy are incentivized to make their individual departments better, to say what the boss wants to hear, to share gossip at the water cooler, and so on. One of the scenarios people consider is the case where you’re training an AI to solve some problem, and at some point it realizes it’s being trained to solve that problem and so starts performing as well as it can on that metric. In animal reinforcement training, people often talk about how you’re training the animal to perform tricks for rewards while, at the same time, the animal is training you to reward it! The situation is subtly different here, but the basic figure-ground inversion holds.
Gotcha.
(I think humans are unaligned, but I assume that objection has been brought up before. Though I can still imagine that 10-minute humans provide a better starting point than other competitive tools, and may be the least bad option.)
Unaligned with each other? Or… would you not consider yourself to be aligned with yourself?
(Btw, see my edit at the bottom of my comment above if you hadn’t noticed it.)
I think also unaligned with yourself?
Like, most humans, when given massive power over the universe, would probably accidentally destroy themselves, and possibly all of humanity along with them (Eliezer talks a bit about this in HPMOR in sections I don’t want to reference because of spoilers, and Wei Dai has talked about this a bit in a bunch of comments related to “the human alignment problem”). I think that maybe I could avoid doing that, but only because I am really mindful of the risk, and I don’t think me from 5 years ago would have been safe to drastically scale up, even with respect to just my own values.
I share habryka’s concerns re: “unaligned with yourself”, but I think I was missing (or had forgotten) that part of the idea here was that you’re using… an uploaded clone of yourself, so you’re at least more likely to be aligned with yourself even if, when scaled up, you’re not aligned with anyone else.
Not sure if you were just being poetic, but FWIW I believe the idea (in HCH, for example) is to use an ML system trained to produce the same answers that a human would produce, which is not strictly speaking an upload (unless the only way to imitate is actually to simulate in detail, s.t. the ML system ends up growing an upload inside it?).
Is it “a human” or “you specifically”?
If it’s “a human”, I’m back to “humans are unfriendly by default” territory.
[Edit: But I had in fact also not been tracking that it’s not a strict upload, it’s trained on human actions. I think I recall reading that earlier but had forgotten. I did leave the “…” in my summary because I wasn’t quite sure if upload was the right word, though. That all said, being merely trained on human actions, whether mine or someone else’s, I think makes it even more likely to be unfriendly than an upload.]
To get sufficient training data, it must surely be “a human” (in a generic, smushed-together, ‘modelling an ensemble of humans’ sense).