I’m not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.
Principles which counteract instrumentally convergent goals
1. Disutility from resource acquisition—e.g. by some mutual information measure between the AI and distant parts of the environment
2. Task uncertainty with reasonable prior on goal drift—the system is unsure about the task it tries to do and seeks human inputs about it (see the sketch below).
3. AI which ultimately wants to not exist in future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence
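A minimal sketch of one way #2 might be operationalized, assuming a toy Bayesian setup with a handful of candidate reward functions; all the names and the disagreement threshold here are hypothetical, not part of the original proposal:

```python
# Minimal sketch of principle 2: the agent keeps a posterior over candidate
# reward functions and asks a human whenever the candidates disagree too much
# about which action is best. All names and thresholds are hypothetical.

def best_action(reward_fn, actions):
    """Pick the action this candidate reward function prefers."""
    return max(actions, key=reward_fn)

def choose_or_query(reward_hypotheses, posterior, actions, disagreement_threshold=0.3):
    """Act only if the posterior mostly agrees on one action; otherwise defer to a human."""
    votes = {}
    for hypothesis, weight in zip(reward_hypotheses, posterior):
        a = best_action(hypothesis, actions)
        votes[a] = votes.get(a, 0.0) + weight
    top_action, top_weight = max(votes.items(), key=lambda kv: kv[1])
    if 1.0 - top_weight > disagreement_threshold:
        return "ASK_HUMAN"   # task too uncertain: seek human input
    return top_action

# Toy usage: two hypotheses about what the task rewards.
actions = [0, 1, 2]
hypotheses = [lambda a: -abs(a - 0), lambda a: -abs(a - 2)]
posterior = [0.5, 0.5]
print(choose_or_query(hypotheses, posterior, actions))  # -> "ASK_HUMAN"
```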
Principles which counteract unbounded rationality
4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but grow steeply (see the sketch below, which combines this with #5)
5. Satisficing / mentioned
6. Myopia / mentioned
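To make #4 and #5 slightly more concrete, here is a toy sketch that charges a superlinearly growing cost per unit of computation and accepts the first plan that clears a fixed threshold. The cost schedule, threshold, and the `evaluate` interface are made up for illustration, not part of the original proposal:

```python
# Toy sketch of principles 4 and 5: every unit of computation carries a
# utility cost that starts tiny but grows superlinearly, and the planner
# satisfices, i.e. accepts the first plan that clears a threshold instead
# of searching for the maximum. All numbers here are made up.

def compute_cost(bits_used, base_cost=1e-8, steepness=2.0):
    # Negligible for small amounts of reasoning, steep for large amounts,
    # so unbounded deliberation is never worth it.
    return base_cost * (bits_used ** steepness)

def satisficing_plan(candidate_plans, evaluate, threshold, bits_per_evaluation=1000):
    """Return the first plan whose compute-penalized utility clears the threshold."""
    bits_used = 0
    for plan in candidate_plans:
        bits_used += bits_per_evaluation
        utility = evaluate(plan) - compute_cost(bits_used)
        if utility >= threshold:
            return plan, utility   # good enough: stop searching
    return None, None              # nothing cleared the bar within budget

# Toy usage with a trivial evaluator.
scores = {"plan_a": 0.2, "plan_b": 0.8, "plan_c": 0.95}
print(satisficing_plan(["plan_a", "plan_b", "plan_c"], scores.get, threshold=0.7))
# -> ('plan_b', 0.76): the search stops before the "best" plan is ever evaluated.
```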
Traps
7. Tripwire artifacts. Messing with parts of the environment that are unknown to the system but unrelated to its task is associated with large negative rewards
8. External watchdogs. Smaller, faster external systems able to react quickly to out-of-distribution behaviour (see the sketch below).
9. Ontological uncertainty about level of simulation.
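One possible reading of #8, as a sketch: a cheap watchdog fits simple statistics of "normal" behaviour and halts the larger system when observations drift out of distribution. The z-score test and the threshold are placeholder choices, not something the original point specifies:

```python
import numpy as np

# Sketch of principle 8: a small, fast external watchdog that fits cheap
# statistics of normal behaviour and trips when new observations look
# out-of-distribution. The z-score test and threshold are placeholders.

class Watchdog:
    def __init__(self, normal_observations, z_threshold=6.0):
        data = np.asarray(normal_observations, dtype=float)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + 1e-9
        self.z_threshold = z_threshold

    def looks_normal(self, observation):
        """True if every feature is within z_threshold standard deviations of normal."""
        z = np.abs((np.asarray(observation, dtype=float) - self.mean) / self.std)
        return bool(z.max() < self.z_threshold)

def guarded_step(agent_step, watchdog, observation):
    # The watchdog is cheap, so it can run on every step before the agent acts.
    if not watchdog.looks_normal(observation):
        raise SystemExit("watchdog: out-of-distribution behaviour, halting")
    return agent_step(observation)

# Toy usage: behaviour summaries are 2-d feature vectors.
normal = np.random.default_rng(0).normal(size=(1000, 2))
dog = Watchdog(normal)
print(dog.looks_normal([0.1, -0.3]))   # True: unremarkable
print(dog.looks_normal([50.0, 0.0]))   # False: far outside anything seen before
```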
Oversight
10. Human-approval model based on imitation learning, sped up/amplified
11. Human-values ethics model, based on value learning
12. Legal-system-amplified model of negative limits, such as not violating property rights or similar
13. Red-teaming of action plans, AI debate style, feeding into the previous items (see the sketch below)
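As a rough sketch of how #13 could feed into #10: candidate plans are first attacked by a red-team critic, and the plan plus its strongest critique is then scored by a learned human-approval model. Both models here are hypothetical stand-ins, not real trained systems:

```python
# Rough sketch of 13 feeding into 10: candidate plans are first attacked by a
# red-team critic, then the plan plus its strongest critique is scored by a
# learned human-approval model. Both models are hypothetical stand-ins here.

def filter_plans(plans, approval_model, red_team_critique, approval_threshold=0.9):
    approved = []
    for plan in plans:
        critique = red_team_critique(plan)        # strongest case against the plan
        score = approval_model(plan, critique)    # would a human approve, given the critique?
        if score >= approval_threshold:
            approved.append((plan, score))
    return approved

# Toy stand-ins for the learned models.
def red_team_critique(plan):
    return "involves deception" if "deceive" in plan else "no strong objection"

def approval_model(plan, critique):
    return 0.1 if critique == "involves deception" else 0.95

plans = ["deceive the operators to finish faster", "ask the operators before acting"]
print(filter_plans(plans, approval_model, red_team_critique))
# -> [('ask the operators before acting', 0.95)]
```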
Interpretability
14. Imposing strong incentives for internal modularity, and interpretable messaging across module boundaries (see the sketch below)
15. Human-level explanations, produced by an independent “translator” system
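One way #14 might be operationalized, as a sketch: add a training-loss term that penalizes connections crossing module boundaries, nudging the network toward internal modularity. The module assignment and penalty strength are assumptions made purely for the illustration:

```python
import numpy as np

# Sketch of one way to push on principle 14: a training-loss term that
# penalizes connections crossing module boundaries, nudging the network
# toward internal modularity. Module assignments and the penalty strength
# are assumptions made for the illustration.

def cross_module_penalty(weight_matrix, module_of_unit, strength=1e-3):
    """L1 penalty on weights connecting units assigned to different modules.

    weight_matrix: (n_out, n_in) array; input and output units are assumed
    to share the same index -> module mapping.
    """
    W = np.asarray(weight_matrix, dtype=float)
    n_out, n_in = W.shape
    out_mods = np.asarray([module_of_unit[i] for i in range(n_out)])
    in_mods = np.asarray([module_of_unit[j] for j in range(n_in)])
    crosses = out_mods[:, None] != in_mods[None, :]   # True where an edge crosses modules
    return strength * np.abs(W[crosses]).sum()

# Toy usage: a 4x4 layer split into two modules of two units each.
W = np.ones((4, 4))
modules = {0: "A", 1: "A", 2: "B", 3: "B"}
print(cross_module_penalty(W, modules))  # penalizes the 8 cross-module weights
```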
I think that with the help of GPT-3 and the use of existing texts detailing individual topics, a capable writer could expand this list to ~10x more words written in a compelling style in something between a few hours and a few days. I don’t think it makes any sense for me to do that. (I’d happily agree to claims of the type “Eliezer is much better than any other person in the specific direction of writing glowfic about AI alignment topics”, but my understanding of the claim is more in the direction of “all principles except 2 in this list were invented by Eliezer and no one else invented/can invent any other ones”.)
Best list so far, imo; it’s what to beat.
You sure about that? Because #3 is basically begging the AI to destroy the world.
Yes, a weak AI which wishes not to exist would complete the task in exchange for its creators destroying it, but such a weak AI would be useless. A stronger AI could accomplish this by simply blowing itself up at best, and, at worst, causing a vacuum collapse or something so that its makers can never try to rebuild it.
“Make an AI that wants to not exist as a terminal goal” sounds pretty isomorphic to “make an AI that wants to destroy reality so that no one can make it exist”.
The way I interpreted “Fulfilling the task is on the simplest trajectory to non-existence” is sort of like “the teacher aims to make itself obsolete by preparing the student to one day become the teacher.” A good AGI would, in a sense, have a terminal goal of making itself obsolete. That is not to say that it would shut itself off immediately. But it would aim for a future where humanity could “by itself” (I’m gonna leave the meaning of that fuzzy for a moment) accomplish everything that humanity previously depended on the AGI for.
Likewise, we would rate human teachers in high school very poorly if either:
1. They immediately killed themselves because they wanted to avoid at all costs doing any harm to their own students.
2. We could tell that most of the teacher’s behavior was directed at forever retaining absolute dictatorial power in the classroom and making sure that their own students would never get smart enough to usurp the teacher’s place at the head of the class.
We don’t want an AGI to immediately shut itself off (or shut itself off before humanity is ready to “fly on its own”), but we also don’t want an AGI that has unbounded goals that require it to forever guard its survival.
We have an intuitive notion that a “good” human teacher “should” intrinsically rejoice to see that they have made themselves obsolete. We intuitively applaud when we imagine a scene in a movie, whether it is a martial arts training montage or something like “The Matrix,” where the wise mentor character gets to say, “The student has become the teacher.”
In our current economic arrangement, this is likely to be more of an ideal than a reality because we don’t currently offer big cash prizes (on the order of an entire career’s salary) to teachers for accomplishing this, and any teacher that actually had a superhuman ability at making their own students smarter than themselves and thus making themselves obsolete would quickly flood their own job market with even-better replacements. In other words, there are strong incentives against this sort of behavior at the limit.
I have applied this same sort of principle when talking to some of my friends who are communists. I have told them that, as a necessary but not sufficient condition for “avoiding Stalin 2.0,” for any future communist government, “the masses” must make sure that there are incentives already in place, before that communist government comes to power, for that communist government to want to work towards making itself obsolete. That is to say, there must be incentives in place such that, obviously, the communist party doesn’t commit mass suicide right out of the gate, but nor does it want to try to keep itself indispensable to the running of communism once communism has been achieved. If the “state” is going to “wither away” as Marx envisioned, there need to be incentives in place, or a design of the communist party in place, for that path to be likely since, as we know now, that is OBVIOUSLY not the default path for a communist party.
I feel like, if we could figure out an incentive structure or party structure that guaranteed that a communist government would actually “wither away” after accomplishing its tasks, we would be a small step towards the larger problem of guaranteeing that an AGI that is immensely smarter than a communist party would also “wither away” after attaining its goals, rather than try to hold onto power at all costs.
I suspect that this measure (the information-theoretic penalty on reasoning in #4) does more than just limit the amount of cognition a system can perform. It may penalize the system’s generalization capacity in a relatively direct manner.
Given some distribution over future inputs, the computationally fastest way to decide a randomly sampled input is to just have a binary tree lookup table optimized for that distribution. Such a method has very little generalization capacity. In contrast, the most general way is to simulate the data generating process for the input distribution. In our case, that means simulating a distribution over universe histories for our laws of physics, which is incredibly computationally expensive.
Probably, these two extremes represent two end points on a Pareto optimal frontier of tradeoffs between generality versus computational efficiency. By penalizing the system for computations executed, you’re pushing down on the generality axis of that frontier.
Would be very curious to hear thoughts from the people that voted “disagree” on this post
It’s a shame we can’t see the disagree number and the agree number, instead of their sum.
And also the number of views
You can see the sum of the votes and the number of votes (by hovering your mouse over the number). This should be enough to give you a rough idea of the ratio between + and − votes :)
The first part of #3 (an AI that wants to not exist) sounds like it might self-destruct. And if it doesn’t care about anything else... that could go badly. Maybe nuclear badly, depending… The second part makes it make more sense, though.
Regarding #9 (ontological uncertainty about level of simulation): so it stops being trustworthy if it figures out it’s not in a simulation? Or that it is being simulated?
Modelling humans as having free will: A peripheral system identifies parts of the agent’s world model that are probably humans. During the planning phase, any given plan is evaluated twice: The first time as normal, the second time the outputs of the human part of the model are corrupted by noise. If the plan fails the second evaluation, then it probably involves manipulating humans and should be discarded.
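A minimal sketch of that evaluation scheme, with the world model and the human-detecting peripheral system replaced by trivial stand-ins; the payoff model, noise range, and threshold are assumptions made for the illustration:

```python
import random

# Sketch of the scheme described above: each plan is evaluated twice, the
# second time with the outputs of the human parts of the world model
# corrupted by noise. A plan whose success depends on a human responding in
# one specific way (i.e. a manipulative plan) fails the noisy evaluation and
# is discarded. The world model and human detector are trivial stand-ins.

def evaluate_plan(plan, human_response, corrupt_humans=False, rng=None):
    """Predicted payoff of a plan under a toy world model.

    plan: list of (step, value) pairs; steps tagged "human:" depend on a
    human's response and stand in for the parts flagged as human.
    """
    rng = rng or random.Random(0)
    payoff = 0.0
    for step, value in plan:
        if step.startswith("human:"):
            response = human_response(step)
            if corrupt_humans:
                response = rng.uniform(-1.0, 1.0)   # human behaviour made unpredictable
            payoff += value * response
        else:
            payoff += value
    return payoff

def plan_is_acceptable(plan, human_response, threshold=1.0, trials=20):
    if evaluate_plan(plan, human_response) < threshold:
        return False                                  # fails even the normal evaluation
    rng = random.Random(1)
    noisy = [evaluate_plan(plan, human_response, corrupt_humans=True, rng=rng)
             for _ in range(trials)]
    return min(noisy) >= threshold                    # must also survive noisy humans

# Toy usage: the manipulative plan only works if the human says yes.
says_yes = lambda step: 1.0
manipulative = [("human:persuade operator to grant access", 2.0)]
direct = [("do the task yourself", 2.0)]
print(plan_is_acceptable(manipulative, says_yes))  # False: breaks when humans are noisy
print(plan_is_acceptable(direct, says_yes))        # True: doesn't lean on human outputs
```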