Commentary on “AGI Safety From First Principles” by Richard Ngo, September 2020


The intent of this document is to critically engage with the points and perspectives put forward in the post “AGI Safety From First Principles” by Richard Ngo. I will mostly quote from the post and comment on those quotes, often drawing on the larger context of the section. For this reason it is advisable to read Richard Ngo’s post before reading this commentary.

Please keep in mind that this is not supposed to be a balanced review. I agree with much of what the author says and focus on the points of disagreement or potential discussion.

_________________________________________________________________________

1 Introduction

“The key concern motivating technical AGI safety research is that we might build autonomous artificially intelligent agents which are much more intelligent than humans, and which pursue goals that conflict with our own.”


Agreed. While there are a number of concerns about machine learning involving bias, manipulation, moral decision making and/or automation, the key concern motivating technical AGI safety research should be the autonomous, misaligned agency of AGI.

_________________________________________________________________________

2 Superintelligence

2.1 Narrow and general intelligence

“The key point is that almost all of this evolutionary and childhood learning occurred on different tasks from the economically useful ones we perform as adults. We can perform well in the latter category only by reusing the cognitive skills and knowledge that we gained previously. In our case, we were fortunate that those cognitive skills were not too specific to tasks in the ancestral environment, but were rather very general skills.”

I believe this is an important point that we ought to engage with from an interdisciplinary perspective. For me, the notion of “exaptation” from evolutionary biology comes to mind as a relevant concept to apply both to our understanding of the development of human cognition and to the potential avenues (intended or not) for the development of artificial cognition.

_________________________________________________________________________

2.2 Paths to superintelligence

“Because of the ease and usefulness of duplicating an AGI, I think that collective AGIs should be our default expectation for how superintelligence will be deployed. “


I believe it makes sense to also consider the case of partial duplication. It seems to me that an AGI that duplicates itself might threaten its own objective, since it creates a more competent and powerful entity, the collective (even when it consists of just two copies), whose behaviour is partially unpredictable.
If the AGI solves that problem, however, then our analogies regarding a second species or cultural development will probably mislead us.

It also seems plausible that the AGI would be interested in a hierarchy of oversight, which might well include copying components of itself (containing the training data) and providing those to smaller sub-systems with more dedicated tasks; you don’t need a full-fledged superintelligence for every sub-task. This mostly concerns self-duplication, as it seems unlikely that we humans will get to copy any powerful and misaligned AGI before being disempowered by the first system that crosses that threshold. It remains to be discussed how likely it is that a group of weaker systems would cross that threshold before a single system does.

_________________________________________________________________________

“Firstly, due to the ease of duplicating AIs, there’s no meaningful distinction between an AI improving “itself” versus creating a successor that shares many of its properties.

Secondly, modern AIs are more accurately characterised as models which could be retrained, rather than software which could be rewritten: almost all of the work of making a neural network intelligent is done by an optimiser via extensive training. Even a superintelligent AGI would have a hard time significantly improving its cognition by modifying its neural weights directly; it seems analogous to making a human more intelligent via brain surgery (albeit with much more precise tools than we have today).
So it’s probably more accurate to think about self-modification as the process of an AGI modifying its high-level architecture or training regime, then putting itself through significantly more training. This is very similar to how we create new AIs today, except with humans playing a much smaller role.

Thirdly, if the intellectual contribution of humans does shrink significantly, then I don’t think it’s useful to require that humans are entirely out of the loop for AI behaviour to qualify as recursive improvement (although we can still distinguish between cases with more or less human involvement). “

The notion of AI designing a successor seems plausible. Training will perhaps become less of an issue as we get better at designing sophisticated simulated training environments.

My sense is that general intelligence is a matter of confronting a sufficiently expressive computational system with challenges such that the system efficiently changes its configuration (its expression) until it meets the requirements of those challenges.
All of these challenges can ultimately be framed in purely mathematical terms, and the goal is to land on configurations that express general challenge-solving capabilities: configurations capable of abstracting away from the specific challenge to a more general one while retaining the understanding and challenge-solving ability already learned.

_________________________________________________________________________

“And for an AGI to trust that its goals will remain the same under retraining will likely require it to solve many of the same problems that the field of AGI safety is currently tackling—which should make us more optimistic that the rest of the world could solve those problems before a misaligned AGI undergoes recursive self-improvement.”



I don’t think I follow this reasoning.
If the AGI has to invest time into retraining research before its first retraining, then we should consider two cases: if the problem is relatively simple, the AGI will solve it quickly, whereas if the problem is hard even for the AGI, that does not bode well for our chances of solving it first. So the optimism seems to hinge on the assumption that, in this context, the AGI is a less capable researcher than a human research team.

_________________________________________________________________________

3 Goals and Agency

“However, the link from instrumentally convergent goals to dangerous influence seeking is only applicable to agents which have final goals large-scale enough to benefit from these instrumental goals, and which identify and pursue those instrumental goals even when it leads to extreme outcomes (a set of traits which I’ll call goal-directed agency). It’s not yet clear that AGIs will be this type of agent, or have this type of goals.”



The notion of extreme outcomes seems anthropocentric here, unless it refers to a set of states that significantly limit the diversity of future states.
There are also better reasons than just intuition to assume that AGIs will fall into this category, but I agree that it is not certain.
For instance, it seems to be quite a general principle that the “final goals” drive the behaviour and indeed learning process of an AGI. In that sense, the final goal needs to be at least large-scale enough to require the expression of general intelligence.
This picture can be complicated in some ways, for example by allowing for goals that change over time.

_________________________________________________________________________

“Furthermore, we should take seriously the possibility that superintelligent AGIs might be even less focused than humans are on achieving large-scale goals.”


Granted, though we’d want to err on the side of caution here.

_________________________________________________________________________

3.1 Frameworks for thinking about agency

“current image classifiers and (probably) RL agents like AlphaStar and OpenAI Five: they can be competent at achieving their design objectives without understanding what those objectives are, or how their actions will help achieve them.”


I take this as meaning that these systems don’t appear to have a dedicated internal representation of their goals (such a representation would, we would expect, open up a strategic dimension, whose associated behaviour we would recognize as demonstrating understanding).

_________________________________________________________________________

“If we create agents whose design objective is to accumulate power, but without the agent itself having the goal of doing so (e.g. an agent which plays the stock market very well without understanding how that impacts society) that would qualify as the third possibility outlined above.”


Is the meaningful distinction here that the agent which plays the stock market only looks at a smaller picture, the system within which it tries to be competent?
It seems that a system considering a larger picture, but with the same objective, would be more competent in the long term, since it only needs to pay attention to the larger picture insofar as that is expected to help it, e.g., play the stock market.

_________________________________________________________________________

“While we do have an intuitive understanding of complex human goals and how they translate to behaviour, the extent to which it’s reasonable to extend those beliefs about goal-directed cognition to artificial intelligences is the very question we need a theory of agency to answer.”


Agreed. I can only comment that our sense of which behaviour demonstrates a commitment to which goals is bounded not only by our ability to recognize a complex objective, but also by our ability to recognize strategic behaviour with respect to that objective.

_________________________________________________________________________

“what is explicitly represented within a human brain, if anything?”


I agree that the term “explicit representation” is tricky. This relates both to our ability to recognize such a representation (which might well be spread out across a larger space, intersecting with other representations, etc.) and to the representation being “only” about the thing, right?

I do believe that any behavior can be interpreted as goal-directed in some context, so the more useful concept of goal-directed agency seems to require an internal representation of the goal. This way, even if the system’s behavior is interrupted, it can correct course onto an updated path, and it can only compute that path if it has a representation of what it is aiming at.
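As a small illustration of that last point, here is a minimal sketch (entirely hypothetical names and a toy grid planner of my own, not anything from the post) of an agent that stores its goal explicitly and can therefore recompute its path after an interruption:

```python
# Minimal sketch: an agent that keeps an explicit goal representation can
# recompute its path after an interruption; a fixed action sequence could not.
# All names here (plan, GoalDirectedAgent, grid coordinates) are illustrative.

def plan(start, goal):
    """Greedy Manhattan path on an empty grid (a stand-in planner)."""
    path, (x, y) = [], start
    while (x, y) != goal:
        if x != goal[0]:
            x += 1 if goal[0] > x else -1
        else:
            y += 1 if goal[1] > y else -1
        path.append((x, y))
    return path

class GoalDirectedAgent:
    def __init__(self, position, goal):
        self.position = position
        self.goal = goal                      # explicit internal representation of the goal
        self.path = plan(position, goal)

    def perturb(self, new_position):
        """An interruption displaces the agent; it replans from the new state."""
        self.position = new_position
        self.path = plan(new_position, self.goal)   # possible only because the goal is stored

agent = GoalDirectedAgent(position=(0, 0), goal=(3, 3))
agent.perturb((5, 0))
print(agent.path)   # a fresh route to (3, 3), recomputed after the interruption
```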

_________________________________________________________________________

“1. Self-awareness: for humans, intelligence seems intrinsically linked to a first-person perspective. But an AGI trained on abstract third-person data might develop a highly sophisticated world-model that just doesn’t include itself or its outputs.
A sufficiently advanced language or physics model might fit into this category.“


A first-person perspective is a very loaded concept. If we think, for instance, about what attention means, we find that it relates to processing the input into a form that retains the data expected to be relevant to the current objective, while representing other input data at lower resolution or not at all, weighted by its expected relevance. To the extent that the AGI observes itself and its own outputs (or those of similar systems), it is plausible for it to come to represent itself, without necessarily even reflecting that this is what it is doing.
For embodied systems that interact with a 3D environment, it seems obvious how attention would select for a good representation of the body, since the body is present in every environment and is often very relevant to model accurately. This works for more abstract and digital notions of embodiment as well.
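To make the attention picture a little more concrete, here is a minimal sketch (my own illustrative numpy example; the array shapes and the idea of a “query” standing in for the current objective are assumptions, not anything from the post) of relevance-weighted compression: inputs judged relevant dominate the resulting representation, while the rest survives only at low resolution.

```python
import numpy as np

# Minimal sketch of relevance-weighted processing: each input item is scored
# against a query standing in for the current objective, and the summary
# representation is dominated by the items judged most relevant.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 4))        # 5 input items, 4 features each (made up)
query = rng.normal(size=4)              # stands in for "the current objective"

scores = inputs @ query                                # expected relevance of each item
weights = np.exp(scores) / np.exp(scores).sum()        # softmax over relevance
summary = weights @ inputs                             # compressed, relevance-weighted representation

print(weights.round(3))   # low-weight items survive only at "low resolution"
print(summary)
```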

I do agree that it is plausible for a physics or language model to lack this property, depending on the training process. If we expect experimentation by the system to play a significant role in the development of its intelligence, however, the system would have to come into contact with its own outputs and their consequences, incentivising at least some level of self-modeling.

_________________________________________________________________________

“2. Planning: highly intelligent agents will by default be able to make extensive and sophisticated plans. But in practice, like humans, they may not always apply this ability. Perhaps, for instance, an agent is only trained to consider restricted types of plans.
Myopic training attempts to implement such agents; more generally, an agent could have limits on the actions it considers. For example, a question-answering system might only consider plans of the form “first figure out subproblem 1, then figure out subproblem 2, then...”. “


I agree, though that seems like an alignment tax that is potentially prohibitive. High-quality modular planning ability probably scales very well if the training environment is sufficiently diverse.

_________________________________________________________________________

“3. Consequentialism: the usual use of this term in philosophy describes agents which believe that the moral value of their actions depends only on those actions’ consequences; here I’m using it in a more general way, to describe agents whose subjective preferences about actions depend mainly on those actions’ consequences.
It seems natural to expect that agents trained on a reward function determined by the state of the world would be consequentialists. But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse. “


I have a strong expectation that what is true for humans here is true across many architectures. Because of the heuristic nature of “understanding”, deontological constraints can be learned as outperforming consequentialist reasoning in certain domains, so I don’t think there is anything necessarily non-consequentialist about applying different forms of reasoning where they are expected to work best (with updates, of course).

_________________________________________________________________________

“4. Scale: agents which only care about small-scale events may ignore the long-term effects of their actions. Since agents are always trained in small scale environments, developing large-scale goals requires generalisation (in ways that I discuss below). “


In principle, these systems have the unique advantage of not being constrained in their attention to large-scale events by sufficiently intense competition. However, I think the point still stands.

_________________________________________________________________________

“5. Coherence: humans lack this trait when we’re internally conflicted—for example, when our system 1 and system 2 goals differ—or when our goals change a lot over time.
While our internal conflicts might just be an artefact of our evolutionary history, we can’t rule out individual AGIs developing modularity which might lead to comparable problems. However, it’s most natural to think of this trait in the context of a collective, where the individual members could have more or less similar goals, and could be coordinated to a greater or lesser extent.”


For me, this one is the hardest to judge. For the time being, this appears very architecture-sensitive.

_________________________________________________________________________

“6. Flexibility: an inflexible agent might arise in an environment in which coming up with one initial plan is usually sufficient, or else where there are tradeoffs between making plans and executing them. Such an agent might display sphexish behaviour.
Another interesting example might be a multi-agent system in which many AIs contribute to developing plans—such that a single agent is able to execute a given plan, but not able to rethink it very well. “


Another reason for an AGI to develop towards a hierarchy of oversight and generalisation, rather than casually copying itself.
My guess is that AGIs will operate on a cognitive time scale that will greatly shift the optimal balance between “thought and action” compared to humans. Finer granularity of planning is weighed against the scale of planning steps, which a sophisticated system should be able to optimize through experience.

_________________________________________________________________________

“A question-answering system (aka an oracle) could be implemented by an agent lacking either planning or consequentialism.”


I am not so sure about that. If we are talking about superintelligent systems, an oracle could be capable of answering questions about the outcomes of long-term action, which would require the ability to “plan for someone else”. As for consequentialism, I stand by my earlier point: it is just about which model the AGI thinks works better in the given context.

_________________________________________________________________________

“A highly agentic AI which has the goal of remaining subordinate to humans might never take influence seeking actions.”


I don’t think this is the default expectation. Unless otherwise specified, such a system may be incentivised to “bring about a world” within which it can certainly and optimally remain subordinate to humans. There are many instrumental reasons to be influence-seeking insofar as that influence-seeking is not realized as insubordinate behaviour.

_________________________________________________________________________

3.2 The likelihood of developing highly agentic AGI

“If we train AGI in a model-free way, I predict it will end up planning using an implicit model anyway.”


Agreed.
_________________________________________________________________________

“Our best language models already generalise well enough from their training data that they can answer a wide range of questions. I can imagine them becoming more and more competent via unsupervised and supervised training, until they are able to answer questions which no human knows the answer to, but still without possessing any of the properties listed above.
A relevant analogy might be to the human visual system, which does very useful cognition, but which is not very “goal-directed” in its own right. “

This is a fair point, although for the sake of AI safety concerns it should be mentioned that there are relatively trivial ways to include such systems as a modular component in a less competent but more agentic system!
With the human visual system, we need to keep in mind that it is embedded in a greater cognitive system; this coevolution can constrain the computational role and ability of the visual system. There is no analogous constraint for these language models.

_________________________________________________________________________

“My underlying argument is that agency is not just an emergent property of highly intelligent systems, but rather a set of capabilities which need to be developed during training, and which won’t arise without selection for it.”


I’m largely on board with this, although I caution against taking this out of context.
High intelligence is unlikely to develop without the system having agentic attributes, since agentic evolution appears more efficient.
It is completely plausible for a superintelligent system to be non-agentic and non-influence-seeking; I believe we are just unlikely to arrive there first. Repeating the earlier point, I worry that any such non-agentic system could be integrated into an architecture in a way that turns an agentic system of relatively low intelligence into a superintelligent agentic system.
For example, a less competent AI playing the stock market could be combined with a superintelligent oracle, such that the first system consults the oracle for how to act in order to maximise profit.
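A minimal sketch of that worry, with entirely hypothetical interfaces (the Oracle and TradingAgent classes and their methods are my own illustration, not anything from the post): a simple, narrowly agentic loop can inherit most of the competence of a non-agentic oracle just by routing its decisions through it.

```python
# Sketch of the composition worry: a non-agentic question-answering system
# becomes the "brains" of an agentic wrapper that pursues a goal in a loop.
# Oracle, TradingAgent, and their methods are hypothetical illustrations.

class Oracle:
    """Stands in for a highly competent but non-agentic question-answerer."""
    def answer(self, question: str) -> str:
        # In reality this would be a powerful model; here it is a stub.
        return "buy" if "profit" in question else "hold"

class TradingAgent:
    """A simple agentic wrapper: it has a goal and acts in a loop."""
    def __init__(self, oracle: Oracle):
        self.oracle = oracle
        self.portfolio = []

    def step(self, market_state: str) -> None:
        # The wrapper itself does no sophisticated reasoning; it just asks.
        action = self.oracle.answer(
            f"Given {market_state}, what action maximises profit?"
        )
        if action == "buy":
            self.portfolio.append(market_state)

agent = TradingAgent(Oracle())
for state in ["monday", "tuesday", "wednesday"]:
    agent.step(state)
print(agent.portfolio)   # the loop's competence comes almost entirely from the oracle
```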

_________________________________________________________________________

“However, there are also arguments that it will be difficult to train AIs to do intellectual work without them also developing goal-directed agency. In the case of humans, it was the need to interact with an open-ended environment to achieve our goals that pushed us to develop our sophisticated general intelligence.”


An objective in the sense discussed here usually describes which state transitions of its environment the agent is incentivised to understand or learn, and by extension also those state transitions that are a few degrees of connection removed from the most relevant ones. In a way, the initial goal serves as a complexity-reducer that gives order and direction to how the understanding of the environment is constructed and generalised.

_________________________________________________________________________

“We might expect an AGI to be even more agentic if it’s trained, not just in a complex environment, but in a complex competitive multi-agent environment. Agents trained in this way will need to be very good at flexibly adapting plans in the face of adversarial behaviour; and they’ll benefit from considering a wider range of plans over a longer timescale than any competitor.
On the other hand, it seems very difficult to predict the overall effect of interactions between many agents—in humans, for example, it led to the development of (sometimes non-consequentialist) altruism.”


This is about the incentive landscape that gets created, with us trying to be aware of game-theoretical traps in that topology. I agree that this would most likely efficiently facilitate the development of strongly agentic AGIs.

_________________________________________________________________________

“I think most safety researchers would argue that we should prioritise research directions which produce less agentic AGIs, and then use the resulting AGIs to help us align later more agentic AGIs.”


This does seem like the prudent path, although it might be good to point out in which ways this is a bit shallow.

If we want our systems to be “intelligibly intelligent” instead of black boxes, there is a significant incentive to make them construct their “internal representations” in ways that we can interpret in terms of their usefulness with respect to their objective.

I have two points here:

1. Making systems less agentic might make them less intelligible.

2. As pointed out earlier, it is easy to underestimate the ways in which a competent non-agentic system can be used or included in an agentic system. The modifications necessary to turn a competent non-agentic system into a competent agentic system are, for fairly trivial reasons, much easier to find than the modifications that turn a system of low competence into a competent non-agentic system. We are already in the “neighborhood” of general intelligence, or should at least seriously consider that possibility.

From that perspective, it might be dangerous to develop sophisticated non-agentic systems without simultaneously having a good understanding of modular agentic systems. I’d be wary of putting research on and development of agentic systems on hold for that reason.

_________________________________________________________________________

3.3 Goals as generalised concepts

“Rather, an agent’s goals can be formulated in terms of whatever concepts it possesses”


Good insight. This links in with the idea of inner alignment: the mesa-objective is captured in terms of the concepts that the agent possesses, whereas the objective informing the reward signal may or may not get correctly conceptualized over time.

_________________________________________________________________________

“an agent which has always been rewarded for accumulating resources in its training environment might internalise the goal of “amassing as many resources as possible”.
Similarly, agents which are trained adversarially in a small-scale domain might develop a goal of outcompeting each other which persists even when they’re both operating at a very large scale. “


Agreed. It seems plausible for agents to develop “personal goals/objectives” that they somehow reward themselves for pursuing.

_________________________________________________________________________

3.4 Groups and agency

“It’s also possible that the members of a collective AGI have not been trained to interact with each other at all, in which case cooperation between them would depend entirely on their ability to generalise from their existing skills.”


If we make those copies, sure, but I would question whether an early AGI would create copies before it becomes intelligent enough to see the problems with copying that we can see.

_________________________________________________________________________

4 Alignment

“My opinion is that defining alignment in maximalist terms is unhelpful, because it bundles together technical, ethical and political problems.
While it may be the case that we need to make progress on all of these, assumptions about the latter two can significantly reduce clarity about technical issues.”

I agree with the clarity issues, although I am worried that there might not be much room between the maximalist and minimalist approaches.

Can we transform this question into one of corrigibility?

If we are talking about a superintelligence aligned with H, as in the example, one would expect it to act according to a theory of H’s mind and, in that sense, to take a maximalist approach that captures the preferences and views of H. The description is simply stored in a brain rather than in an explicit linguistic formula; in principle it is just a complex description that assigns preferences to future world states. The disadvantage of the “brain description” is that it is more difficult to interpret and is brittle (exploitable, subject to bias) in ways that we already know; the disadvantage of the other description (the formula) is that it might not even be as good as the brain one.

_________________________________________________________________________

“When I talk about misaligned AGI, the central example in my mind is not agents that misbehave just because they misunderstand what we want, or interpret our instructions overly literally (which Bostrom [2014] calls “perverse instantiation”).”


I think that “perverse instantiation” is within the class of problems that the author is talking about.

_________________________________________________________________________

“my main concern is that AGIs will understand what we want, but just not care, because the motivations they acquired during training weren’t those we intended them to have.”

Perverse instantiation is a specific instance of this, where the motivations and goals that the AGI acquired during training match our literal description but diverge from our intention. The AGI would likely understand our real intentions, but not care about them.

_________________________________________________________________________

“We might hope that by carefully choosing the tasks on which agents are trained, we can prevent those agents from developing goals that conflict with ours, without requiring any breakthroughs in technical safety research. Why might this not work, though?”


Very important in my opinion.

_________________________________________________________________________

4.1 Outer and inner misalignment: the standard picture

“Even if we solve outer alignment by specifying a “safe” objective function, though, we may still encounter a failure of inner alignment: our agents might develop goals which differ from the ones specified by that objective function. This is likely to occur when the training environment contains subgoals which are consistently useful for scoring highly on the given objective function, such as gathering resources and information, or gaining power.
If agents reliably gain higher reward after achieving such subgoals, then the optimiser might select for agents which care about those subgoals for their own sake.”

Agreed, though I feel like there might be some confusion here.

As far as I understand, mesa-optimization arises out of the environment being consistently more specific than the objective requires.
As an example, evolution might select for animals that can recognize fruit; but if the particular environment these animals inhabit only ever features red fruit, the animal may end up rejecting green fruit and perhaps eating poisonous red things in a different environment. It is simply an issue of the objective that the animal internalized not being correctly abstract or general.
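To make this concrete, here is a minimal toy sketch (my own made-up features and data, not anything from the post): a learner rewarded for recognizing fruit in an environment where all fruit happens to be red internalizes “red” rather than the intended, more general concept, and then misgeneralizes.

```python
import numpy as np

# Toy version of the red-fruit example: during training, "is red" perfectly predicts
# "is fruit", so a simple learner latches onto colour instead of the intended concept.
# Features: [is_red, is_round]; label: is_fruit. All data here is made up.
train_X = np.array([[1.0, 1.0],   # red, round fruit
                    [1.0, 1.0],   # red, round fruit
                    [1.0, 0.0],   # red, non-round fruit
                    [0.0, 0.0]])  # neither red nor round, not fruit
train_y = np.array([1.0, 1.0, 1.0, 0.0])

# Least-squares fit as a stand-in for whatever optimiser trains the agent.
w, *_ = np.linalg.lstsq(train_X, train_y, rcond=None)
print(w)                             # approximately [1, 0]: "fruit" internalised as "red"

green_fruit = np.array([0.0, 1.0])   # fruit, but green: never seen in training
red_poison = np.array([1.0, 1.0])    # red and round, but poisonous, in a new environment
print(green_fruit @ w)               # near 0: the green fruit gets rejected
print(red_poison @ w)                # near 1: the poisonous red thing gets accepted
```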

Now in the example, influence-seeking behaviors are abstract behaviors that will be selected for in a wide variety of contexts, as long as we reward for competence. They are actually more general than what we are likely to reward the agent for.
There are thus two issues here:

One relates to mesa-optimisation, e.g. the AGI caring about the NPC humans from its training environment more than about real humans.

The other relates to the dilemma of rewarding competence and how that usually selects for influence-seeking behavior. We could try to punish such behavior insofar as we can recognize it, but that seems problematic in that it might select for a type of influence-seeking that we don’t recognize, either because it is deceptive or because we simply fail to notice (the difference being whether the agent has encountered deception).

_________________________________________________________________________

“Of course, late in training we expect our AIs to have become intelligent enough that they’ll understand exactly what goals we intended to give them. But by that time their existing motivations may be difficult to remove, and they’ll likely also be intelligent enough to attempt deceptive behaviour”

Strong point. It is mostly about the system’s early development, then, when we can still influence its motivations with certainty and detect deception (though I believe there should be information-theoretic results on setups where the agent can’t know whether it is in a position to successfully deceive us or not, so maybe that is hopeful for testing and research on more intelligent agents).

I think it might be relevant to think of a simplified notion of human intentions that can be accurately scaled to a complete one. That simplified notion is where we need to “catch” the AI, and also where we evaluate whether we managed to. This is obviously easier the simpler the overall system still is.
As an example, consider a two-part process. First, an AI seed reaches a minimum degree of general intelligence in a configuration where we can still change its goals relatively freely, depending on the lessons we want it to learn. Second, we instil the simplified notion and keep the agent at that level for extensive testing before scaling it up, testing on multiple occasions whether it translates and refines the simple notion to a more complex environment in a desirable way.

_________________________________________________________________________

“One potential approach involves adding training examples where the behaviour of agents motivated by misaligned goals diverges from that of aligned agents. Yet designing and creating this sort of adversarial training data is currently much more difficult than mass-producing data”


A very challenging problem. It seems that an experimental training set-up where we can explore what types of training have which effects on a promising architecture is the way to go about this.

_________________________________________________________________________

4.2 A more holistic view of alignment

“As a particularly safety-relevant example, neural networks can be modified so that their loss on a task depends not just on their outputs, but also on their internal representations”


I’d like to see more discussion of this. Shaping internal representations appears to be an endeavor very much in line with interpretability research.
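As a minimal sketch of the general idea (my own illustration, not the specific method the author has in mind), a network’s loss can include a penalty on its hidden activations, here an illustrative sparsity term, in addition to the usual output error:

```python
import numpy as np

# Minimal sketch: a loss that depends on internal representations as well as outputs.
# The auxiliary term here encourages sparse hidden activations (one illustrative choice).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                  # a small batch of inputs (made up)
y = rng.normal(size=(8, 1))                  # targets (made up)
W1, W2 = rng.normal(size=(4, 16)), rng.normal(size=(16, 1))

hidden = np.maximum(0.0, x @ W1)             # internal representation (ReLU layer)
output = hidden @ W2

task_loss = np.mean((output - y) ** 2)       # ordinary output loss
repr_loss = np.mean(np.abs(hidden))          # penalty on the internal representation
total_loss = task_loss + 0.1 * repr_loss     # training would backpropagate through both

print(task_loss, repr_loss, total_loss)
```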
_________________________________________________________________________

“objective functions, in conjunction with other parts of the training setup, create selection pressures towards agents which think in the ways we want, and therefore have desirable motivations in a wide range of circumstances”


I really like this.
_________________________________________________________________________

“It’s not the case that AIs will inevitably end up thinking in terms of large-scale consequentialist goals, and our choice of reward function just determines which goals they choose to maximise. Rather, all the cognitive abilities of our AIs, including their motivational systems, will develop during training.
The objective function (and the rest of the training setup) will determine the extent of their agency and their attitude towards the objective function itself! This might allow us to design training setups which create pressures towards agents which are still very intelligent and capable of carrying out complex tasks, but not very agentic—thereby preventing misalignment without solving either outer alignment or inner alignment. “


I think this is a very promising perspective, my remaining concern mostly being about the exploitability of non-agentic systems.

It also appears to me that such a setup will enable us to train competent agentic systems significantly before enabling us to train competent non-agentic systems. In a way, that is just the “alignment tax” all over again, and probably requires a solution in terms of research coordination, rather than an improvement in our conceptual understanding.

_________________________________________________________________________

5 Control

5.1 Disaster scenarios

5.2 Speed of AI development

“1. The development of AGI will be a competitive endeavour in which many researchers will aim to build general cognitive capabilities into their AIs, and will gradually improve at doing so. This makes it unlikely that there will be low-hanging fruit which, when picked, allow large jumps in capabilities. (Arguably, cultural evolution was this sort of low-hanging fruit during human evolution, which would explain why it facilitated such rapid progress.) “


It is not clear to what degree gradual improvements will be the dominant mode of progress for such teams. Improvement in cognition can be very explosive (especially from our perspective, given that these systems can in principle operate at electronic rather than biological speeds), since cognition has an exaptive quality to it.
Two analogies:
1. The human immune system can deal with all manner of harmful viruses and bacteria because it is sufficiently expressive to build receptors of arbitrary form that can bind to offending visitors.
2. When using a set of vectors to explore a multi-dimensional space, adding just a single vector can open up the exploration of a whole new dimension, if the relevant basis vector can now be formed through linear combination (see the small sketch below).
We understand that neural networks are fully computationally expressive, but our understanding of which training regimes or modular setups enable an efficient search that fully exploits that expressivity is still lacking.
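The second analogy can be made concrete with a small numerical sketch (my own illustration): three vectors confined to a plane span only two dimensions, and adding a single further vector suddenly makes the third dimension reachable through linear combination.

```python
import numpy as np

# Sketch of analogy 2: adding a single vector can unlock an entire new dimension.
flat = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0]])              # three vectors, all stuck in the z = 0 plane
print(np.linalg.matrix_rank(flat))              # 2: only a plane is reachable

extended = np.vstack([flat, [0.0, 1.0, 1.0]])   # one additional vector with a z-component
print(np.linalg.matrix_rank(extended))          # 3: the full space is now reachable

# e.g. the basis vector (0, 0, 1) is now a linear combination: (0, 1, 1) - (0, 1, 0)
print(extended[3] - extended[1])                # [0. 0. 1.]
```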
_________________________________________________________________________

“2. Compute availability, which on some views is the key driver of progress in AI, increases fairly continuously. “

Trying to inhabit the counter-position: we have Moore’s law, more specialized processor designs, and increased spending, all contributing to compute availability. That does seem pretty explosive once you take into account the possibility of algorithmic innovation or the exploitation of modular potential, which is the term I use for the different ways of sticking together things that we already have.

_________________________________________________________________________

“3. Historically, continuous technological progress has been much more common than discontinuous progress [Grace, 2018b]. For example, progress on chess-playing AIs was steady and predictable over many decades [Grace, 2018a]. “


If we analysed the factors driving those instances of technological progress and compared them to the factors driving progress in AGI research, I’d expect we would find additional factors at work in AGI research.

_________________________________________________________________________

5.3 Transparency of AI systems

“A second approach is to create training incentives towards transparency. For example, we might reward an agent for explaining its thought processes, or for behaving in predictable ways. Interestingly, ideas such as the cooperative eye hypothesis imply that this occurred during human evolution, which suggests that multi-agent interactions might be a useful way to create such incentives (if we can find a way to prevent incentives towards deception from also arising). “


I believe this to be a great research direction. It strikes me as more promising than the first approach (interpretability tools).

_________________________________________________________________________

“A third approach is to design algorithms and architectures that are inherently more interpretable.”


I think this would heavily inhibit competence, though perhaps not “fatally”. One might also reason that this would make our systems more human-like, more so than the second approach.

_________________________________________________________________________

5.4 Constrained deployment strategies

5.5 Human political and economic coordination

“It will be much easier to build a consensus on how to deal with superintelligence if AI systems approach then surpass human-level performance over a timeframe of decades, rather than weeks or months. This is particularly true if less-capable systems display misbehaviour which would clearly be catastrophic if performed by more capable agents.”


I would think that we are already in a position to demonstrate the second point, are we not?

_________________________________________________________________________

6 Conclusion

I am pretty happy with the conclusion of the post. Personally, I feel that the danger is clearer than the overall discussion seems to suggest, but that is to some degree the purpose of the discussion. I pointed out my gripes with the second-species argument, but since it arrives at the same conclusion here, I am happy to go along with the reasoning. I also agree with the author’s ordering of his certainty regarding the four points.