The assumption that AGI is a likely development within the coming decades is quite controversial among ML researchers. ICML reviewers might question whether this claim is justified, and how much of the paper is relevant if you’re more dubious about the development of AGI.
The definition of situational awareness feels quite vague to me. As stated (“identifying which abstract knowledge is relevant to the context in which they’re being run, and applying that knowledge when choosing actions”), it seems to encompass, for example, the ability to ingest information such as “pawns can attack diagonally” and apply it when playing a game of chess. Ajeya’s explanation of situational awareness feels much clearer to me.
Shah et al. [2022] speculate that InstructGPT’s competent responses to questions its developers didn’t intend it to answer (such as questions about how to commit crimes) were a result of goal misgeneralization.
Taking another look at Shah et al., this doesn’t seem like a strong example to me.
Secondly, there are reasons to expect that policies with broadly-scoped misaligned goals will constitute a stable attractor which consistently receives high reward, even when policies with narrowly-scoped versions of these goals receive low reward (and even if the goals only arose by chance). We explore these reasons in the next section.
This claim felt confusing to me, and it wasn’t immediately clear how the following section, “Power-seeking behavior”, supports it. But I guess if you compare a misaligned goal of maximizing paperclips over the next hour with one of maximizing paperclips over the very long term, the narrowly-scoped goal would receive low reward (the AI soon gets caught acting on it), while the broadly-scoped goal would receive high reward (the AI behaves well until it can act without being caught).
Assisted decision-making: AGIs deployed as personal assistants could emotionally manipulate human users, provide biased information to them, and be delegated responsibility for increasingly important tasks and decisions (including the design and implementation of more advanced AGIs), until they’re effectively in control of large corporations or other influential organizations. An early example of AI persuasive capabilities comes from the many users who feel romantic attachments towards chatbots like Replika [Wilkinson, 2022].
I don’t think Replika is a good example of “persuasive abilities” – it doesn’t really persuade users to do much of anything.
Regardless of how it happens, though, misaligned AGIs gaining control over these key levers of power would be an existential threat to humanity.
The section “Misaligned AGIs could gain control of key levers of power” feels underdeveloped. I think it might be helpful to include additional examples, such as the ones in “What could an AI-caused existential catastrophe actually look like?” from 80,000 Hours.
Re the point about assuming AGI: IMO this is right. For a legible paper, you more or less shouldn’t assume it, but rather present its consequences conditionally.
Re the situational awareness definition: yeah, I think copying Ajeya’s explanation is good.