What’s the best way to share your progress skilling up in AI alignment? Maybe it’s highly polished posts on your well-considered and nuanced insights into alignment. Or maybe it’s massively overconfident takes you should really just keep in your private drafts. Let’s find out with the power of science. Here is the experiment:
I dump a raw extract here of my research notes from reading through the core material of Richard Ngo’s Safety Fundamentals course. This only covers “new” ideas I had (quotation marks, as I have yet to generate an idea I can’t later find in the literature). These are unredacted, and contain some weird grammar, jumps in reasoning, and many stupid questions. Then we see if anyone gets anything out of it. If so, we all learn something. If not, I mentally award myself another badge for awkward self-disclosure.
A little more context before we dive in: This is just the “ideas” page from the Obsidian vault I started for all my AIS readings. I can highly recommend this tool, and shout out to jhoogland for making me aware of it. Additionally, I started thinking about AIS only about 3-4 months ago in my spare time, so my questions, models, and hypotheses are still quite naive.
With all those disclaimers out of the way, here we go:
Raw Notes
Can we build a population of divergent or varied mini AIs that work together, such that any one bad actor/unaligned AI is suppressed by the rest? This is what happens in evolution at both the cellular and organismal level. It would mirror an AI immune system, in a way. But it would require AI to be built up modularly. It would avoid failure modes related to homogeneity. - Idea source: Intelligence Explosion Extract. Maybe look into Society of Mind by Minsky?
The Alignment Tax on this seems insanely high cause you basically get very inefficient systems? Source
Coordination cost increases superlinearly with scale, so you don’t want a huge number of modular AIs coordinating, cause then the Alignment Tax is through the roof. However, is there an optimal number of AGIs that you can have coordinating such that the Alignment Tax is manageable while you still get robustness/an immune system against misalignment? So like … what are the arguments for purposefully moving to a multi-agent equilibrium for AGI?
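Purely to make that tradeoff concrete, here is a toy numerical sketch. All functional forms and constants are assumptions I made up for illustration: coordination cost is taken to grow superlinearly with the number of agents, while the chance that a single misaligned agent goes unsuppressed is taken to shrink as the collective grows.

```python
# Toy model of the "how many agents?" tradeoff. Everything here is an
# assumed placeholder, purely illustrative, not a real cost or risk model.

def coordination_cost(n, alpha=1.5):
    # assumed superlinear growth in the number of coordinating agents
    return n ** alpha

def unsuppressed_misalignment_risk(n, p_bad=0.1):
    # assumed: one bad actor only slips through if the "immune system" of
    # peers fails; crudely modeled as risk shrinking with collective size
    return p_bad ** max(1, n // 2)

def total_cost(n, risk_weight=100.0):
    # alignment-tax-style total: coordination overhead plus weighted risk
    return coordination_cost(n) + risk_weight * unsuppressed_misalignment_risk(n)

best_n = min(range(1, 21), key=total_cost)
print(best_n, total_cost(best_n))  # "optimal" collective size under these assumptions
```

Under these made-up numbers there is an interior optimum; the real question is whether anything like these curves holds for actual systems.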
How have evo algos been combined with NN/DL? Can we set up a population of AGIs to evolve toward an optimum that includes alignment? What would alignment look like as part of a fitness function (Outer Alignment)?
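(For the first question: evolutionary methods have been applied to neural nets for a while, e.g. neuroevolution approaches like NEAT, and evolution strategies as an alternative to gradient-based RL.) As a minimal sketch of the second question, here is what folding an alignment term into a fitness function could look like. The `alignment_score` below is a hypothetical stand-in for whatever outer-alignment metric you would actually need, which is of course the hard part:

```python
import random

def task_performance(weights):
    # placeholder task objective for the "network" defined by weights
    return -sum(w * w for w in weights)

def alignment_score(weights):
    # hypothetical stand-in for a measurable alignment proxy; defining this
    # well is exactly the Outer Alignment problem the note asks about
    return 1.0 - abs(sum(weights)) / len(weights)

def fitness(weights, lam=0.5):
    # outer alignment folded into fitness as a weighted term
    return (1 - lam) * task_performance(weights) + lam * alignment_score(weights)

def evolve(pop_size=20, dim=8, generations=50, sigma=0.1):
    population = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [
            [w + random.gauss(0, sigma) for w in random.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

print(fitness(evolve()))
```

Note the obvious failure mode: the population optimizes whatever proxy `alignment_score` measures, not alignment itself.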
A collective of AGIs, say 10, can keep each other in check, but would most likely cut humans out of the loop. Is there any way to stop AGIs from coordinating with each other? (probably not …)
How do we make truth-seeking AI? Would that be a solution for Goodhart’s Law? Idea spawned off What Failure Looks Like (Christiano, 2019). What makes truth-seeking important? When you rely on results to survive or thrive. Which is true when the environment challenges you (competition with other agents, or actual environmental challenges). If there are none of those, then truth has no value… Truth only has value if actual results matter more than measurements. We need to construct AGI such that its actual results matter more than the measurements of those results.
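To make the “measurements versus results” point concrete, here is a toy Goodhart example (the numbers and functional forms are made up): a proxy metric that tracks the true result at low optimization pressure, but diverges from it once you optimize the measurement directly.

```python
# Toy Goodhart's Law illustration; both functions are assumed for illustration.

def true_result(x):
    # the thing we actually care about: improves at first, then degrades
    return x - 0.1 * x ** 2

def measured_proxy(x):
    # the measurement: keeps rewarding more x indefinitely
    return x

best_for_proxy = max(range(21), key=measured_proxy)
best_for_result = max(range(21), key=true_result)

print(best_for_proxy, true_result(best_for_proxy))    # x=20 -> true result -20.0
print(best_for_result, true_result(best_for_result))  # x=5  -> true result   2.5
```

An agent whose survival depends on `true_result` has a reason to notice the divergence; one that is only ever scored on `measured_proxy` does not.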
Work out the wish list on the whiteboard
Why would RLHF ever work? Our outputs can be manipulated and are imperfect signals/proxies for our true preferences. I’d like to ignore RLHF but might need to construct a proper argument around this? Or not?
Can I write a “proof” on why we shouldn’t rely on human feedback? Then at least I can make a case to make others stop using it, or I’ll be convinced we should use it if I fail.
RLHF is good for capabilities, but bad for alignment. Imho. Why? Cause it helps speed up the process toward human-comparable intelligence, but then won’t scale past that, as we can’t assess superhuman plans anymore! So this is a bad attractor in research space, cause it boosts capabilities, creates an illusion of alignment, but will actually break once we can’t recover anymore.
“Our algorithm’s performance is only as good as the human evaluator’s intuition about what behaviors look correct, so if the human doesn’t have a good grasp of the task they may not offer as much helpful feedback.” (From: Learning from Human Preferences, Christiano, 2017)
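For reference, the core mechanism in that paper is a reward model trained from pairwise human comparisons of trajectories, which is then optimized against with RL. A minimal sketch of that preference loss (the toy architecture and shapes below are my own, not the paper’s) makes the quoted limitation concrete: the learned reward can only ever be as good as the comparisons the human is able to judge.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    # toy stand-in: maps a fixed-size trajectory encoding to a scalar reward
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, traj):
        return self.net(traj).squeeze(-1)

def preference_loss(reward_model, traj_a, traj_b, human_prefers_a):
    # Bradley-Terry style objective: P(A preferred) = sigmoid(r(A) - r(B));
    # the human comparison labels are the only ground truth available
    logits = reward_model(traj_a) - reward_model(traj_b)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, human_prefers_a.float()
    )

# usage sketch with random stand-in data
model = RewardModel()
traj_a, traj_b = torch.randn(4, 32), torch.randn(4, 32)
labels = torch.tensor([1.0, 0.0, 1.0, 1.0])  # stand-in for human judgments
preference_loss(model, traj_a, traj_b, labels).backward()
```

Nothing in this loss knows whether the human judged correctly; past the point where plans exceed human evaluability, the reward model keeps confidently fitting whatever the labels say.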
See The easy goal inference problem is still hard (Christiano, 2018) for more notes on this [I wrote more notes in Obsidian]
Subhuman intelligence is a tool. Superhuman intelligence is a new life form. And we’ll be its tools. We shouldn’t want to create a new life form that will, by definition, dominate us. We should keep AI dumber than us, and work on uplifting ourselves.
Second take: Check my notes on AI Versus Human Intelligence Enhancement—Chapter 12 (Yudkowsky, 2008) (listed in the next section) about my new ideas on the Superhuman Intelligence Iteration Problem and the Alignment Anchoring Problem.
Integrate human Collective Intelligence with AGI, such that they scale together. Probably requires some brand of Superhumans to actually keep pace and not accidentally be outstripped, but unsure. Idea spawned by Summarizing Books with Human Feedback (Wu, 2021). Iterated Distillation and Amplification seems pointless though, cause the humans need to scale with the AGI, and so we are still back to Superhumans. Also, the IDA paradigm would run so slowly it could never compete with RL, and then you end up with the same argument as Tool AI vs Agent AI.
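For context, the IDA loop being dismissed here alternates two steps: amplification (a human, assisted by copies of the current model, answers a question by decomposing it) and distillation (a new model is trained to imitate that slower amplified system). A schematic sketch, with every component stubbed out by toy placeholders of my own:

```python
# Schematic IDA loop; everything here is a toy placeholder, not the actual
# scheme's machinery.

def decompose(question):
    # placeholder decomposition: break the question into smaller pieces
    return question.split()

def amplify(human_combine, model, question):
    # Amplification: the model answers subquestions, the human combines them
    subanswers = [model(q) for q in decompose(question)]
    return human_combine(question, subanswers)

def distill(amplified_system, questions):
    # Distillation: "train" a faster model to imitate the amplified system
    # (here just memorized answers with a fallback)
    table = {q: amplified_system(q) for q in questions}
    return lambda q: table[q] if q in table else amplified_system(q)

def ida(human_combine, initial_model, questions, rounds=3):
    model = initial_model
    for _ in range(rounds):
        amplified = lambda q, m=model: amplify(human_combine, m, q)
        model = distill(amplified, questions)
    return model

# usage sketch
base_model = lambda q: q.upper()          # stand-in initial model
human = lambda q, subs: " ".join(subs)    # stand-in for human judgment
final = ida(human, base_model, ["what is safe ai"])
print(final("what is safe ai"))
```

The note’s complaint maps onto the structure: the human sits inside every amplification call, so the loop is bottlenecked on human speed and human judgment at every round.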
Should I write a list of bad assumptions people keep making in alignment work? E.g., that multiple AGIs won’t coordinate through an unseen channel (see AI Safety via Debate for missing this assumption), or that agency is separate from intelligence in any practical case (I don’t see how that could be true?), or that suffering is a relevant risk from AGI (suffering is inefficient, it’s an anti-convergent goal).
Can we string multiple techniques together to mitigate all issues? But how would you mitigate something like multiple AGIs coordinating?
Is our cognition the same as artificial cognition? Like do we run the same algorithm, basically? And if not, can AGI just copy our algorithm and run off us? Probably yes, but unsure. Needs more thought.
Are there areas of intermediate danger we could deploy AGI into? As in, areas that might cause catastrophic harm but not extinction. Analogous examples would be nuclear reactors blowing up (mid range) versus global nuclear war (extinction risk). Right now the discussion is along the lines of ‘safe’ (e.g. training) versus dangerous environments (the world), but if there is something in between then 1) we could iterate, 2) we could mobilize humanity against AGI development till it’s safe, cause of the MNM effect
The “Subsystem alignment” problem in Embedded Agents (Garrabrant, 2018) is about how to ensure subagents don’t take over. Subagents sound like a form of Factored Cognition, but there they deal with the conflict by making every subsystem identical to the higher system, I think? Diversity of (sub)agents creates conflicts of goals/motives. Isn’t this similar to the Simulacra problem? Would it be efficient to make a copy of yourself that is then dismissed at time T such that your copy does not have the time to diverge?
Gather a List of Common AGI intuitions. Review their arguments and counterarguments. Map how combinations of different intuitions end up in different places of problem and solution space.
More notes on Superhumans
Based on AI Versus Human Intelligence Enhancement—Chapter 12 (Yudkowsky, 2008)
Superhumans won’t work according to Yudkowsky cause:
1. Human brains are too fragile. Augmenting them is likely to make them go insane.
2. It’s much slower than AGI development, cause scaling up an existing design is harder than building one from scratch at a bigger scale
3. Ethical constraints of fiddling with brains
4. Same problem as generating an unaligned superhuman intelligence: You make the human much smarter, but break them in a way that they go insane. Then they will still try to kill you or build AGI. Except now you are iterating over humans you need to kill each time …
My counterarguments to each:
1. Human brains just seem fragile cause we can’t freely experiment without hurting people badly. Programmers break their code continuously.
2. Scaling up the existing design would be faster than starting from scratch if we *truly* understood the design! But we don’t, cause we can’t experiment/iterate.
3. This is the iteration problem. We could solve this if we could fully simulate the human brain… oooh, that’s why Yudkowsky starts with that big ask instead of poking brains and working your way up… But simulating the brain is harder than AGI, so then AGI wins again… So back to the human trial iteration problem.
4. Yes, this would be the actual alignment problem in humans, but you have an a priori higher chance of a superhuman being aligned with humans than an AGI, cause our alignment is anchored in our brains, and a superhuman’s brain descends from ours!
Conclusion 1: Iteration Problem
We need to solve the Iteration Problem of Superhuman General Intelligence.
- Solving it for AI means ensuring alignment before deployment
- Solving it for humans means ensuring alignment before enhancement and figuring out how to iterate over human experiments.
Conclusion 2: Alignment Anchor Problem
We need to solve how to point a superhuman intelligence at the values that are embedded in our minds and nowhere else.
- Solving for AI means finding a perfect extraction from our mind into a conceptual truth that can’t be afflicted by Goodhart’s Law
- Solving for humans means stabilizing the values/reward circuits in our mind such that cognitive enhancement will not shift or destroy them.

Let’s Compare Notes