Cognitive Emulation: A Naive AI Safety Proposal
This is part of the work done at Conjecture.
This post has been reviewed before publication as per our infohazard policy. We thank our external reviewers for their comments and feedback.
This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler, and still useful, step towards a full alignment solution.
Unfortunately, given that most other actors are racing to build AIs that are as powerful and general as possible, we won’t share much in terms of technical details for now. In the meantime, we still want to share some of our intuitions about this approach.
We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole.[1]
In Brief
The core intuition is that instead of building powerful, Magical[2] end-to-end systems (as the current general paradigm in AI is doing), we instead focus our attention on trying to build emulations of human-like things. We want to build systems that are “good at chess for the same reasons humans are good at chess.”
CoEms are a restriction on the design space of AIs to emulations of human-like stuff. No crazy superhuman blackbox Magic, not even multimodal RL GPT5. We consider the current paradigm of developing AIs that are as general and as powerful as possible, as quickly as possible, to be intrinsically dangerous, and we focus on designing bounded AIs as a safer alternative to it.
Logical, Not Physical Emulation
We are not interested in direct physical emulation of human brains or simulations of neurons, but in “logical” emulation of thought processes.[3] We don’t care about whether underlying functions are implemented in the same way as they are in the system we are trying to emulate, just that the abstraction over their function holds, and is not leaky.[4]
Minimize Magic
In the current paradigm, we generally achieve new capabilities through an increase in Magic. We throw more compute at black boxes that develop internal algorithms we have no insight into. Instead of continually increasing the amount of Magic present in our systems, we want to actively decrease it, so that we can more cleanly implement and understand how new capabilities are achieved. Some amount of Magic will realistically be needed to implement many useful functions, but we want to minimize the number of times we have to use such uninterpretable methods, and clearly keep track of where we are using them, and why.
CoEms are much “cleaner” than Ems, which are still ultimately big black boxes of weird computation; in the CoEm paradigm, we keep careful track of where the Magic is and try to keep its presence to a minimum.[5]
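As a purely illustrative sketch (the wrapper, names, and toy components here are stand-ins, not part of any actual CoEm design), one way to “keep track of where the Magic is” in a composed system is to force every call into an uninterpretable component through an explicit, logged wrapper, while the understandable components remain ordinary code:

```rust
use std::sync::Mutex;

// Global registry of every place a black-box ("Magic") component was invoked.
static MAGIC_LOG: Mutex<Vec<&'static str>> = Mutex::new(Vec::new());

// Any call into an uninterpretable component must go through this wrapper,
// so the Magic is easy to enumerate and audit.
fn magic<T>(label: &'static str, blackbox: impl FnOnce() -> T) -> T {
    MAGIC_LOG.lock().unwrap().push(label);
    blackbox()
}

// An understandable, hand-written component: no Magic involved.
fn parse_move(input: &str) -> Option<(char, u8)> {
    let mut chars = input.chars();
    let file = chars.next()?;
    let rank = chars.next()?.to_digit(10)? as u8;
    Some((file, rank))
}

fn main() {
    let mv = parse_move("e4");
    // The only uninterpretable step is clearly marked and logged.
    let score: f32 = magic("position-evaluator", || 0.3);
    println!("move = {:?}, score = {}", mv, score);
    println!("Magic used: {:?}", MAGIC_LOG.lock().unwrap());
}
```

The point of the sketch is only the bookkeeping discipline: the black-box evaluator could be a large model, but the rest of the system always knows exactly where, and how often, it was consulted.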
Predict, Track and Bound Capabilities
In the current dominant machine learning paradigm[6], there are absolutely no guarantees about, nor understanding of, what is being created. Power laws don’t tell us anything about what capabilities will emerge or what other properties our systems will actually have.
One of the core hopes of shifting to a CoEm paradigm is that far more deeply understanding what we are building should allow us to predictively bound our system’s capabilities to a human-like regime. This eliminates the problem of being unable to know when an ostensibly harmless system passes from an understandable, harmless capabilities regime into an unprecedented, dangerous regime.
Exploit the Human Regime
We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties.[7] Any scheme for building systems that involves humans should allow you to swap those humans for CoEms without breaking or drastically altering its behavior.[8]
We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime. The reason why e.g. normal laws and regulations will not work for AGI is that we have no way of ensuring that the AGI that gets built will obey the capability constraints[9] that are implicitly assumed in our social and legal mechanism design. By the definition of the paradigm, CoEms have the property of being understandable enough that we can ensure they implement human-level capabilities and allow the user to ensure this regime isn’t accidentally exited.[10]
Human(ish), Corrigible(ish)
We observe that humans are sorta, kinda, sometimes, corrigible-ish. They can focus their attention on many different things and optimize decently hard. Speedrunners can put massive amounts of effort into optimizing for relatively arbitrary numbers going down, scientists can design experiments and generate new knowledge, workers can generally be paid to perform many kinds of tasks.
We expect that this retargetability is something that can rather naturally be achieved in the process of implementing human-like cognition, and that many other factors, such as emotions, values, desires, etc., are mostly contingent and can be factored out to a large extent.
=> CoEm
By combining these five factors, we should be left with a system that:
Is built on understandable, discoverable and implementable ML and computational building blocks.
Does not have so much Magic inside of it that we cannot even put bounds on its possible consequences and capabilities.[11]
Can be sufficiently understood and bounded to ensure it does not suddenly dramatically shift its behaviors, properties and capabilities.[12]
Is well situated in the human(ish) capabilities regime and, when in doubt, will default to human-like failure modes rather than completely unpredictable behaviors.
Is retargetable enough to be deployed to solve many useful problems and not deviate into dangerous behavior, as long as it is used by a careful user.
Conclusion
Instead of building black box, end-to-end Magical systems, we suggest composing simpler systems and reintegrating human knowledge into the development process. While this is a slower path to get to AGI[13], we believe it to be much safer.
A massive amount of alignment insight can be gained purely from mining current-level systems, and we should focus on exhausting those insights before pushing the capabilities frontier further.
CoEms, if successful, would not be strongly aligned CEV agents that can be left unsupervised to pursue humanity’s best interests. Instead, CoEms would be a strongly constrained subspace of AI designs that limits[14] systems from entering regimes of intelligence and generality that would violate the assumptions that our human-level systems and epistemology can handle.
Once we have powerful systems that are bounded to the human regime, and can corrigibly be made to do tasks, we can leverage these systems to solve many of the hard problems necessary to exit the acute vulnerable period, such as by vastly accelerating the progress on epistemology and more formal alignment solutions that would be applicable to ASIs.
We think this is a promising approach to ending the acute risk period before the first AGI is deployed.
- ^
Similar ideas have been proposed by people and organizations such as Chris Olah, Ought and, to a certain degree, John Wentworth, Paul Christiano, MIRI, and others.
- ^
When we use the word “Magic” (capitalized), we are pointing at something like “blackbox” or “not-understood computation”. A very Magical system is a system that works very well, but we don’t know why or how it accomplishes what it does. This includes most of modern ML, but a lot of human intuition is also (currently) not understood and would fall under Magic.
- ^
While Robin Hanson has a historical claim to the word “em” to refer to simulations of physical human brains, we actually believe we are using the word “emulation” more in line with what it usually means in computer science.
- ^
In other words, if we implement some kind of human reasoning, we don’t care whether under the hood it is implemented with neural networks, or traditional programming, or whatever. What we care about is that a) its outputs and effects emulate what the human mind would logically do and b) it does not “leak”. By “leak” we mean something like “no unaccounted-for weirdness happens in the background by default, and if it does, it’s explicit.” For example, in the Rust programming language, by default you don’t have to worry about unsafe memory accesses, but you have a special “unsafe” keyword you can use to mark a section of code as no longer having these safety guarantees; this way, you can always know where the Magic is happening, if it is happening. We want similar explicit tracking of Magic.
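For illustration, here is a minimal Rust sketch of this kind of explicit marking (the array and values are just placeholders):

```rust
fn main() {
    let values = [10u8, 20, 30];

    // Safe by default: bounds-checked access, no hidden weirdness.
    let second = values[1];

    // Anything without the usual guarantees must be explicitly marked.
    let third = unsafe {
        // get_unchecked skips the bounds check; the programmer takes responsibility.
        *values.get_unchecked(2)
    };

    println!("second = {second}, third = {third}");
}
```

The “unsafe” block is greppable, so a reviewer can find every place the guarantees are suspended; that is the property we want for Magic.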
- ^
The Safety Juice™ that makes e.g. Eliezer like Ems/Uploads as a “safe” approach to AGI comes from a fundamentally different source than in CoEms. Ems are “safe” because we trust the generating process (we trust that uploading results in an artifact that faithfully acts like the uploaded human would), but the generated artifact is a black box. In CoEms, we aim to make an artifact that is in itself understandable/”safe”.
- ^
Roughly defined as something like “big pretrained models + finetuning + RL + other junk.”
- ^
Note that we are not saying humans are “inherently aligned”, or robust to being put through 100000 years of RSI, or whatever. We don’t expect human cognition to be unusually robust to out of distribution weirdness in the limit. The benefit comes from us as a species being far more familiar with what regimes human cognition does operate ok(ish) in…or at least in which the downsides are bounded to acceptable limits.
- ^
This is a good litmus test for whether you are actually building CoEms, or just slightly fancier unaligned AGI.
- ^
And other constraints, e.g. emotional, cultural, self-preservational etc.
- ^
A malicious or negligent user could still absolutely fuck this all up, of course. CoEms aren’t a solution to misuse, but instead a proposal for getting us from “everything blows up always” to “it is possible to not blow things up”.
- ^
For example, with GPT3 many, many capabilities were only discovered long after it was deployed, and new use cases (and unexplainable failure modes) for these kinds of models still are being discovered all the time.
- ^
As we have already observed with e.g. unprecedented GPT3 capabilities and RL misbehavior.
- ^
In the same way that AlphaZero is a more powerful, and in some sense “simpler” chess system than Deep Blue, which required a lot of bespoke human work, and was far weaker.
- ^
Or rather “allow the user to limit”.
What? A major reason we’re in the current mess is that we don’t know how to do this. For example, we don’t seem to know how to build a corporation (or more broadly an economy) such that its most powerful leaders don’t act like Hollywood villains (racing for AI to make a competitor ‘dance’)? Even our “AGI safety” organizations don’t behave safely (e.g., racing for capabilities, handing them over to others, e.g. Microsoft, with little or no control over how they’re used). You yourself wrote:
How is this compatible with the quote above?!
Well, we are not very good at it, but generally speaking, however much capitalism seems to be acting to degrade our food, food companies are not knowingly and routinely putting poisonous additives in food.
And however bad medicine is, it does seem to be a net positive these days.
Both of these things are a big improvement on Victorian times!
So maybe we are a tiny bit better at it than we used to be?
Not convinced it actually helps, mind....
Food companies are adding sesame (an allergen for some) to food in order to not be held responsible for it not containing sesame. Alloxan is used to whiten dough (https://www.sciencedirect.com/science/article/abs/pii/S0733521017302898 — for the “it’s false” comment), and it is also used to induce diabetes in the lab (https://www.sciencedirect.com/science/article/abs/pii/S0024320502019185). RoundUp is in nearly everything.
https://en.m.wikipedia.org/wiki/List_of_withdrawn_drugs#Significant_withdrawals plenty of things keep getting added to this list.
We have never made a safe human. CogEms would be safer than humans though because they won’t unionize and can be flipped off when no longer required.
Edit: sources added for the x commenter.
Can you list a concrete research path which you’re pursuing in light of this strategy? This all sounds ok in principle, but I’d bet alignment problems show up in concrete pathways.
Yes, I would really appreciate that. I find this approach compelling in the abstract, but what does it actually cache out in?
My best guess is that it means lots of mechanistic interpretability research, identifying subsystems of LLMs (or similar) and trying to explain them, until eventually they’re made of less and less Magic. That sounds good to me! But what directions sound promising there? E.g. the only result in this area I’ve done a deep dive on, Transformers learn in-context by gradient descent, is pretty limited as it only gets a clear match for linear (!) single-layer (!!) regression models, not anything like a LLM. How much progress does Conjecture expect to really make? What are other papers our study group should read?
Could you elaborate a bit more about the strategic assumptions of the agenda? For example,
1. Do you think your system is competitive with end-to-end Deep Learning approaches?
1.1. Assuming the answer is yes, do you expect CoEm to be preferable to users?
1.2. Assuming the answer is no, how do you expect it to get traction? Is the path through lawmakers understanding the alignment problem and banning everything that is end-to-end and doesn’t have the benefits of CoEm?
2. Do you think this is clearly the best possible path for everyone to take right now or more like “someone should do this, we are the best-placed organization to do this”?
PS: Kudos to publishing the agenda and opening up yourself to external feedback.
This update massively reduces my expectation for Conjecture’s future value. When you’re a small player in the field, you produce value through transferable or bolt-on components, such as Conjecture’s interpretability and simulator work. CoEm, on the other hand, is completely disconnected from other AGI or AI safety work, and pretty much only has any impact if Conjecture is extraordinarily successful.
We mostly don’t know how to do alignment, so I take “not obviously bad, and really different from other approaches” to be a commendable quality for a research proposal. I also like research that is either meh, or extraordinarily successful, first because these pathways are going to almost always be neglected in a field, and second because I think most really great things in general come from these high risk of doing nothing (if you don’t have inside knowledge), high return if you do something strategies.
If you want to make a competitive AGI from scratch (even if you only want “within 5 years of best AI”), you just have to start way earlier. If this project was announced 7 years ago I’d like it much more, but now is just too late; you’d need huge miracles to finish in time.
Why do you think that it will not be competitive with other approaches?
For example, it took 10 years to sequence the first human genome. After nearly 7 years of work, another competitor started an alternative human genome project using a completely different technology, and both projects were finished at approximately the same time.
I dislike this post. I think it does not give enough detail to evaluate whether the proposal is a good one and it doesn’t address most of the cruxes for whether this even viable. That said, I am glad it was posted and I look forward to reading the authors’ response to various questions people have.
The main idea:
“The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs.”
Do logical (not physical) emulation of the functions carried out by human brains.
Minimize the amount of Magic (uninterpretable processes) going on
Be able to understand the capabilities of your system and ensure it is bounded
Situate CoEm in the human capabilities regime so failures are human-like
Be re-targetable
“Once we have powerful systems that are bounded to the human regime, and can corrigibly be made to do tasks, we can leverage these systems to solve many of the hard problems necessary to exit the acute vulnerable period, such as by vastly accelerating the progress on epistemology and more formal alignment solutions that would be applicable to ASIs.”
My thoughts:
So rather than a research agenda, this is more a set of desiderata for AI safety.
Authors acknowledge that this may be slower than just aiming for AGI. It’s unclear why they think this might work anyway. To the extent that Conjecture wants CoEm to replace the current deep learning paradigm, it’s unclear why they think it will be competitive or why others will adopt it; those are key strategic cruxes.
The authors also don’t give enough details for a reader to tell if they stand a chance; they’re missing a “how”. I look forward to them responding to the many comments raising important questions.
I understood the proposal as “let’s create relatively small, autonomous AIs of multi-component cognitive architecture instead of unitary DNNs becoming global services (and thus ushering cognitive globalisation)”. The key move seems to be that you want to increase the AI architecture’s interpretability by pushing some intelligence into the multi-component interaction from the “DNN depths”. In this setup, components may remain relatively small (e.g., below 100B parameter scale, i.e., at the level of the current SoTA DNNs), while their interaction leads to the emergence of general intelligence, which may not be achievable for unitary DNNs at these model scales.
The proposal seems to be in some ways very close to Eric Drexler’s recent Open Agency Model. Both your proposal and Drexler’s “open agencies” seem to noticeably allude to the (classic) approaches to cognitive architecture, a la OpenCog, or LeCun’s H-JEPA.
As well as Drexler, it seems that you take for granted that multi-component cognitive architectures will have various “good” properties, ranging from interpretability to retargetability (which Michael Levin calls persuadability, btw). However, as I noted in this comment, none of these “good properties” are actually granted for an arbitrary multi-component AI. It must be demonstrated for a specific multi-component architecture why it is more interpretable (see also in the linked comment my note that “outputting plans” != interpretability), persuadable, robust, and ethical than an alternative architecture.
Your approach to this seems to be “minimal”, that is, something like “at least we know the level of interpretability, persuadability/corrigibility, robustness, and ethics of humans, so let’s try to build AI ‘after humans’ so that we get at least these properties, rather than worse properties”.
As such, I think this approach might not really solve anything and might not “end the acute risk period”, because it amounts to “building more human-like minds”, with all other economic, social, and political dynamics intact. I don’t see how building “just more humans” prevents corporations from pursuing cognitive globalisation in one form or another. The approach just seems to add more “brain-power” to the engine (either via building AI “like humans, but 2-3 sigmas more intelligent”, or just building many of them and keeping them running around the clock), without targeting any problems with the engine itself. In other words, the approach is not placed within a frame of a larger vision for civilisational intelligence architecture, which, I argued, is a requirement for any “AI safety paradigm”: “Both AI alignment paradigms (protocols) and AGI capability research (intelligence architectures) that don’t position themselves within a certain design for civilisational intelligence are methodologically misguided and could be dangerous.”
Humans have rather bad interpretability (see Chater’s “The Mind is Flat”), bad persuadability (see Scott Alexander’s “Trapped Priors”), bad robustness (see John Doyle’s hijackable language and memetic viruses), poor capability for communication and alignment (as anyone who tried to reliably communicate any idea to anyone else or to align with anyone else on anything can easily attest; cf. discussion of communication protocols in “Designing Ecosystems of Intelligence from First Principles”), and, of course, poor ethics. It’s also important to note that all these characteristics seem to be largely uncorrelated (at least, not strictly pegged) with raw general intelligence (GI factor) in humans.
I agree with you and Friston and others who worry about the cognitive globalisation and hyperscaling approach, but I also think that in order to improve the chances of humanity, we should at least aim at a better-than-human architecture from the beginning. Creating many AI minds of an architecture “just like humans” (unless you count on some lucky emergence, and that even targeting “at humans” will yield an architecture “better than humans”) doesn’t seem to help with civilisational intelligence and robustness; it just accelerates the current trends.
I’m also optimistic because I see nothing impossibly hard or intractable in designing cognitive architectures that would be better than humans from the beginning and getting good engineering assurances that the architecture will indeed yield these (better than human) characteristics. It just takes time and a lot of effort (but so does architecting AI “just like humans”). E.g., (explicit) Active Inference architecture seems to help significantly at least with interpretability and the capacity for reliable and precise communication and, hence, belief and goal alignment (see Friston et al., 2022).
Thank you. You phrased the concerns about “integrating with a bigger picture” better than I could. To temper the negatives, I see at least two workable approaches, plus a framing for identifying more workable approaches.
Enable other safety groups to use and reproduce Conjecture’s research on CogEms so those groups can address more parts of the “bigger picture” using Conjecture’s findings. Under this approach, Conjecture becomes a safety research group, and the integration work of turning that research into actionable safety efforts becomes someone else’s task.
Understand the societal motivations for taking short-term steps toward creating dangerous AI, and demonstrate that CogEms are better suited for addressing those motivations, not just the motivations of safety enthusiasts, and not just hypothetical motivations that people “should” have. To take an example, OpenAI has taken steps towards building dangerous AI, and Microsoft has taken another dangerous step of attaching a massive search database to it, exposing the product to millions of people, and kicking off an arms race with Google. There were individual decision-makers involved in that process, not just as “Big Company does Bad Thing because that’s what big companies do.” Why did they make those decisions? What was the decision process for those product managers? Who created the pitch that convinced the executives? Why didn’t Microsoft’s internal security processes mitigate more of the risks? What would it have taken for Microsoft to have released a CogEm instead of Sydney? The answer is not just research advances. Finding the answers would involve talking to people familiar with these processes, ideally people that were somehow involved. Once safety-oriented people understand these things, it will be much easier for them to replace more dangerous AI systems with CogEms.
As a general framework, there needs to be more liquidity between the safety research and the high-end AI capabilities market, and products introduce liquidity between research and markets. Publishing research addresses one part of that by enabling other groups to productize that research. Understanding societal motivations addresses another part of that, and it would typically fall under “user research.” Clarity on how others can use your product is another part, one that typically falls under a “go-to-market strategy.” There’s also market awareness & education, which helps people understand where to use products, then the sales process, which helps people through the “last mile” efforts of actually using the product, then the nebulous process of scaling everything up. As far as I can tell, this is a minimal set of steps required for getting the high-end AI capabilities market to adopt safety features, and it’s effectively the industry standard approach.
As an aside, I think CogEms are a perfectly valid strategy for creating aligned AI. It doesn’t matter if most humans have bad interpretability, persuadability, robustness, ethics, or whatever else. As long as it’s possible for some human (or collection of humans) to be good at those things, we should expect that some subclass of CogEms (or collection of CogEms) can also be good at those things.
I also have concerns with this plan (mainly about timing, see my comment elsewhere on this thread). However, I disagree with your concerns. I think that a CogEm as described here has much better interpretability than a human brain (we can read the connections and weights completely). Based on my neuroscience background, I think that human brains are already more interpretable and controllable than black-box ML models. I think that the other problems you mention are greatly mitigated by the fact that we’d have edit-access to the weights and connections of the CogEm, and thus would be able to redirect it much more easily than a human. I think that having full edit access to the weights and connections of a human brain would make that human quite controllable! Especially in combination with being able to wipe its memory and restore it to a previous state, rerun it over test scenarios many thousands of times with different parameters, etc.
On the surface level, it feels like an approach with a low probability of success. Simply put, the reason is that building CoEm is harder than building any AGI.
I consider it to be harder not only because it is not what everyone already does, but also because it seems similar to the AI people tried to create before deep learning, which didn’t work at all until they decided to switch to Magic, which [comparatively] worked amazingly.
Some people are still trying to do something along these lines (e.g. Ben Goertzel), but I haven’t seen anything working that is even remotely comparable with deep learning yet.
I think that the gap between (1) “having some AGI which is very helpful in solving alignment” and (2) “having very dangerous AGI” is probably quite small.
It seems very unlikely that CoEm will be the first system to reach (1), so probably it is going to be some other system. Now, we can either try to solve alignment using this system or wait until CoEm is improved enough so it reaches (1). Intuitively, it feels like we will go from (1) to (2) much faster than we will be able to improve CoEm enough.
So overall I am quite sceptical, but I think it still can be the best idea if all other ideas are even worse. I think that more obvious ideas like “trying to understand how Magic works” (interpretability) and “trying to control Magic without understanding” (things like Constitutional AI etc.) are somewhat more promising, but there are a lot of efforts in this direction, so maybe somebody should try something else. Unfortunately, it is extremely hard to judge if that’s actually the case.
This post doesn’t make me actually optimistic about Conjecture actually pulling this off, because for that I would have to see details, but it does at least look like you understand why this is hard and why the easy versions, like just telling GPT5 to imitate a nice human, won’t work. And I like that this actually looks like a plan. Now maybe it will turn out to not be a good plan, but at least it is better than OpenAI’s plan of
“we’ll figure out from trial and error how to make the Magic safe somehow”.
This is interesting.
I’m curious if you see this approach as very similar to Ought’s approach? Which is not a criticism, but I wonder if you see their approach as akin to yours, or what the major differences would be.
Connor gives more information about CoEms in a recent interview:
Doesn’t that require understanding why humans have (or don’t have) certain safety properties? That seems difficult.
To be frank, I have no idea what this is supposed to mean. If “make non-magical, humanlike systems” were actionable[1], there would not be much of an alignment problem. If this post is supposed to indicate that you think you have an idea for how to do this, but it’s a secret, fine. But what is written here, by itself, sounds like a wish to me, not like a research agenda.
Outside of getting pregnant, I suppose.
I hate to do it, but can’t resist the urge to add a link to my article First human upload as AI Nanny.
The idea is that human-like AI is intrinsically safer and can be used to control AI development.
As there are no visible ways to create safe self-improving superintelligence, but it is looming, we probably need temporary ways to prevent its creation. The only way to prevent it is to create a special AI which is able to control and monitor all places in the world. The idea has been suggested by Goertzel in the form of an AI Nanny, but his Nanny is still superintelligent and not easy to control, as was shown by Bensinger et al. We explore here the ways to create the safest and simplest form of AI which may work as an AI Nanny. Such an AI system will be enough to solve most problems which we expect AI to solve, including control of robotics and acceleration of medical research, but will present less risk, as it will be less different from humans. As AI police, it will work as an operating system for most computers, producing a world surveillance system which will be able to envision and stop any potential terrorists and bad actors in advance. As uploading technology is lagging, and neuromorphic AI is intrinsically dangerous, the most plausible way to a human-based AI Nanny is either a functional model of the human mind or a Narrow-AI-empowered group of people.
Yes, unfortunately I think you are right avturchin. I have come to similar conclusions myself. See my comment elsewhere on this thread for some of my thoughts.
Edit: I read your article, and we have a lot of agreements, but also some important disagreements. I think the main disagreement is that I spent years studying neuroscience and thinking hard about intelligence amplification via brain-computer interfaces and genetic enhancement of adult brains, and also about brain preservation and uploading. For the past 7 years I’ve instead been studying machine learning, with my thoughts focused on what will or won’t lead to AGI. I’m pretty convinced that we’re technologically a LOT closer to a dangerously recursively self-improving AGI than we are to a functionally useful human brain emulation (much less an accurate whole brain scan).
Here are some recent thoughts I’ve written down about human brain emulation:
Our best attempts at high accuracy partial brain emulation so far have been very computationally inefficient. You can accept a loss of some detail of the emulation, and run a more streamlined simulation. This gives you many orders of magnitude speed up and removes a lot of need for accuracy of input detail. There’s lots of work on this that has been done. But the open question is, how much simplification is the correct amount of simplification? Will the resulting emulated brain be similar enough to the human you scanned the brain of that you will trust that emulation with your life? With the fate of all humanity? Would a team of such emulations be more trustworthy than copies of just the best one?
In the extreme of simplification, you no longer have something which is clearly the brain you scanned; you have a system of interconnected neural networks heavily inspired by the circuitry of the human brain, with some vague initialization parameters suggested by the true brain scan. At this point, I wouldn’t even call it an emulation, more like a ‘brain-inspired AGI’. If you are going to go this route, then you might as well skip the messy physical aspects of step 1, and the complicated data analysis of step 2, and just jump straight to working with the already processed data publicly available. You won’t be likely to get a specific human out of the end of this process even if you put a specific human in at the beginning. Skipping the hard parts and glossing over the details is definitely the fastest, easiest route to a ‘brain-like AGI’. It loses a big chunk of the value, which was that you knew and trusted a particular human and wanted an accurate emulation of that human to be entrusted with helping society.
Yet, there is also still a lot of value you retain. This ‘brain-inspired AGI’ will share a lot of similarities with the brain and let us use the suite of interpretability tools that neuroscience has come up with for mammalian brains. This would be a huge boon for interpretability work. There would be control-problem gains, since you’d have a highly modular system with known functions of the various modules and the ability to tune the activity levels of those modules to get desired changes in behavioral output. There would be predictability gains, since our intuitions and study of human behavior would better carry over to a computational system so similar to the human brain. In contrast, our human-behavior-prediction instincts seem to mainly lead us astray when we attempt to apply them to the very non-brain-like architectures of modern SotA ML systems like large language models.
Many of the groups who have thought deeply about this strategic landscape have settled on trying for this ‘brain-like AGI’ strategy. The fact that it’s so much faster and easier (still not as fast or easy as just scaling/improving SotA mainstream ML) than the slow, highly accurate brain scan-and-emulation path means that people following the low-accuracy, high-efficiency ‘brain-inspired AGI’ path will almost certainly have a functioning system many years before the slow path. And it will run many orders of magnitude faster. So, if we could conquer the Molochian social challenge of a race to the bottom and get society to delay making unbrainlike-AGI, then we certainly ought to pursue the slower but safer path of accurate human brain emulation.
I don’t think we have that option though. I think the race is on whether we like it or not, and there is little hope of using government to effectively stop the race with regulation. Slow it down a bit, that I think is feasible, but stop? No. Slow it down enough for the slow-but-safe path to have a chance of finishing in time? Probably not even that.
If this viewpoint is correct, then we still might have a hope of using regulatory slowdown plus the lossy-approximation of ‘brain-like AGI’ to get to a safer (more understandable and controllable) form of AGI. This is the strategy embraced by the Conjecture post about CogEms, by some leading Japanese AI researchers, by Astera:Obelisk (funded by Jed McCaleb), by Generally Intelligent (funded in part by Jed McCaleb), by some of the researchers at DeepMind (it’s not the organization-as-a-whole’s primary focus, but it is the hope/desire of some individuals within the organization), and a long list of academic researchers straddling the borderlands of neuroscience and machine learning.
I think it’s a good bet, and it was what I endorsed as of Fall 2022. Then I spent more time researching timelines and looking into what remaining obstacles I think must be cleared for mainstream ml to get to dangerously powerful AGI… and I decided that there was a big risk that we wouldn’t have time even for the fast lossy-approximation of ‘brain-like AGI’. This is why I responded to Conjecture’s recent post about CogEms with a concerned comment of this apparent strategic oversight on their part: https://www.lesswrong.com/posts/ngEvKav9w57XrGQnb/cognitive-emulation-a-naive-ai-safety-proposal?commentId=tA838GyENzGtWWARh
....
A terminology note: you coin the term human-like AI; others have used the terms brain-like AI, brain-like AGI, or CogEms (cognitive emulations). I believe all these terms point at a similar cluster of ideas, with somewhat different emphases.
I don’t really have a preference amongst them; I just want to point out to interested readers that, in order to find a broader range of discussion on these topics, they should consider doing web searches for each of these terms.
A nice summary statement from avturchin’s article, which I recommend others interested in this subject read:
‘We explored two approaches where AI exists and doesn’t exist simultaneously: human uploads based AI and secret intelligence organization of a superpower based AI (NSI-AI). The first is more technically difficult, but better, and NSI-AI is more realistic in the near term, but seems to be less moral, as secret services are known to not be aligned with general human values, and will reach global domination probably via illegal covert operations including false-flag attacks and terrorism.’
Not an actual objection to this proposal, but an important note is that we don’t know the upper limits of human cognitive capabilities. Like, humans were bad at arithmetic before the invention of positional numeral systems, and after that they became surprisingly good. We know that somewhere within human abilities is the capability to persuade you to let them out of the box, and probably other various forms of “talk-control”. If we could have looked into the mind of Einstein while having no clue about classical mechanics, we would have understood nothing. I agree that there is a possibility to not blow up with certainty, but I would like to invent corrigibility measures for CoEms before implementing them.
Relatedly, CoEms could be run at potentially high speed-ups, and many copies or variations could be run together. So we could end up in the classic scenario of a smarter-than-average “civilization”, with “thousands of years” to plan, that wants to break out of the box.
This still seems less existentially risky, though, if we end up in a world where the CoEms retain something approximating human values. They might want to break out of the box, but probably wouldn’t want to commit species-cide on humans.
As far as I understand, the point of this proposal is that “human-like cognitive architecture ≈ cognitive containability ≈ sort of safety”, not “human-like cognitive architecture ≈ human values”. I just want to say that even a human can be cognitively uncontainable relative to another human, because they can learn mental tricks that look to another human like Magic.
What do you see as the key differences between this and research in (theoretical) neuroscience? It seems to me like the goals you’ve mentioned are roughly the same goals as those of that field: roughly, to interpret human brain circuitry, often through modelling neural circuits via artificial neural networks. For example, see research like “Correlative Information Maximization Based Biologically Plausible Neural Networks for Correlated Source Separation”.
Looking forward to more details. I generally agree that building AIs that make “the right decisions for the right reasons” by having their thought processes parallel ours is a worthwhile direction.
On the chess thing, the reason why I went from ‘AI will kill our children’ to ‘AI will kill our parents’ shortly after I understood how AlphaZero worked was precisely because it seemed to play chess like I do.
I’m an OK chess player (1400ish), and when I’m playing I totally do the ‘if I do this and then he moves this and then…’ thing, but not very deep. And not any deeper than I did as a beginner, and I’m told grandmasters don’t really go any deeper.
Most of the witchy ability to see good chess moves is coming from an entirely opaque intuition about what moves would be good, and what positions are good.
You can’t explain this intuition in any way that allows it to move from mind to mind, although you can sometimes in retrospect justify it, or capture bits of it in words.
You train it through doing loads of tactics puzzles and playing loads of games.
AlphaZero was the first time I’d seen an AI algorithm where the magic didn’t go away after I’d understood it.
The first time I’d looked at something and thought: “Yes, that’s it, that’s intelligence. The same thing I’m doing. We’ve solved general game playing and that’s probably most of the way there.”
Human intelligence really does, to me, look like a load of opaque neural nets combined with a rudimentary search function.
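In code, that picture looks something like this (a toy sketch, not AlphaZero; the game and the evaluator are placeholders):

```rust
// Toy sketch: an opaque learned evaluation ("intuition") plus a shallow search.
fn learned_eval(position: u64) -> f64 {
    // Stand-in for an opaque neural-network value estimate of a position.
    let h = position.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    (h % 2001) as f64 / 1000.0 - 1.0
}

fn legal_moves(position: u64) -> Vec<u64> {
    // Placeholder move generator for an abstract game.
    (1..=3u64).map(|i| position.wrapping_mul(31).wrapping_add(i)).collect()
}

// Rudimentary negamax search, guided entirely by the opaque evaluation.
fn shallow_search(position: u64, depth: u32) -> (f64, Option<u64>) {
    if depth == 0 {
        return (learned_eval(position), None);
    }
    let mut best = (f64::NEG_INFINITY, None);
    for mv in legal_moves(position) {
        let (score, _) = shallow_search(mv, depth - 1);
        let score = -score; // the opponent's best reply counts against us
        if score > best.0 {
            best = (score, Some(mv));
        }
    }
    best
}

fn main() {
    println!("{:?}", shallow_search(0, 2));
}
```

All of the ‘intelligence’ here lives in the opaque evaluation; the search on top is trivial.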
I’m struggling to understand how this is different from “we will build aligned AI to align AI”. Specifically: Can someone explain to me how human-like and AGI are different? Can someone explain to me why human-like AI avoids typical x-risk scenarios (given those human-likes could, say, clone themselves, speed themselves up, rewrite their own software, and easily become unbounded)? Why isn’t an emulated cognitive system a real cognitive system? I don’t understand how you can emulate a human-like intelligence and it not be the same as a fully human-like one.
Currently my reading of this is: we will build human-like AI because humans are bounded, so it will be too; those bounds are (1) sufficient to prevent x-risk and (2) helpful for (and maybe even the reason for) alignment. Isn’t a big, wide-open, unsolved part of the alignment problem “how do we keep intelligent systems bounded”? What am I missing here?
I guess one maybe supplementary question as well is: how is this different from normal NLP capabilities research, which is fundamentally about developing and understanding the limitations of human-like intelligence? Most folks in the field who publish in, say, ACL conferences would explicitly think of this as what they are doing, and not trying to build anything more capable than humans.
I had thoughts of doing something very like this a few years ago, back when I still thought we had around 20 years until AGI. Now I think we have <5 years until AGI, and I suspect you don’t have time for this. Do you also have a plan in mind for delaying the deployment of dangerous AGI to give humanity more time for working on alignment?
I do not ask this question rhetorically. I have thoughts along this line and would like to discuss them with you.
Can you elaborate on your comment?
It seems so intriguing to me, and I would love to learn more about “Why it’s a bad strategy if our AGI timeline is 5 years or less”?
Thanks for your interest, Igor. Let me try better to explain my position. Basically, I am in agreement that ‘brain-like AGI’ or CogEms is the best and fastest path towards a safe-enough AGI to at least help us make faster progress towards a more complete alignment solution. I am worried that this project will take about 10-15 years, and that mainstream ML is going to become catastrophically dangerous within about 5 years.
So, to bridge this gap I think we need to manufacture a delay. We need to stretch the time we have between inventing dangerously capable AGI systems and when that invention leads to catastrophe. We also need to be pursuing alignment (in many ways, including via developing brain-like AGI), or the delay will be squandered. My frustration with Conjecture’s post here is that they talk about pursuing brain-like AGI without at least mentioning that time might be too short for that and that in order for them to be successful they need someone to be working on buying them time.
My focus over the past few months has been on how we might manufacture this delay. My current best answer is that we will have to make do with something like a combination of social and governmental forces and better monitoring tools (compute governance), better safety evaluations (e.g. along the lines of ARC’s safety evals, but even more diverse and thorough), and use of narrow AI tools to monitor and police the internet, using cyberweapons and perhaps official State police force or military might (in the case of international dispute) to stomp out rogue AGI before it can recursively self-improve to catastrophically strong intelligence. This is a tricky subject, potentially quite politically charged and frightening, an unpleasant scenario to talk about. Nevertheless, I think this is where we are and we must face that reality.
I believe that there will be several years before we have any kind of alignment solution, but where we have the ability to build rapidly recursively self-improving AGI which cannot be controlled. Our main concern during this period will be that many humans will not believe that their AGI cannot be controlled, and will see a path to great personal power by building and launching their own AGI. Perhaps also, terrorists and state actors will deliberately attempt to manufacture it as a weapon. How do we address this strategic landscape?
Thanks for your elaborate response!
But why do you think that this project will take so much time? Why can’t it be implemented faster?
Well, because a lot of scientists have been working on this for quite a while, and the brain is quite complex. On the plus side, there’s a lot of existing work. On the negative side, there’s not a lot of overlap between the group of people who know enough about programming and machine learning and large scale computing vs the group of people who know a lot about neuroscience and the existing partial emulations of the brain and existing detailed explanations of the circuits of the brain.
I mean, it does seem like the sort of project which could be tackled if a large well-funded determined set of experts with clear metrics worked on in parallel. I think I more despair of the idea of organizing such an effort successfully without it drowning in bureaucracy and being dragged down by the heel-dragging culture of current academia.
Basically, I estimate that Conjecture has a handful of smart determined people and maybe a few million dollars to work with, and I estimate this project being accomplished in a reasonable timeframe (like 2-3 years) as an effort that would cost hundreds of millions or billions of dollars and involve hundreds or thousands of people. Maybe my estimates are too pessimistic. I’m a lot less confident about my estimates of the cost of this project than I am in my estimates of how much time we have available to work with before strong AGI capable of recursive self-improvement gets built. I’m less confident about how long we will have between dangerous AGI is built and it actually gets out of control and causes a catastrophe. Another 2 years? 4? 5? I dunno. I doubt very much that it’ll be 10 years. Before then, some kind of action to reduce the threat needs to be taken. Plans which don’t seem to take this into account seem to me to be unhelpfully missing the point.
Eh. It’s sad if this problem is really so complex.
Thank you. At this point, I feel like I have to stick to some way to align AGI, even if it has not that big chance to succeed, because it looks like there are not that many options.
Well, there is the possibility that some wealthy entities (individuals, governments, corporations) will become convinced that they are truly at risk as AGI enters the Overton window. In which case, they might be willing to drop a billion of funding on the project, just in case. The lure of developing uploading as a path to immortality and superpowers may help convince some billionaires. Also, as AGI becomes more believable and the risk becomes more clear, top neuroscientists and programmers may be willing to drop their current projects and switch to working on uploading. If both those things happen, I think there’s a good chance it would work out. If not, I am doubtful.
This sounds potentially legislatable. More so then most ideas. You can put it into simple words. “AGI” can’t do anything that you couldn’t pay an employee to do.
You give a reason for not sharing technical details as “other actors are racing for as powerful and general AIs as possible.” I don’t understand. If your methods are for controlling powerful AIs, why wouldn’t you want these methods released?
I notice I am really confused here. Besides what I’ve already listed, I have learned Conjecture is a for-profit company. The ability of a CoEm to replace a human at a task already makes it much more general than current models, yet you seem to imply that they will be less general but more aligned? Are these models your intended product, or is it the ability to align these models?
A guy from Conjecture told me about this proposal along the lines of “let’s create a human-level AI system that’s built kind of like humans and is safe to run at a human speed”, and it seemed like a surprisingly bad proposal, so I looked up this post and it still looks surprisingly bad:
Even if you succeed at this, how exactly do you plan to use it? Running one single human at a human speed seems like the kind of thing one can get by simply, you know, hiring someone; running a thousand of these things at 1000x normal speed means you’re running some completely different AI system that’s bound to have a lot of internal optimisation pressures leading to sharp left turn dynamics and all of that, and more importantly, you need to somehow make the whole system aligned, and my current understanding (from talking to that guy from Conjecture) is you don’t have any ideas for how to do that.
If it is a proposal of “how we want to make relatively safe capable systems”, then cool, I just want someone to be solving the alignment problem as in “safely preventing future unaligned AIs from appearing and killing everyone”.
The capabilities of one human-level intelligence running at 1x human speed are not enough to solve anything alignment-complete (or you’d be able to spend time on some alignment-complete problem and solve it on your own).
If it is not intended to be an “alignment proposal” and is just a proposal of running some AI system safely, I’d like to know whether Conjecture has some actual alignment plan that addresses the hard bits of the problem.
My summary:
(in case you want to copy-paste and share this)
Article by Conjecture, from February 25th, 2023.
Title: `Cognitive Emulation: A Naive AI Safety Proposal`
Link: <https://www.lesswrong.com/posts/ngEvKav9w57XrGQnb/cognitive-emulation-a-naive-ai-safety-proposal>
(Note on this comment: I posted (something like) the above on Discord, and am copying it to here because I think it could be useful. Though I don’t know if this kind of non-interactive comment is okay.)
What interfaces are you planning to provide that other AI safety efforts can use? Blog posts? Research papers? Code? Models? APIs? Consulting? Advertisements?
I’m unsure whether CoEms as described could actually help in solving alignment. It may be the case that advancing alignment requires enough cognitive capabilities to make the system dangerous (unless we have already solved alignment).
I doubt that a single human mind which runs on a computer is guaranteed to be safe—this mind would think orders of magnitude faster (speed superintelligence) and copy itself. Maybe most humans would be safe. Maybe power corrupts.
re: reducing magic and putting bounds, I’m reminded of Cleo Nardo’s Hodge Podge Alignment proposal.
Contains a typo.
along as it is
==>as long as it is