I’m NOT confused about ‘how could AIs send each other a link to their Git repo?’.
I’m confused as to what other info would convince an AI that any particular source code is what another AI is running.
How could any running system prove to another what source code it’s running now?
How could a system prove to another what source code it will be running in the future?
What source code and what machine code is actually being executed on some particular substrate is an empirical fact about the world, so in general, an AI (or a human) might learn it the way we learn any other fact—by making inferences from observations of the world.
For example, maybe you hook up a debugger or a waveform reader to the AI’s CPU to get a memory dump, reverse engineer the running code from the memory dump, and then prove some properties you care about follow inevitably from running the code you reverse engineered.
In general though, this is a pretty hard, unsolved problem—you probably run into a bunch of issues related to embedded agency pretty quickly.
To get an intuition for why these problems might be possible to solve, consider the tools that humans who want to cooperate or verify each other’s decision processes might use in more restricted circumstances.
For example, maybe two governments which distrust each other conduct a hostage exchange by meeting in a neutral country, bringing lots of guns along with additional hostages they don’t intend to exchange that day. Each government publicly declares that if the exchange doesn’t go well, it will shoot a bunch of those extra hostages.
Or, suppose two humans are in a True Prisoner’s Dilemma with each other, in a world where fMRI technology has advanced to the point where semantically meaningful inner thoughts can be read off a scan in real time. Both players agree to undergo such a scan while making their decision, and to make the scan results available to their opponent. This might not allow either to prove that their opponent will cooperate iff they cooperate, but it will probably still make it a lot easier for them to robustly cooperate, for valid decision-theoretic reasons rather than out of honor or a spirit of friendliness.
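For intuition about why mutual transparency helps, this setup resembles toy “program equilibrium” models, in which each player’s decision procedure can condition on the other’s source code. A minimal sketch in Python (the bot names and stand-in “source” strings are illustrative, not from the scenario above):

```python
# Toy "program equilibrium" sketch: each player's decision procedure can
# read the other's declared source before choosing, analogous to the
# real-time fMRI scan. A "CliqueBot" cooperates exactly when the opponent
# is running the same program it is.

def clique_policy(my_source: str, opponent_source: str) -> str:
    # Cooperate iff the opponent's source is identical to mine.
    return "C" if opponent_source == my_source else "D"

def defect_policy(my_source: str, opponent_source: str) -> str:
    return "D"  # unconditional defector

# Each player is a (source-as-data, executable policy) pair; here the
# "source" is a stand-in string rather than real program text.
clique_bot = ("cooperate iff opponent matches me", clique_policy)
defect_bot = ("always defect", defect_policy)

def play(p1, p2):
    (src1, pol1), (src2, pol2) = p1, p2
    # Each policy sees both its own source and its opponent's.
    return pol1(src1, src2), pol2(src2, src1)

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D'): no exploitation
```

Note that exact source matching is brittle: two differently written but behaviorally identical cooperators would defect against each other here, which is part of why robust versions of this idea need something subtler than string equality.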
What I don’t understand is how an agent can trust that it isn’t being deceived in this information sharing, especially since a defector who successfully deceives its counterpart into cooperating gets a larger reward. Especially if interpretability of sufficiently complex models is not computable.
In the human example, there exists a machine that can interpret the human’s thoughts and is presumably far more computationally powerful than the human. Could such a machine exist in the case of a superintelligent agent?
Probably, if the agents make themselves amenable to interpretability and inspection? If I, as a human, am going through a security checkpoint (say, at an airport), there are two ways I can get through successfully:
1. Be clever enough to hide any contraband or otherwise circumvent the security measures in place.
2. Just don’t bring any contraband, and go through the checkpoint honestly.
As I get more clever and more adversarial, the security checkpoint might need to be more onerous and more carefully designed, to ensure that the first option remains infeasible to me. But the designers of the checkpoint don’t necessarily have to be smarter than me, they just have to make the first option difficult enough, relative to any benefit I would gain from it, such that I am more likely to choose the second option.
Yes, but part of what makes interpretability hard is that it may be incomputable. And part of what makes interpretable models difficult to create is that there may be some inner deception you are missing. I think any argument that makes alignment difficult symmetrically makes two agents working together difficult. I mean, think about it: if agent 1 can verify agent 2 well enough to cooperate with it, agent 1 can also bootstrap a version of agent 2 aligned with itself.
Intuitively, interpreting an AI feels like trying to break a cryptographic hash, but much harder. For example, even interpreting GPT-2 is a monumental task for us humans, despite our vastly superior computational abilities and knowledge. After all, something 2 OOMs superior (GPT-4) could confidently interpret only a fraction of a percent of its neurons, and only with limited confidence. It’s likely these are the relatively trivial ones, too.
Therefore, I think the only way for agent 1 to confirm the motives of agent 2 and eliminate unknown unknowns is to fully simulate it, assuming there isn’t some shortcut to interpretability. And of course, fully simulating agent 2 requires boxing it, and probably having a large capability edge over it.
I don’t think the TSA analogy is perfect either. There is no (Defect, Cooperate) case, and the risks in it are known unknowns. You have a good understanding of what happens if you fail (you just go to jail), but adversarial agents have absolutely no understanding of what happens to them if they are wrong. Being wrong here means being eliminated, so the negative expectation of (D, C), for example, is unbounded. You also have a reasonable expectation of what the TSA will do to verify you aren’t breaking the law; their scanning methodology alone eliminates a lot of detection methods from the probability space. You have no idea what another agent will do. Maybe it sent an imperfect clone to communicate with you; maybe there’s a hole in the proposed consensus protocol you haven’t considered. The point is that this is one giant unknown unknown that can’t be modeled, because you are trying to predict what someone with greater capabilities than you can do.
Lastly, I think this chain of reasoning breaks down if there is some robust consensus protocol or some way to make models fully interpretable. However, if such a method exists, we would probably just align the AGI to begin with, since our narrow AI would help us create interpretable AGI, or a robust system for detecting inner deception.
I agree the analogy breaks down in the case of very adversarial agents and/or big gaps in intelligence or power. My point is just that these problems probably aren’t unsolvable in principle, for humans or AI systems who have something to gain from cooperating or trading, and who are at roughly equal (but not necessarily identical) levels of intelligence and power. See my response to the sibling comment here for more.
Aside: I think “incomputable” is a vague term, and results from computational complexity theory often don’t have as much relevance to these kind of problems as people intuitively expect. See the second part of this comment for more.
I mean incomputable in the sense that the computation would exceed the physical resources of the universe, whether the problem is in P or NP. We could adopt a weaker definition of incomputable too: say, that it would exceed the capabilities of the agent by at least 1 OOM.
Specifically with regard to interpretability, if it isn’t possible to do easily, my intuition tells me it will require brute-force simulation/exhaustive search (think Bitcoin mining). In that case you need a system more powerful than whatever you are trying to interpret, which makes it difficult in the context of two agents at roughly similar capability levels.
I think it will be difficult for goal maximizers to cooperate for the same reasons it would be difficult for humans to survive in a world with a superintelligent goal maximizer. In almost every argument leading to doom for humans facing a paperclip-maximizing kind of ASI, you can replace “human” with “agent” and the argument will still stand.
It is easy for intelligent humans and groups of humans to cooperate because they very rarely have fanatical goals, and historically, the ones that did went to war.
This is a good point.
But I’m trying to develop some detailed intuitions about how this would or could work, in particular what practical difficulties there are and how they could be overcome.
(What do you mean by “waveform reader”?)
Some practical difficulties with your first paragraph:
How can AIs credibly claim that any particular CPU is running their code, or that a debugger connected to it isn’t being subverted via, e.g., a MITM attack?
How can AIs credibly claim that whatever the contents of a CPU’s memory are at some point, they won’t be replaced at some later point?
How could one AI safely execute code given to it by another (e.g. via “memory dump”)?
How could one AI feasibly run another’s code ‘fast enough’ to be able to determine that it could (probably) trust it now (even assuming [1], [2], and [3] are solved)?
[1] points to what I see as a big difficulty, i.e. AIs will probably (or could) be very distributed computing systems, and there might not be any practical way to fit one ‘into a single box’ for, e.g., careful inspection by others.
An oscilloscope. Note that it isn’t particularly realistic to hook up a scope to the kind of hardware that current AI systems are typically trained and run on.
But what I was trying to gesture at with this comment is that this is the kind of problem the AI might be able to help you with. If the AI isn’t willing or able to fold itself up into something that can run entirely on a single, human-inspectable CPU in an airgapped box, running code that is amenable to easily proving things about its behavior, you can just not cooperate with it (or not do whatever else you were planning to do by proving something about it), and shut it off instead.
(If the AI is already adversarial to the point where it won’t let you shut it off, and is running on a distributed system, you’ve already lost. Willingness to fold itself up and be shut off means that the AI is already pretty aligned; it wouldn’t surprise me if this problem is alignment-complete.)
As for your practical difficulties, I agree these are all problems. I am not saying the problem you pose isn’t hard, just that there doesn’t seem to be anything that makes it fundamentally impossible to solve in principle.
There is lots of academic research on hardware security and secure hardware, verifiable computing (e.g. using zk-SNARKs), formally verified programming, chain-of-trust, etc. that attempt to make progress on small pieces of this problem (not necessarily with a specific focus on AI). Stitching all of these things together into an actually end-to-end secure system for interacting with a smarter-than-human AI system is probably possible, but will require solving many unsolved problems, and designing and building AI systems in different ways than we’re currently doing. IMO, it’s probably better to just build an AI that provably shares human values from the start.
I guessed that’s what you meant but was curious whether I was right!
Any idea how a ‘folded-up’ AI would imply anything in particular about the ‘expanded’ AI?
If an AI ‘folded itself up’ and provably/probably ‘deleted’ its ‘expanded’ form (and all instances of that), as well as any other AIs or not-AI-agents under its control, that does seem like it would be nearly “alignment-complete” (especially relative to our current AIs), even if, e.g. the AI expected to be able to escape that ‘confinement’.
But that doesn’t seem like it would work as a general procedure for AIs cooperating or even negotiating with each other.
That is, they alter the payoff matrix instead of trying to achieve (C, C) in a prisoner’s dilemma. And that may be more efficient than spending time and energy on proofs, source-code verification protocols, and the yet-unknown downsides of being an agent that one can robustly achieve (C, C) with, while being the same kind of agent oneself.
We might develop schemes for auditable computation, where one party can come in at any time and check the other party’s logs. They should conform to the source code that the second party is supposed to be running; and also to any observable behavior that the second party has displayed. It’s probably possible to have logging and behavioral signalling be sufficiently rich that the first party can be convinced that that code is indeed being run (without it being too hard to check—maybe with some kind of probabilistically checkable proof).
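As a rough, hypothetical illustration of such a scheme (full replay by the auditor here stands in for cheaper checks like probabilistically checkable proofs), each step could append a record whose hash commits to the whole history, so an auditor who comes in at any time can replay the agreed source code against the log and detect divergence:

```python
# Minimal sketch of a tamper-evident computation log. The party being
# audited appends one record per step; each record's digest commits to
# the step's state, output, and the previous record's digest, forming a
# hash chain. The auditor replays the agreed code and checks both the
# contents and the chain.
import hashlib, json

def append_step(log, state, step_output):
    prev = log[-1]["digest"] if log else "genesis"
    record = {"state": state, "output": step_output, "prev": prev}
    record["digest"] = hashlib.sha256(
        json.dumps({k: record[k] for k in ("state", "output", "prev")},
                   sort_keys=True).encode()).hexdigest()
    log.append(record)

def audit(log, agreed_step_fn, initial_state):
    """Replay the agreed code and check contents plus the hash chain."""
    state, prev = initial_state, "genesis"
    for record in log:
        state, output = agreed_step_fn(state)
        # The record must match a replay of the agreed code...
        if (record["state"] != state or record["output"] != output
                or record["prev"] != prev):
            return False
        # ...and its digest must commit to exactly those contents.
        recomputed = hashlib.sha256(json.dumps(
            {"state": state, "output": output, "prev": prev},
            sort_keys=True).encode()).hexdigest()
        if recomputed != record["digest"]:
            return False
        prev = record["digest"]
    return True

# Example: the agreed-upon code doubles the state each step.
step = lambda s: (s * 2, f"emitted {s * 2}")
log = []
state = 1
for _ in range(3):
    state, out = step(state)
    append_step(log, state, out)

assert audit(log, step, 1)       # honest log verifies
log[1]["output"] = "tampered"
assert not audit(log, step, 1)   # any divergence is caught on replay
```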
However, this only provides a positive proof that certain code is being run, not a negative proof that no other code is being run at the same time. This part, I think, inherently requires knowing something about the other party’s computational resources. But if you can know about those, then it might be possible. For a perhaps dystopian example, if you know your counterparty has compute A, and the program you want them to run takes compute B, then you could demand they do something difficult but easily checkable, like inverting hash functions, that will soak up around A - B of their compute, so they have nothing left over to do anything secret with.
Can the agent just mute its capabilities while it does this computation? There are very slick ways to speed up computation, and likewise slick ways to slow it down. The agent could, say, mess up cache coherency in its hardware, store data types differently, ignore the outputs of some of its compute, or maybe run faster than the other agent expects by devising a faster algorithm, using hardware-level optimizations that exploit strange physics the other agent hasn’t thought of, etc.
Secondly, how would an agent convince another to run expensive code that takes up their entire compute? If you are some nation in medieval Europe, and another adjacent nation demanded every able bodied person to enter a triathlon to measure their net strength, would any sane leader agree to that?
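For concreteness, the hash-inversion “soak” being debated here could look like a toy proof-of-work puzzle: expensive to solve, one hash to verify. The difficulty parameter below is illustrative; a real scheme would calibrate it against the estimated spare compute A - B:

```python
# Toy compute-soaking puzzle: find a nonce such that
# sha256(challenge + nonce) has `difficulty_bits` leading zero bits.
# Solving takes ~2**difficulty_bits hashes on average; verifying a
# claimed solution costs exactly one hash.
import hashlib, os

def make_puzzle(difficulty_bits: int):
    return os.urandom(16), difficulty_bits

def solve(challenge: bytes, difficulty_bits: int) -> int:
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: bytes, difficulty_bits: int, nonce: int) -> bool:
    # Cheap for the issuer, regardless of how hard solving was.
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

challenge, bits = make_puzzle(difficulty_bits=12)  # ~2**12 hashes expected
nonce = solve(challenge, bits)
assert verify(challenge, bits, nonce)
```

The asymmetry (hard to solve, trivial to check) is the whole point; the issuer ratchets `difficulty_bits` up until the counterparty’s estimated spare capacity is consumed.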
Yup, all that would certainly make it more complicated. In a regime where this kind of tightly-controlled delegation were really important, we might also demand our counterparties standardize their hardware so they can’t play tricks like this.
I was picturing a more power-asymmetric situation, more like a feudal lord giving his vassals lots of busywork so they don’t have time to plot anything.
Wouldn’t that also leave them pretty vulnerable?
In the soaking-up-extra-compute case? Yeah, for sure. I can only really picture it (a) on a very short-term basis, for example while linking up tightly for important negotiations (but even here, it’s not very likely), or (b) in a situation with high power asymmetry. For example, maybe there’s a story where ‘lords’ delegate work to their ‘vassals’, but the workload intensity is variable, so the vassals have leftover compute, and the lords demand that they spend it on something like blockchain mining. To compensate for the vulnerability this induces, the lords would also provide protection.