What source code and what machine code is actually being executed on some particular substrate is an empirical fact about the world, so in general, an AI (or a human) might learn it the way we learn any other fact—by making inferences from observations of the world.
For example, maybe you hook up a debugger or a waveform reader to the AI’s CPU to get a memory dump, reverse engineer the running code from the memory dump, and then prove some properties you care about follow inevitably from running the code you reverse engineered.
In general though, this is a pretty hard, unsolved problem—you probably run into a bunch of issues related to embedded agency pretty quickly.
To get an intuition for why these problems might be possible to solve, consider the tools that humans who want to cooperate or verify each other’s decision processes might use in more restricted circumstances.
For example, maybe two governments which distrust each other conduct a hostage exchange by meeting in a neutral country, and bring lots of guns and other hostages they intend not to exchange that day. Each government publicly declares that if the hostage exchange doesn’t go well, they’ll shoot a bunch of the extra hostages.
Or, suppose two humans are in a True Prisoner’s Dilemma with each other, in a world where fMRI technology has advanced to the point where we can read off semantically meaningful inner thoughts from a scan in real time. Both players agree to undergo such a scan and make the scan results available to their opponent while making their decision. This might not allow them to prove that their opponent will cooperate iff they cooperate, but it will still probably make it a lot easier for them to robustly cooperate for valid decision-theoretic reasons, rather than for reasons of honor or a spirit of friendliness.
What I don’t understand is how an agent can trust that it isn’t being deceived in this information sharing, especially since, if it is successfully deceived into cooperating, the defector gets a larger reward. This is even more of a problem if interpretability of sufficiently complex models is not computable.
In the human example, there exists a machine that can interpret the human’s thoughts and is presumably far more computationally powerful than the human. Could this exist in the case of a superintelligent agent?
Probably, if the agents make themselves amenable to interpretability and inspection? If I, as a human, am going through a security checkpoint (say, at an airport), there are two ways I can get through successfully:
1. Be clever enough to hide any contraband or otherwise circumvent the security measures in place.
2. Just don’t bring any contraband and go through the checkpoint honestly.
As I get more clever and more adversarial, the security checkpoint might need to be more onerous and more carefully designed, to ensure that the first option remains infeasible to me. But the designers of the checkpoint don’t necessarily have to be smarter than me, they just have to make the first option difficult enough, relative to any benefit I would gain from it, such that I am more likely to choose the second option.
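A rough way to make the checkpoint argument concrete is as an expected-value comparison. The sketch below is purely illustrative and not part of the argument above: the names `benefit`, `penalty`, and `p_detect` are assumptions, the honest option is normalised to a payoff of 0, and the traveller is assumed to be risk-neutral.

```python
def prefers_honesty(benefit: float, penalty: float, p_detect: float) -> bool:
    """Return True if going through honestly beats attempting to smuggle.

    Assumptions (hypothetical model, not from the discussion):
    - the honest option has payoff 0,
    - a successful smuggle yields `benefit`,
    - getting caught costs `penalty`,
    - the traveller is risk-neutral.
    """
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be in [0, 1]")
    expected_cheat = (1.0 - p_detect) * benefit - p_detect * penalty
    return expected_cheat < 0.0


def min_detection_probability(benefit: float, penalty: float) -> float:
    """Break-even detection probability: above this, cheating has negative expected value."""
    if benefit < 0 or penalty < 0 or benefit + penalty == 0:
        raise ValueError("benefit and penalty must be non-negative and not both zero")
    return benefit / (benefit + penalty)
```

Under these assumptions, a benefit of 10 and a penalty of 90 give a break-even detection probability of 0.1. The checkpoint designers only have to push detection past that threshold; they do not have to out-think the traveller in general.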
Yes, but part of what makes interpretability hard is that it may be incomputable. And part of what makes interpretable models difficult to create is that there may be some inner deception you are missing. I think any argument that makes alignment difficult symmetrically makes it difficult for two agents to work together. Think about it: if agent 1 can cooperate with agent 2, agent 1 can also bootstrap a version of agent 2 aligned with itself.
Intuitively, interpreting an AI feels like trying to break some cryptographic hash, but much harder. For example, even interpreting GPT-2 is a monumental task for us humans, despite our relatively vastly superior computational abilities and knowledge. After all, something that’s 2 OOMs superior (GPT-4) could only interpret a fraction of a percent of neurons, and only with limited confidence. It’s likely these are relatively trivial ones too.
Therefore, I think the only way for agent 1 to confirm the motives of agent 2 and eliminate unknown unknowns is to fully simulate it, assuming there isn’t some shortcut to interpretability. And of course fully simulating it requires agent 1 to successfully box agent 2, and probably to have a large capability edge over it.
I don’t think the TSA analogy is perfect either. There is no (Defect, Cooperate) case, and the risks in it are known unknowns. You have a good understanding of what happens if you fail (you just go to jail), but adversarial agents have absolutely no understanding of what happens to them if they are wrong. Being wrong here means you get eliminated, so the negative expectation of (D, C), for example, is unbounded. You also have a reasonable expectation of what the TSA will do to verify you aren’t breaking the law: their scanning methodology alone eliminates a lot of detection methods from the probability space. With another agent, you have no idea what it will do. Maybe it sent an imperfect clone to communicate with you; maybe there’s a hole in the proposed consensus protocol that you haven’t considered. The point is that this is one giant unknown unknown that can’t be modeled, because you are trying to predict what someone with greater capabilities than you can do.
Lastly, I think this chain of reasoning breaks down if there is some robust consensus protocol or some way to make models fully interpretable. However, if such a method exists, we would probably just align the AGI to begin with, since our narrow AI would help us create interpretable AGI or a robust system for detecting inner deception.
I agree the analogy breaks down in the case of very adversarial agents and/or big gaps in intelligence or power. My point is just that these problems probably aren’t unsolvable in principle for humans or AI systems who have something to gain from cooperating or trading, and who are at roughly equal but not necessarily identical levels of intelligence and power. See my response to the sibling comment here for more.
yes but part of what makes interpretability hard is it may be incomputable.
Aside: I think “incomputable” is a vague term, and results from computational complexity theory often don’t have as much relevance to these kinds of problems as people intuitively expect. See the second part of this comment for more.
I mean incomputable in that computation would exceed the physical resources of the universe, whether it is P or NP. We can have a weaker definition of incomputable too, say it would exceed the capabilities of the agent by at least 1 OOM.
Specifically with regard to interpretability, if it isn’t possible to do easily, my intuition tells me it will require brute-force simulation or exhaustive search (think mining Bitcoin). In that case you need a system more powerful than whatever you are trying to interpret, which makes it kind of difficult in the context of two agents who are at roughly similar capability levels.
I think it will be difficult for goal maximizers to cooperate for the same reasons it would be difficult for humans to survive in a world with a superintelligent goal maximizer. In almost any argument leading to doom for humans from a paperclip-maximizing kind of ASI, you can replace “human” with “agent” and the argument will still stand.
It is easy for intelligent humans and groups of humans to cooperate because they very rarely have fanatical goals, and historically, the ones that did went to war.
What source code and what machine code is actually being executed on some particular substrate is an empirical fact about the world, so in general, an AI (or a human) might learn it the way we learn any other fact—by making inferences from observations of the world.
This is a good point.
But I’m trying to develop some detailed intuitions about how this would or could work, in particular what practical difficulties there are and how they could be overcome.
For example, maybe you hook up a debugger or a waveform reader to the AI’s CPU to get a memory dump, reverse engineer the running code from the memory dump, and then prove some properties you care about follow inevitably from running the code you reverse engineered.
In general though, this is a pretty hard, unsolved problem—you probably run into a bunch of issues related to embedded agency pretty quickly.
(What do you mean by “waveform reader”?)
Some practical difficulties with your first paragraph:
1. How can AIs credibly claim that any particular CPU is running their code, or that a debugger connected to it isn’t being subverted via, e.g., MITM?
2. How can AIs credibly claim that whatever the contents of a ‘CPU’s’ memory are at some point, they won’t be replaced at some later point?
3. How could one AI safely execute code given to it by another (e.g. via “memory dump”)?
4. How could one AI feasibly run another’s code ‘fast enough’ to be able to determine that it could (probably) trust it now (even assuming [1], [2], and [3] are solved)?
[1] points to what I see as a big difficulty, i.e. AIs will probably (or could) be very distributed computing systems, and there might not be any practical way for them to ‘fit into a single box’ for, e.g., careful inspection by others.
An oscilloscope. Note that it isn’t particularly realistic to hook up a scope to the kind of hardware that current AI systems are typically trained and run on.
But what I was trying to gesture at with this comment is that this is the kind of problem that the AI might be able to help you with. If the AI isn’t willing or able to fold itself up into something that can be run entirely on a single, human-inspectable CPU in an airgapped box, running code that is amenable to easily proving things about its behavior, you can just not cooperate with it, or not do whatever else you were planning to do by proving something about it, and just shut it off instead.
(If the AI is already adversarial to the point where it won’t let you shut it off, and is running on a distributed system, you’ve already lost. Willingness to fold itself up and be shut off means that the AI is already pretty aligned; it wouldn’t surprise me if this problem is alignment-complete.)
As for your practical difficulties, I agree these are all problems. I am not saying the problem you pose isn’t hard, just that there doesn’t seem to be anything that makes it fundamentally impossible to solve in principle.
There is lots of academic research on hardware security and secure hardware, verifiable computing (e.g. using zk-SNARKs), formally verified programming, chain-of-trust, etc. that attempts to make progress on small pieces of this problem (not necessarily with a specific focus on AI). Stitching all of these things together into an actually end-to-end secure system for interacting with a smarter-than-human AI system is probably possible, but will require solving many unsolved problems, and designing and building AI systems differently from how we currently do. IMO, it’s probably better to just build an AI that provably shares human values from the start.
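To give a flavor of the smallest of those pieces, here is a minimal sketch of the "measurement" step that chain-of-trust schemes build on: hashing an image (for example, a memory dump) and comparing it to a value agreed on beforehand. The function names are hypothetical, and the sketch assumes the dump is genuine and that both parties already share the expected hash, which is exactly what difficulties [1] and [2] call into question.

```python
import hashlib
import hmac


def measure(image: bytes) -> str:
    """Return a SHA-256 hex digest of the image (a hypothetical memory dump)."""
    return hashlib.sha256(image).hexdigest()


def matches_expected(image: bytes, expected_hex: str) -> bool:
    """Compare the measurement to a previously agreed value.

    hmac.compare_digest does a constant-time comparison, so the check
    does not leak how many leading characters matched.
    """
    return hmac.compare_digest(measure(image), expected_hex)
```

This only tells you that the bytes you were handed match what was agreed. It says nothing about whether those bytes are what is actually running, or whether they will still be running later.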
I guessed that’s what you meant but was curious whether I was right!
If the AI isn’t willing or able to fold itself up into something that can be run entirely on a single, human-inspectable CPU in an airgapped box, running code that is amenable to easily proving things about its behavior, you can just not cooperate with it, or not do whatever else you were planning to do by proving something about it, and just shut it off instead.
Any idea how a ‘folded-up’ AI would imply anything in particular about the ‘expanded’ AI?
If an AI ‘folded itself up’ and provably/probably ‘deleted’ its ‘expanded’ form (and all instances of that), as well as any other AIs or non-AI agents under its control, that does seem like it would be nearly “alignment-complete” (especially relative to our current AIs), even if, e.g., the AI expected to be able to escape that ‘confinement’.
But that doesn’t seem like it would work as a general procedure for AIs cooperating or even negotiating with each other.
conduct a hostage exchange by meeting in a neutral country, and bring lots of guns and other hostages they intend not to exchange that day
That is, they alter the payoff matrix instead of trying to achieve CC in the prisoner’s dilemma. And that may be more efficient than spending time and energy on proofs, source code verification protocols, and the as-yet-unknown downsides of being an agent that others can robustly CC with, while being the same kind of agent.
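A toy illustration of what altering the payoff matrix does, under assumptions that are mine rather than the original comment’s: the base payoffs are the textbook values (5, 3, 1, 0), and the threat against the extra hostages is modeled as a flat `penalty` subtracted from whichever side defects.

```python
from typing import Dict, Tuple

Payoffs = Dict[Tuple[str, str], Tuple[float, float]]

# (row move, column move) -> (row payoff, column payoff); illustrative values only.
BASE: Payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def with_defection_penalty(payoffs: Payoffs, penalty: float) -> Payoffs:
    """Return a new matrix where each defecting player loses `penalty`."""
    return {
        (a, b): (
            pa - (penalty if a == "D" else 0),
            pb - (penalty if b == "D" else 0),
        )
        for (a, b), (pa, pb) in payoffs.items()
    }


def best_response_row(payoffs: Payoffs, other_move: str) -> str:
    """Row player's best move given the column player's move."""
    return max(("C", "D"), key=lambda m: payoffs[(m, other_move)][0])
```

In the base game, `best_response_row` returns "D" against either move, which is the usual dilemma. With a penalty of 3 it returns "C" against either move, so cooperation becomes dominant without either side having to verify anything about the other’s decision process. Any penalty above 2 has this effect for these particular numbers.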