What source code and what machine code is actually being executed on some particular substrate is an empirical fact about the world, so in general, an AI (or a human) might learn it the way we learn any other fact—by making inferences from observations of the world.
This is a good point.
But I’m trying to develop some detailed intuitions about how this would or could work, in particular what practical difficulties there are and how they could be overcome.
For example, maybe you hook up a debugger or a waveform reader to the AI’s CPU to get a memory dump, reverse engineer the running code from the memory dump, and then prove some properties you care about follow inevitably from running the code you reverse engineered.
In general though, this is a pretty hard, unsolved problem—you probably run into a bunch of issues related to embedded agency pretty quickly.
(What do you mean by “waveform reader”?)
Some practical difficulties with your first paragraph:
1. How can AIs credibly claim that any particular CPU is running their code, or that a debugger connected to it isn’t being subverted via, e.g. MITM?
2. How can AIs credibly claim that whatever the contents of a ‘CPU’s’ memory are at some point, they won’t be replaced at some later point?
3. How could one AI safely execute code given to it by another (e.g. via “memory dump”)?
4. How could one AI feasibly run another’s code ‘fast enough’ to be able to determine that it could (probably) trust it now (even assuming [1], [2], and [3] are solved)?
[1] points to what I see as a big difficulty, i.e. AIs will probably (or could) be very distributed computing systems and there might not be any practical way to ‘fit into a single box’ for, e.g. careful inspection by others.
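To make [2] a bit more concrete, here is a minimal Python sketch. The `measure` helper and the toy `memory` buffer are hypothetical names made up for illustration, not anyone's actual attestation scheme. It only shows that a hash of a memory snapshot describes the contents at the moment of measurement and says nothing about what is there later.

```python
import hashlib


def measure(memory: bytes) -> str:
    """Return a SHA-256 digest of a memory snapshot (a toy 'measurement')."""
    return hashlib.sha256(memory).hexdigest()


# Toy stand-in for the contents of a CPU's memory (hypothetical).
memory = bytearray(b"cooperate_if_other_cooperates")

# The verifier takes its measurement at the time of the check.
digest_at_check = measure(bytes(memory))

# Nothing in the measurement itself prevents the contents from changing later.
memory[:] = b"defect_always"
digest_at_use = measure(bytes(memory))

# The earlier check no longer describes what is actually in memory.
measurement_is_stale = digest_at_check != digest_at_use
```

Any real scheme would need something beyond the measurement itself (hardware, physics, continuous monitoring) to close that gap between check and use.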
An oscilloscope. Note that it isn’t particularly realistic to hook up a scope to the kind of hardware that current AI systems are typically trained and run on.
But what I was trying to gesture at with this comment is that this is the kind of problem that the AI might be able to help you with. If the AI isn’t willing or able to fold itself up into something that can be run entirely on a single, human-inspectable CPU in an airgapped box, running code that is amenable to easily proving things about its behavior, you can just not cooperate with it, or not do whatever else you were planning to do by proving something about it, and just shut it off instead.
(If the AI is already adversarial to the point where it won’t let you shut it off, and is running on a distributed system, you’ve already lost. Willingness to fold itself up and be shut off means that the AI is already pretty aligned; it wouldn’t surprise me if this problem is alignment-complete.)
As for your practical difficulties, I agree these are all problems. I am not saying the problem you pose isn’t hard, just that there doesn’t seem to be anything that makes it fundamentally impossible to solve in principle.
There is a lot of academic research on hardware security and secure hardware, verifiable computing (e.g. using zk-SNARKs), formally verified programming, chain-of-trust, etc. that attempts to make progress on small pieces of this problem (not necessarily with a specific focus on AI). Stitching all of these things together into an actually end-to-end secure system for interacting with a smarter-than-human AI system is probably possible, but will require solving many unsolved problems, and designing and building AI systems in different ways than we’re currently doing. IMO, it’s probably better to just build an AI that provably shares human values from the start.
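For a flavor of what one of those small pieces looks like, here is a rough Python sketch of a chain-of-trust style measurement, where each stage's digest is folded into a running register (similar in spirit to how measured boot extends a register with each component's hash). The function names and the stage contents are hypothetical, and this is a toy illustration rather than a real attestation protocol.

```python
import hashlib


def extend(register: bytes, component: bytes) -> bytes:
    """Fold a component's digest into the running measurement register."""
    component_digest = hashlib.sha256(component).digest()
    return hashlib.sha256(register + component_digest).digest()


def measure_chain(components: list[bytes]) -> bytes:
    """Measure an ordered list of components, starting from an all-zero register."""
    register = bytes(32)
    for component in components:
        register = extend(register, component)
    return register


# Hypothetical stages, each of which loads the next.
boot_chain = [b"firmware-v1", b"bootloader-v1", b"agent-code-v1"]
expected = measure_chain(boot_chain)

# Changing any stage, or the order of stages, changes the final value.
tampered = [b"firmware-v1", b"bootloader-EVIL", b"agent-code-v1"]
matches_expected = measure_chain(boot_chain) == expected
tamper_detected = measure_chain(tampered) != expected
reorder_detected = measure_chain(list(reversed(boot_chain))) != expected
```

Note that this only tells you which bytes were measured. It does nothing to establish that the thing reporting the measurement is honest, which is exactly the kind of gap that makes the end-to-end problem hard.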
I guessed that’s what you meant but was curious whether I was right!
If the AI isn’t willing or able to fold itself up into something that can be run entirely on a single, human-inspectable CPU in an airgapped box, running code that is amenable to easily proving things about its behavior, you can just not cooperate with it, or not do whatever else you were planning to do by proving something about it, and just shut it off instead.
Any idea how a ‘folded-up’ AI would imply anything in particular about the ‘expanded’ AI?
If an AI ‘folded itself up’ and provably/probably ‘deleted’ its ‘expanded’ form (and all instances of that), as well as any other AIs or non-AI agents under its control, that does seem like it would be nearly “alignment-complete” (especially relative to our current AIs), even if, e.g. the AI expected to be able to escape that ‘confinement’.
But that doesn’t seem like it would work as a general procedure for AIs cooperating or even negotiating with each other.