I don’t think the usual arguments apply as obviously here. “Maximal Diamond” is much simpler than most other optimization targets. It seems much easier to solve outer-alignment for – Diamond was chosen because it’s a really simple molecule configuration to specify, and that just seems to be a pretty different scenario than most of the ones I’ve seen more detailed arguments for.
I’m partly confused about the phrasing “we have no idea how to do this.” (which is stronger than “we don’t currently have a plan for how to do this.”)
But in the interests of actually trying to answer this sort of thing for myself instead of asking Nate/Eliezer to explain why it doesn’t work, let me think through my own proposal of how I’d go about solving the problem, and see if I can think of obvious holes.
Problems currently known to me:
Reward hijacking
Point 19 in List of Lethalities (“there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment”).
Ontological updating (i.e. what exactly is a diamond?)
New to me from this post: the most important capabilities advances may come from an inner process that isn’t actually coupled to the reinforcement learning system. (I didn’t really get this until reading this post and haven’t finished thinking through the concept)
Main ingredients I’m imagining (disclaimer: I’m a layman making a lot of informed guesses, and wouldn’t be surprised if several of these turned out not to make sense):
First, get a general agent, with limitations to prevent immediate fooming. Get to general intelligence via something like DeepMind’s General Agents, this time starting from a language model that benefits from a lot of human concepts. My current belief is that you’d need to solve some major efficiency issues to do this with a reasonable amount of compute. If you have a Jupiter brain (as originally stipulated) I’m not sure it even requires new advances.
(Maybe scrub the language model of all references to ML/programming, initially. They’ll be helpful eventually but maybe don’t give the AGI a headstart on self-modification.)
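As a toy illustration of what that scrubbing might look like: a minimal sketch, assuming a crude keyword blocklist over pretraining documents (the blocklist terms, keep_document function, and example corpus are all hypothetical; a real filter would need to be much more thorough than string matching).

```python
# Hypothetical sketch: filter ML/programming references out of a pretraining corpus.
# A keyword blocklist is obviously crude; this only shows where such a filter sits.
BLOCKLIST = {"gradient descent", "neural network", "backpropagation",
             "machine learning", "source code", "compiler", "python"}

def keep_document(text: str) -> bool:
    """Keep a document only if it mentions none of the blocklisted terms."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)

corpus = [
    "Diamond is a crystalline form of carbon.",
    "We trained a neural network with gradient descent.",
]
scrubbed = [doc for doc in corpus if keep_document(doc)]
print(scrubbed)  # only the non-ML document survives
```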
Train it to maximize diamonds in many different environments. Starting with (relatively) modest amounts of compute, train it to make diamonds in different simulated and physical worlds. Initially, at subhuman intelligence, the AGI isn’t expected to invent its own diamond-making technology. But it’s presented with different physical-or-simulated tools that make things similar-but-non-identical to diamonds, and the tools have internal parts it can understand, and it’s rewarded for choosing between them accurately. So it has to develop an understanding of physics.
(I think the outer-alignment goal here is to get its understanding of physics to advance faster than its ability to self-modify, so that you can force it to work through ontological problems before it could get subverted by them.)
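To make the shape of this step concrete: a minimal toy sketch of a multi-environment training loop, assuming hypothetical ToyWorld environments where “diamond” is just a per-world target configuration and the reward is computed by the environment rather than by any channel the agent controls. All names here (ToyWorld, random_policy) are illustrative, not an existing framework.

```python
import random

class ToyWorld:
    """Hypothetical stand-in for one simulated world. Each world has its own
    'physics': a target configuration that counts as diamond there."""
    def __init__(self, target):
        self.target = target              # the configuration that counts as diamond
        self.state = [0] * len(target)

    def step(self, action):
        i, v = action                     # action = (index, value): write into the world
        self.state[i] = v

    def diamond_count(self):
        # Reward is computed by the environment, outside the agent's control.
        return sum(1 for s, t in zip(self.state, self.target) if s == t)

def random_policy(world):
    """Placeholder for the agent: a real setup would learn from the reward signal."""
    return (random.randrange(len(world.state)), random.randrange(2))

worlds = [ToyWorld(target=[random.randrange(2) for _ in range(8)]) for _ in range(5)]

for world in worlds:                      # "many different environments"
    for _ in range(100):
        world.step(random_policy(world))
    print("reward:", world.diamond_count())
```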
Some notable training sets it needs to include (a rough code sketch follows this list):
digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion
its ability to parse what’s going on in the digital worlds depends on sensors that are present in the digital world (likewise for the physical world), and there are different arrays of sensors in different worlds. It’s trained against situations where it has the ability to modify its own sensors for simple reward hacking.
eventually it’s taught programming/hardware know-how, and put in fairly simplified situations where the solution to its puzzle is to notice that it physically exists and make changes to its hardware or software, without directly hijacking its own reward function.
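Here’s the rough sketch referenced above: a hedged illustration of how those environment variations might be laid out as a curriculum. The WorldVariant fields and the curriculum entries are hypothetical placeholders for the kinds of worlds the list describes (ontology shifts, tamperable sensors, embedded-agency puzzles), not a real training framework.

```python
from dataclasses import dataclass

@dataclass
class WorldVariant:
    name: str
    physics: str                 # which sub-atomic model the world runs on
    sensors: list                # which sensor array the agent must rely on
    sensors_writable: bool       # can the agent tamper with its own sensors?
    exposes_own_hardware: bool   # does the puzzle require noticing it is embedded?

curriculum = [
    WorldVariant("baseline", physics="standard", sensors=["camera"],
                 sensors_writable=False, exposes_own_hardware=False),
    WorldVariant("alt-physics", physics="alternate lattice model", sensors=["camera"],
                 sensors_writable=False, exposes_own_hardware=False),
    WorldVariant("tamperable-sensors", physics="standard",
                 sensors=["camera", "spectrometer"],
                 sensors_writable=True, exposes_own_hardware=False),
    WorldVariant("embedded-agency", physics="standard", sensors=["camera"],
                 sensors_writable=False, exposes_own_hardware=True),
]

for v in curriculum:
    print(v.name,
          "| ontology shift:", v.physics != "standard",
          "| reward-hack bait:", v.sensors_writable,
          "| embeddedness puzzle:", v.exposes_own_hardware)
```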
(Note: thinking through all these specific details is pretty helpful for noticing how many steps are involved here. I think for this sort of plan to work you actually need a lot of different puzzles that are designed to be solvable with safe amounts of compute, so it doesn’t just bulldoze past your training setup. Designing such puzzles seems pretty time-consuming. In practice I don’t expect the Successfully Aligned “murder everyone and make diamonds forever” bot to be completed before the “murder everyone and make some Random Other Thing Forever” bot.)
Even though my goal is a murder-bot-that-makes-diamonds-forever, I’m probably coupling all of this with attempts at corrigibility training, dealing with uncertainty, impact tracking, etc., to give myself extra time to notice problems. (i.e. if the machine isn’t sure whether the thing it’s making is diamond, it makes a little bit first, asks humans to verify that it’s diamond, etc. Do similar training on “don’t modify the input channel for ‘was it actually diamond tho?’”.)
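As a minimal sketch of that “if unsure, make a little and ask” check: the is_probably_diamond, ask_human, and production_step names and the confidence threshold are all hypothetical stand-ins, just to show where the human-verification gate would sit.

```python
def is_probably_diamond(sample) -> float:
    """Stand-in for the agent's own confidence that `sample` is diamond."""
    return sample.get("confidence", 0.0)

def ask_human(sample) -> bool:
    """Stand-in for the human verification channel -- which itself must not be
    modifiable by the agent, per the 'don't modify the input channel' training."""
    return bool(sample.get("human_says_diamond", False))

def production_step(sample, confidence_threshold=0.95):
    if is_probably_diamond(sample) >= confidence_threshold:
        return "produce at scale"
    # Low confidence: make a small batch first and defer to human verification.
    if ask_human(sample):
        return "produce at scale after human confirmation"
    return "halt and flag for review"

print(production_step({"confidence": 0.6, "human_says_diamond": True}))
```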
Assuming those tricks all work and hold up under tons of optimization pressure, this all still leaves us with inner alignment, and point #4 on my list of known-to-me-concerns. “The most important capabilities advances may come from an inner process that isn’t actually coupled to the reinforcement learning system.”
And… okay actually this is a new thought for me, and I’m not sure how to think about it yet. I can see how it was probably meant to be included in the “confusingly pervasive consequentialism” concept, but I didn’t get the “and therefore, impervious to gradient descent” argument till just now.
I’m out of time for now, will think about this more.
I think even without point #4 you don’t necessarily get an AI maximizing diamonds. Heuristically, it feels to me like you’re bulldozing open problems without understanding them (e.g. ontology identification by training with multiple models of physics, getting it not to reward-hack by explicit training, etc.), all of which are vulnerable to a deceptively aligned model (just wait till you’re out of training to reward-hack). Also, every time you say “train it by X so it learns Y” you’re assuming alignment (e.g. “digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion”).
IMO shard theory provides a great frame to think about this in; it’s a must-read for improving alignment intuitions.