OK, here are my guesses, without seeing anyone else’s answers. I think I’m probably wrong, which is why I’m asking this question:
1.a. Underdetermined? It depends on what we mean by the outer objective, and what we mean when we assume it has no inner alignment problems? See e.g. this discussion. That said, yeah it totally seems possible. If the part that predicts reward gets good at generalizing, it should be able to reason/infer/guess that hacking the reward function would yield tons of reward. And then that’s what the agent would do.
1.b. Yes? Even though I usually think of inner alignment failures in the context of black-box neural nets, and this is a bit more transparent (because of the tree search at its heart), I think the usual arguments would apply?
2. Yes? My limited, probably faulty understanding is that the policy is trained to approximate what you get when you search the tree of possible actions and predicted consequences to some depth and then look at how good or bad the resulting states seem to be (see the toy sketch at the end of this comment). There’s nothing in there about doing causal interventions or do-calculus or whatnot, so I assume that the predicted-consequences network would approximate conditionalization… however, Caspar Oesterheld has a paper arguing, I think, that under some conditions neural nets approximate CDT in the limit, so *shrugs*. I should go read it again and see if it applies.
3. Yes? Again, if it’s approximating what you get when you search the tree and pick the action that leads to the best predicted state, that sure seems pretty consequentialist. If initially there was some “never lie” heuristic, then wouldn’t it disappear quickly once you encountered situations where lying led to a better predicted outcome? Sure, you could hack it by always predicting bad outcomes from lying, but that contradicts our “sufficiently knowledgeable and capable” hypothetical.
4. I don’t know, but from what I’ve seen the answer is yeah, just Atari and stuff like that.
5. I don’t know, but it seems EfficientZero took 7 hours on 4 GPUs to do one 2-subjective-hour training run. That gives us a sense of how many operations it must run per subjective second (maybe 5 x 10^14? see the back-of-envelope at the end of this comment), but presumably it has many operations per parameter… I guess it would have made news if it had a parameter count comparable to large language models, so maybe it has something like 10^8 parameters? But surely they aren’t doing a million ops per parameter per subjective second? Gosh, I really am just guessing.
6. I see no obstacle; it seems like a quantitative question of how much bigger we’d need to make it, how much longer we’d need to train it, and maybe what sort of data we’d have to give it. Qualitatively it seems like this would probably keep scaling to superintelligence if we poured enough into it (but this is probably true for lots of architectures, and doesn’t mean much in practice since we don’t have galaxy-sized computers).
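To make answer 2 a bit more concrete, here is a toy sketch of the kind of planning loop I’m picturing: roll the learned model forward a few steps, score the leaves with the value net, and pick the first action that leads somewhere good. This is not the actual MuZero/EfficientZero code (which uses MCTS rather than exhaustive expansion); `dynamics_model`, `reward_net`, and `value_net` are hypothetical stand-ins for the agent’s learned networks.

```python
# Toy sketch of the planning loop described in answer 2 -- NOT the actual
# MuZero/EfficientZero algorithm (which uses MCTS rather than exhaustive
# expansion). dynamics_model, reward_net, and value_net are hypothetical
# stand-ins for the agent's learned networks.

def plan(state, actions, dynamics_model, reward_net, value_net,
         depth=3, discount=0.99):
    """Pick the action whose predicted discounted return looks best
    after a depth-limited search through the learned model."""

    def lookahead(latent, d):
        if d == 0:
            return value_net(latent)  # bootstrap: how good does this state seem?
        return max(
            reward_net(latent, a)
            + discount * lookahead(dynamics_model(latent, a), d - 1)
            for a in actions
        )

    return max(
        actions,
        key=lambda a: reward_net(state, a)
        + discount * lookahead(dynamics_model(state, a), depth - 1),
    )
```

The policy net is then trained to imitate whatever this kind of search picks, which is also part of why I’d expect the distilled policy to inherit the search’s consequentialist flavor (answer 3).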
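Here is the back-of-envelope behind the 5 x 10^14 guess in answer 5, with the assumptions spelled out. The “7 hours on 4 GPUs for ~2 subjective hours” is the reported setup I’m going off; the peak-throughput and utilization numbers below are pure guesses on my part.

```python
# Back-of-envelope for answer 5. "7 hours on 4 GPUs for ~2 subjective hours"
# is the reported setup; the peak-throughput and utilization numbers below
# are guesses, not measurements.

wall_clock_hours = 7
num_gpus = 4
subjective_hours = 2

peak_flop_per_s_per_gpu = 1e14   # guess: order of magnitude for a modern GPU
utilization = 0.3                # guess: fraction of peak actually achieved

gpu_seconds_per_subjective_second = wall_clock_hours * num_gpus / subjective_hours
flop_per_subjective_second = (gpu_seconds_per_subjective_second
                              * peak_flop_per_s_per_gpu
                              * utilization)
print(f"~{flop_per_subjective_second:.0e} FLOP per subjective second")
# ~4e14 with these guesses, i.e. the 5 x 10^14 figure is plausible but
# extremely sensitive to the utilization guess.
```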
Just real quick
5. So at a high level we know exactly how many flops it takes to simulate Atari: it’s about 10^6 flop/s, vs perhaps 10^12 flop/s for typical games today (with the full potential of modern GPUs at ~10^14 flop/s, similar to reality). So you (and by you I mean DM) can actually directly compare, using knowledge of Atari, circuits, or the sim code, the computational cost of the learned Atari predictive model inside the agent vs the simulation cost of the (now defunct) actual Atari circuit. There isn’t much uncertainty in that calculation; both are known things (not like comparing to reality).
The parameter count isn’t really important; this isn’t a GPT-3 style language model designed to absorb the web. Its parameter count is about as relevant as the parameter count of a super-high-end Atari simulator that can simulate billions of Atari frames per second: not much, because Atari is small. Also, that is exactly what this thing is.
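To spell that comparison out as arithmetic, here is a minimal sketch. The 10^6 flop/s figure is the one quoted above; the parameter count is a purely hypothetical placeholder (the exact number doesn’t change the conclusion), and “2 flops per parameter per step” is a crude dense-net approximation.

```python
# Rough comparison: cost of the learned Atari model inside the agent vs
# simulating the original Atari circuit (the ~10^6 flop/s figure above).
# The parameter count is a hypothetical placeholder, not EfficientZero's
# actual size, and "2 flops per parameter per step" is a crude dense-net
# approximation.

atari_circuit_flop_per_s = 1e6        # figure quoted above
frames_per_second = 60                # Atari's native frame rate
hypothetical_params = 1e7             # placeholder for the dynamics model's size

flop_per_model_step = 2 * hypothetical_params
learned_model_flop_per_s = flop_per_model_step * frames_per_second

print(f"learned model: ~{learned_model_flop_per_s:.0e} flop/s vs "
      f"Atari circuit: ~{atari_circuit_flop_per_s:.0e} flop/s")
# With these placeholders the learned model costs orders of magnitude more
# per simulated second than emulating the console directly; either way both
# numbers are known, small, and directly comparable, which is the point.
```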