Well, I presented a very simple formulation in my comment, so that could be a reasonable starting point.
But I agree that unfortunately there hasn’t been that much good formal analysis here that’s been written up. At least on my end, that’s for two reasons:
1. Most of the formal analysis of this form that I’ve published (e.g. this and this) has been focused on sycophancy (human imitator vs. direct translator) rather than deceptive alignment, as sycophancy is a substantially more tractable problem. Finding a prior that reasonably rules out deceptive alignment seems quite out of reach to me currently; at one point I thought a circuit prior might do it, but I now think that circuit priors don’t get rid of deceptive alignment.
2. I’m currently more optimistic about empirical evidence rather than theoretical evidence for resolving this question, which is why I’ve been focusing on projects such as Sleeper Agents.
Right, and I’ve explained why I don’t think any of those analyses are relevant to neural networks. Deep learning simply does not search over Turing machines or circuits of varying lengths. It searches over parameters of an arithmetic circuit of fixed structure, size, and runtime. So Solomonoff induction, speed priors, and circuit priors are all inapplicable. There has been a lot of work in the mainstream science of deep learning literature on the generalization behavior of actual neural nets, and I’m pretty baffled at why you don’t pay more attention to that stuff.
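To make the contrast concrete, here is a minimal sketch (the notation is just illustrative, not a claim about anyone’s preferred formalization). A Solomonoff-style prior weights every program $p$, of whatever length, that a universal machine $U$ could run:

$$M(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}.$$

Gradient descent on a neural net instead does local search over a parameter vector of fixed dimension $d$ for a fixed architecture $f_\theta$:

$$\theta \in \mathbb{R}^{d}, \qquad \theta_{t+1} \;=\; \theta_t - \eta\, \nabla_\theta \mathcal{L}(f_{\theta_t}),$$

so the set of candidate hypotheses never varies in size or runtime during training.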
> Right, and I’ve explained why I don’t think any of those analyses are relevant to neural networks. Deep learning simply does not search over Turing machines or circuits of varying lengths. It searches over parameters of an arithmetic circuit of fixed structure, size, and runtime. So Solomonoff induction, speed priors, and circuit priors are all inapplicable.
It is trivially easy to modify the formalism to search only over fixed-size algorithms, and in fact that’s usually what I do when I run this sort of analysis. I feel like you still aren’t understanding the key criticism here—it’s really not about Solomonoff induction—and I’m not sure how to explain that in any way other than how I’ve already done so.
> There has been a lot of work in the mainstream science of deep learning literature on the generalization behavior of actual neural nets, and I’m pretty baffled at why you don’t pay more attention to that stuff.
I’m going to assume you just aren’t very familiar with my writing, because working through empirical evidence about neural network inductive biases is something I love to do all the time.
> It is trivially easy to modify the formalism to search only over fixed-size algorithms, and in fact that’s usually what I do when I run this sort of analysis.
What? Which formalism? I don’t see how this is true at all. Please elaborate or send an example of “modifying” Solomonoff so that all the programs have fixed length, or “modifying” the circuit prior so all circuits are the same size.
No, I’m pretty familiar with your writing. I still don’t think you’re focusing on mainstream ML literature enough because you’re still putting nonzero weight on these other irrelevant formalisms. Taking that literature seriously would mean ceasing to take the Solomonoff or circuit prior literature seriously.