Thank you! It does seem like simulating text generated using similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get “contaminated” at some point, and models might cease to be helpful without updating them on the newest research.
In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models’ outputs before reasoning about superrationality, which would turn things into a version of Newcomb’s problem with transparent boxes. This might make coordination between the models less likely? Apart from this, I do think logical dependencies and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.
The other possibility would be to not rely on IDA at all, instead just training a superhuman model and using it directly. Maybe one could extract superhuman knowledge from it safely via some version of microscope AI? Of course, in this case, the model might still reason about humans using similar models, based on its generalization ability alone. Regarding using prompts, I wonder: how do you think we could get the kind of model you talk about in your post on conditioning generative models?
> Apart from this, I do think logical dependencies and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.
Oh interesting. I think this still runs into the issue that you’ll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue).
> Regarding using prompts, I wonder: how do you think we could get the kind of model you talk about in your post on conditioning generative models?
I was imagining that we train the model to predict e.g. tomorrow’s newspaper given today’s. The fact that it’s not just a stream of text but comes with time-stamps (e.g. this was written X hours later) feels important for making it simulate actual histories.
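To make the shape of that setup concrete, here is a minimal sketch of what a time-stamped "predict tomorrow's text given today's" training pair could look like. The tag format, the `Article` fields, and the `make_example` helper are illustrative assumptions, not a scheme specified in this discussion; the point is only that the target comes with an explicit "written X hours later" offset rather than being a bare continuation.

```python
# Illustrative sketch only: packaging time-stamped prediction pairs for a
# generative model that is trained to continue a history by a stated number
# of hours, rather than to continue an undated stream of text.
from dataclasses import dataclass


@dataclass
class Article:
    text: str
    timestamp_hours: float  # hours since some fixed reference time


def make_example(today: Article, tomorrow: Article) -> dict:
    """Build a (conditioning prompt, target) pair with an explicit time offset."""
    delta = tomorrow.timestamp_hours - today.timestamp_hours
    prompt = (
        f"[ARTICLE t=0h]\n{today.text}\n"
        f"[ARTICLE t=+{delta:.0f}h]\n"  # the "written X hours later" marker
    )
    return {"prompt": prompt, "target": tomorrow.text}


# Toy usage: a standard next-token objective on prompt + target would then
# train the model to simulate what gets written 24 hours into the future.
today = Article(text="Markets closed flat on Monday...", timestamp_hours=0.0)
tomorrow = Article(text="Stocks rallied on Tuesday after...", timestamp_hours=24.0)
pair = make_example(today, tomorrow)
```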