It depends on what you know about the model and the reason you have to be concerned in the first place (if it’s just “somehow”, that’s not very convincing).
You might be worried that training it leads to the emergence of inner optimizers, whether they are somehow “trying” to be good at prediction in a way that might generalize to taking real-life actions, approximating the searchy part of the humans they are trying to predict, or just being RL agents. If you are using a basically standard architecture with a lot more compute, these all seem unlikely. But if I were you, I might still test its ability to perform well in a domain it has never seen, where humans start out performing poorly but learn what to do very quickly (think of video games with new mechanics). If it does well, you have a qualitatively new thing on your hands: don’t deploy it, study it instead. And if you have some a priori reason to think this could happen, and only a small subset of the data is needed to produce it, do a smaller training run on just that data first.
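For concreteness, here is a minimal sketch of what that test could look like, assuming a Gym-style interface for the unseen environment and a hypothetical `model_policy` wrapper that turns the model’s output into an action; none of these names refer to a real harness, they are placeholders for whatever setup you actually use:

```python
# Minimal sketch, assuming a Gym-style env the model has never seen and a
# hypothetical model_policy(obs, history) wrapper around the model.
# Every name here is a placeholder, not a real API.

def evaluate_fast_learning(env, model_policy, n_episodes=20, max_steps=500):
    """Play the first n_episodes of an unseen game and return per-episode scores."""
    scores = []
    history = []  # transcripts of past episodes, so the model can adapt in context
    for _ in range(n_episodes):
        obs = env.reset()
        episode, total_reward = [], 0.0
        for _ in range(max_steps):
            action = model_policy(obs, history)
            obs, reward, done, _ = env.step(action)
            episode.append((obs, action, reward))
            total_reward += reward
            if done:
                break
        history.append(episode)
        scores.append(total_reward)
    return scores
```

You would then compare the resulting learning curve against human baselines on the same game; the warning sign is a curve that climbs as quickly as (or faster than) the humans’.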
Or you might be worried about mostly external consequentialist cognition (think explicit textual if-then-elses). In that case, existing systems can already do this to some extent, and what matters is how good its reasoning actually is, so perform capability evaluations. If it looks like there is some way, with any known method, of getting it to do novel research, or it is getting close, don’t deploy: otherwise someone might figure out how to use it to do AI research, and then you get a singularity.
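As a rough illustration of what such an evaluation gate could look like (the task suite, the grading function, and the threshold are all placeholders you would have to supply; nothing here is a real benchmark):

```python
# A minimal sketch of a capability-evaluation gate, assuming a hypothetical
# run_model() call and a suite of research-style tasks, each with a grader.

def research_capability_eval(run_model, tasks, threshold=0.2):
    """Return (pass_rate, deploy_ok) over a suite of novel-research-style tasks."""
    passed = 0
    for task in tasks:
        output = run_model(task.prompt)   # e.g. "propose and justify a novel improvement to X"
        if task.grade(output):            # expert or automated grading of the attempt
            passed += 1
    pass_rate = passed / len(tasks)
    # Err on the side of caution: block deployment well below "it can clearly do research".
    return pass_rate, pass_rate < threshold
```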
And in any case, you should worry about the effects your system will have on the AI race. Your AI might not be dangerous itself, but if it is a good enough lawyer or programmer to start putting many people out of their jobs, investment in AI research will increase a lot, and someone will figure out how to create an actual AGI sooner than they otherwise would.
Edit: And obviously you should also test how useful it could be to people trying to do mundane harm (e.g. with existing pathogens). Separately, there might not be a hard threshold of research ability past which a model starts being dangerous, so models might get there little by little, and you would be contributing to that.
Edit in response to the second clarification: Downscale the relevant factors, such as the amount of training data, the number of parameters, and the training time, or use a known-to-be-inferior architecture, until the worrying capabilities go away. Otherwise, you need to solve the alignment problem.
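A minimal sketch of that loop, with hypothetical `train` and `worrying_capabilities` functions standing in for your actual training and eval code (the config keys and the shrink factor are assumptions, and which factors to shrink first is a judgment call, not a recipe):

```python
# Shrink the run until the worrying capabilities no longer show up in evals.
# train() and worrying_capabilities() are hypothetical stand-ins.

def downscale_until_safe(config, train, worrying_capabilities, shrink=0.5, max_rounds=10):
    """Repeatedly shrink data / parameters / training time until evals come back clean."""
    for _ in range(max_rounds):
        model = train(config)
        if not worrying_capabilities(model):
            return model, config          # smallest change that removes the concern
        config = {
            **config,
            "n_train_tokens": int(config["n_train_tokens"] * shrink),
            "n_params": int(config["n_params"] * shrink),
            "train_steps": int(config["train_steps"] * shrink),
        }
    return None, config                   # couldn't make it safe by scaling down alone
```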
Edit in response to Beth Barnes’s comment: You should probably have people reviewing outputs to check that the model behaves well, but if you actually think you need measures like “1000 workers with technical undergrad degrees, paid $50/hr” because you are worried it might somehow kill you, then you simply shouldn’t deploy it. It’s absurd to need to check whether a commercial product is an existential threat, or anything close to that.