It may still be possible to harness the capabilities of larger models without invoking character simulation and the problems that come with it, by prompting or fine-tuning the models in particular, careful ways.
There were some ideas proposed in the paper “Conditioning Predictive Models: Risks and Strategies” by Hubinger et al. (2023). But since it was published over a year ago, I’m not sure whether anyone has gotten far in investigating those strategies to see which ones could actually work. (I’m not seeing anything like that among the papers citing it.)
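To make that concrete, here is a minimal sketch of the general kind of prompting strategy the paper gestures at: framing a request as pure text prediction (document continuation) rather than as a dialogue with an assistant character. Everything below, including the `complete` helper and the prompt wording, is a hypothetical illustration rather than anything taken from the paper itself.

```python
# Sketch of "conditioning as prediction" prompting, assuming a generic
# text-completion model behind a hypothetical complete() function.

def complete(prompt: str, max_tokens: int = 256) -> str:
    """Hypothetical stand-in for any raw text-completion API."""
    raise NotImplementedError("wire this up to your model of choice")

# Persona-style prompting: invokes an assistant *character*, with the
# simulation baggage that entails (the model predicts what that character
# would say, including its quirks and failure modes).
persona_prompt = (
    "You are Sydney, a helpful assistant.\n"
    "User: How do I parse a date in Python?\n"
    "Sydney:"
)

# Predictor-style conditioning: frame the request as the opening of a
# plausible *document* whose natural continuation contains the answer,
# so the model is predicting text rather than role-playing a character.
predictor_prompt = (
    "Excerpt from a Python reference manual, section on parsing dates.\n"
    "To parse a date string into a datetime object, use\n"
)

if __name__ == "__main__":
    for name, prompt in [("persona", persona_prompt),
                         ("predictor", predictor_prompt)]:
        print(f"--- {name} ---")
        print(prompt)
        # print(complete(prompt))  # uncomment once complete() is wired up
```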
If you’re willing to share more on what those ways would be, I could forward that to the team that writes Sydney’s prompts when I visit Montreal.
Thanks, I think you’re referring to:
> There were some ideas proposed in the paper “Conditioning Predictive Models: Risks and Strategies” by Hubinger et al. (2023). But since it was published over a year ago, I’m not sure whether anyone has gotten far in investigating those strategies to see which ones could actually work. (I’m not seeing anything like that among the papers citing it.)
Appreciate you getting back to me. I was aware of this paper already and have previously worked with one of the authors.