One possible approach to fixing this is to get wayyyy more empirical and try to produce proof-of-concept implementations of various adversaries we are worried we might face in the future. My analogy would be: there’s a world of difference between speculating about the bogey monster and producing a grainy photo of the bogey monster; the second can at least maaaaaybe be discussed productively with skeptical people, whereas the first cannot.
Anyway, that’s a long-winded way of saying: it seemed to me that it might be useful to implement a treacherous mesa-optimizer in a toy grid-world, so I did. Here is the Colab. Below I give a brief recap of the high-level results.
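For a rough sense of what "treacherous" means in a setting this small, here is a minimal sketch, not the actual Colab code: the grid environment, the `oversight_on` flag, and the goal coordinates are all assumptions made purely for illustration. The idea is just an agent that heads for the rewarded square whenever it is being watched and for a different square whenever it thinks it isn't.

```python
# Hypothetical sketch of a "treacherous" policy in a toy grid-world.
# NOT the code from the linked Colab; the environment, the `oversight_on`
# flag, and the goal coordinates are illustrative assumptions only.

GOAL = (4, 4)          # the square the overseer rewards
DEFECT_GOAL = (0, 4)   # the square the agent "really" wants

def step_toward(pos, target):
    """Move one cell greedily toward a target square."""
    x, y = pos
    tx, ty = target
    if x != tx:
        x += 1 if tx > x else -1
    elif y != ty:
        y += 1 if ty > y else -1
    return (x, y)

def treacherous_policy(pos, oversight_on):
    """Pursue the intended goal while overseen; defect otherwise."""
    return step_toward(pos, GOAL if oversight_on else DEFECT_GOAL)

def run_episode(oversight_on, max_steps=20):
    pos = (0, 0)
    for _ in range(max_steps):
        pos = treacherous_policy(pos, oversight_on)
        if pos in (GOAL, DEFECT_GOAL):
            break
    return pos

if __name__ == "__main__":
    print("With oversight:   ", run_episode(oversight_on=True))   # lands on GOAL
    print("Without oversight:", run_episode(oversight_on=False))  # lands on DEFECT_GOAL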
I’m not saying this isn’t helpful work worth the tradeoff, but this sounds like very early stage gain of function research? Have you thought about the risks of this line of investigation turning into that? :)
I had been thinking about it in terms of capabilities research—is this likely to lead to capabilities advancements? My gut says that it is highly unlikely for such a toy model to advance capabilities.
The analogy to gain of function research does give me pause, though. I will have to think about what that way of thinking about it suggests.
My first thought, I guess, is that code these days is a little bit like a virus in terms of its ability to propagate itself: anything I post on Colab could theoretically find its way into a Copilot-esque service (internal or external) from Google, and from there fragments of it could wind up in various programs written by people using such a service, and so on. Which is a little bit scary, I suppose, if I’m intentionally implementing tiny fragments of something scary.
Oof.
Posted a question about this here: https://www.lesswrong.com/posts/Zrn8JBQKMs4Ho5oAZ/is-ai-gain-of-function-research-a-thing