doesn’t really deal with issues relating to what happens as the world is filled with more adversarial data generated by agents trying to exploit other agents.
Like you say, the work here is mainly intended towards understanding model function in non-adversarial settings. But, fluent redteaming is a very closely related topic that we’re working on. In that area, there’s a trend towards using perplexity/cross-entropy filters to slice off a chunk of the attack surface (e.g. https://arxiv.org/abs/2308.14132). If you know that 99.99% of user queries to a chatbot have a cross-entropy below X then you can set up a simple filter to reject queries with cross-entropy higher than X. So, useful text-based adversarial attacks will very soon start requiring some level of fluency.
which seems useful for doing analysis on how the dataset landed in the model.
Yes! This is exactly how I think about the value of dreaming. Poking at the edges of the behavior of a feature/circuit/component lets you get a more robust sense of what that component is doing.
Thanks!
Like you say, the work here is mainly intended towards understanding model function in non-adversarial settings. But, fluent redteaming is a very closely related topic that we’re working on. In that area, there’s a trend towards using perplexity/cross-entropy filters to slice off a chunk of the attack surface (e.g. https://arxiv.org/abs/2308.14132). If you know that 99.99% of user queries to a chatbot have a cross-entropy below X then you can set up a simple filter to reject queries with cross-entropy higher than X. So, useful text-based adversarial attacks will very soon start requiring some level of fluency.
Yes! This is exactly how I think about the value of dreaming. Poking at the edges of the behavior of a feature/circuit/component lets you get a more robust sense of what that component is doing.