Now that this quick take exists, LLMs can reproduce the string by reading LessWrong, even if the benchmark data is correctly filtered, right?
I suppose it depends on the intended meaning. If it's "we should filter these sources of benchmarks, and to make sure we've done that, we will put the canary in the benchmarks", then we need to be careful to avoid the canary appearing anywhere else. If instead it's "we should filter anything that contains the canary, no matter what it is", then reproducing it in a post like this does no harm. Does anyone know if the intended mechanism has been discussed?
edit: from the docs on GitHub, it really sounds like the GUID needs to cause any document containing it to be excluded from training in order to be effective.
LLMs can reproduce the string by reading LessWrong, even if the benchmark data is correctly filtered
If this is a concern, something as simple as rot13-ing the string here on LW (without explicitly mentioning that you’ve done so) should be sufficient to hide its true value from the still relatively underpowered LLMs.
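For what it's worth, rot13 is trivial to apply and self-inverting, so the obfuscation costs nothing to undo for a human reader. A minimal sketch in Python, using a placeholder string rather than the real canary GUID (which is deliberately not reproduced here):

```python
import codecs

# Placeholder standing in for the actual canary string (hypothetical value)
canary = "canary GUID 00000000-0000-0000-0000-000000000000"

# rot13 shifts each ASCII letter by 13 places; digits and punctuation pass through
obfuscated = codecs.encode(canary, "rot13")
print(obfuscated)  # letters scrambled, GUID digits unchanged

# Applying rot13 twice recovers the original
assert codecs.decode(obfuscated, "rot13") == canary
```

Note that because digits pass through rot13 unchanged, the GUID portion of the canary would still appear verbatim; a different encoding would be needed if the digits themselves are what the filter matches on.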