Thanks for the example! It moderately shifted me: CoT becoming hard for humans to understand in the next 2 years now seems slightly more plausible to me than it did before.
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and GPT-4o likely will too. See the discussion in another comment thread about the "chicken-and-egg" problem when learning new encodings: I think it makes the model much more likely to learn encodings like switching languages or skipping stopwords, which take almost no additional effort to decode (so the model is incentivized to adopt the new encoding without first having to learn a new decoder). I see how, once the model is frequently "freestyling words", the decoder gradually gets better at understanding freestyling in general (though slowly, since it is trained with RL rather than SL), which eventually allows a shift to encodings that the base model and GPT-4o don't understand. But my guess is that this will take a long time (>the first 1e28 FLOP RL run?).
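To make the "almost no additional effort to decode" point concrete, here is a toy sketch (my own illustration, not something from the parent comment): dropping stopwords shortens the CoT while leaving it readable to any existing decoder, so no new decoding skill has to be learned in lockstep with the new encoding.

```python
# Toy illustration: a "skip stopwords" encoding is trivially decodable
# because the surface form stays readable, so the model can start using
# it without having to learn a matching decoder.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "that", "in", "it"}

def skip_stopwords(cot: str) -> str:
    """Drop common stopwords from a chain-of-thought string."""
    return " ".join(tok for tok in cot.split() if tok.lower() not in STOPWORDS)

original = "the derivative of the function is equal to two times x"
encoded = skip_stopwords(original)
print(encoded)  # "derivative function equal two times x" - still easy to read
```

A genuinely new encoding, by contrast, would only pay off once a matching decoder exists, which is the chicken-and-egg problem above.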
They were separated by spaces. (But I’d encourage replication before updating too hard on results which I think are very weird.)