I see. I think the rest of my point still stands: as RL becomes more powerful, what the model says it thinks and what it actually thinks will naturally diverge even if we don’t pressure it to, and the best way to avoid this is to have it represent its thoughts in an intermediate format that it’s more computationally bound to. My first guess would be that going harder on discrete search, or more generally on something with smaller computational depth and massive breadth, would be a massive alignment win at near-ASI performance; even if we end up with problems like adverse selection, they will be a lot easier to work through.
I think it may or may not diverge from meaningful natural language in the next couple of years, and importantly I think we’ll be able to roughly tell whether it has. So I think we should just wait and see (although finding other formats for interpretable autoregression could be good too).