Another possible configuration of the untrusted smart model / trusted weak model setup is to have the weak model be the “driver” which makes most or all of the relevant choices (in an agent scaffold, say) and relies on the untrusted model (or humans) only for hints.
For example, the untrusted model could be used only to provide a probability distribution over the immediate next token in some context, which is supplied to the weaker model as a hint. The weaker model can only make use of the hint if it can understand the purpose of the hint, and explain the relevance / correctness of the hint in natural language.
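The control flow of this protocol can be sketched as follows. This is a minimal toy illustration, not a real implementation: the model functions are hypothetical stand-ins (a hand-coded hint distribution and a trivial justification check), and the names are my own. The point it shows is that the untrusted model's only channel of influence is the next-token distribution, and the trusted driver uses the hint only when it can articulate why the hint is correct.

```python
def untrusted_next_token_hint(context, top_k=3):
    # Stand-in for the untrusted model: its ONLY output channel is a
    # probability distribution over the immediate next token.
    toy_distribution = {"4": 0.9, "5": 0.05, "3": 0.05}
    top = sorted(toy_distribution.items(), key=lambda kv: -kv[1])[:top_k]
    return dict(top)

def justify(context, candidate):
    # Stand-in for the trusted weak model's justification step: it may
    # accept a hinted token only if it can explain, in natural language,
    # why the token is correct (here: by independently re-deriving it).
    if context == "2 + 2 =" and candidate == str(2 + 2):
        return f"'{candidate}' is correct because 2 + 2 = {2 + 2}."
    return None  # cannot explain the hint, so it must not be used

def own_answer(context):
    # The trusted model's unaided (possibly worse) fallback answer.
    return "unknown"

def trusted_driver(context):
    # The weak model is the driver: it queries the untrusted model for a
    # hint, but only acts on it if the justification step succeeds.
    hint = untrusted_next_token_hint(context)
    best = max(hint, key=hint.get)
    explanation = justify(context, best)
    if explanation is not None:
        return best, explanation
    return own_answer(context), "hint rejected: no understandable justification"

token, why = trusted_driver("2 + 2 =")
```

Here the justification check is trivially verifiable arithmetic; in the real setup it would be the weak model's natural-language explanation, evaluated for plausibility before the hint is allowed to affect any decision.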
I explored this setup in more detail here.
(Summary / spoiler: it doesn’t work particularly well when applied to current models, but might work better as both the trusted and untrusted models get more capable.)
I think this kind of configuration sits at one extreme of the design space you outline—very inefficient in terms of the speed and quality penalties it imposes relative to using a model in an unrestricted setup, but restrictive enough on the untrusted model that it could be safe to use even on models with very high underlying capabilities.