[4] AI safety relevant side note: The idea that translations of meaning need only be sufficiently reliable in order to be reliably useful might provide an interesting avenue for AI safety research. [...]
I also see this as an interesting (and pragmatic) research direction. However, I think its usefulness hinges on the ability to robustly quantify the required alignment reliability / precision for the various levels of optimization power involved. Only then might it be possible to engineer demonstrably safe scenarios in which alignment quality tracks the growth of optimization power (made easier by slower growth of optimization power).
I think this is a very hard task: because of unknown unknowns around strong optimizers, practically (we would still need increasingly good alignment), and also fundamentally (if we could define this rigorously, perfect alignment would be the natural limit).
On the other hand, a cascade of practically sufficient alignment mechanisms is one of my favorite ways to interpret Paul's IDA (Iterated Distillation and Amplification): the alignment-preserving translation happening sufficiently reliably at every distillation step.
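To make the "sufficiently reliably at every distillation step" point concrete, here is a toy back-of-the-envelope sketch (entirely illustrative and my own framing, assuming independent per-step failures and a single scalar notion of "alignment reliability", neither of which is claimed anywhere in this thread):

```python
# Toy model: if each distillation step preserves "sufficient alignment"
# independently with probability (1 - eps), the whole n-step cascade stays
# aligned with probability (1 - eps)**n. Independence and a scalar notion of
# reliability are simplifying assumptions made only for this illustration.

def required_per_step_reliability(n_steps: int, end_to_end_target: float) -> float:
    """Per-step reliability needed for the full cascade to hit the target."""
    return end_to_end_target ** (1.0 / n_steps)

for n in (10, 100, 1000):
    r = required_per_step_reliability(n, end_to_end_target=0.99)
    print(f"{n:>4} steps -> per-step reliability >= {r:.6f}")
```

Even under these very generous assumptions, 1000 steps already demand roughly 0.99999 per step, which is one way of seeing why the required per-step alignment quality has to keep tracking the growth of the cascade (and of the optimization power it amplifies).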
Re language as an example: the parties communicating through language have comparable intelligence (and even there, I would say, someone just a bit smarter can cheat their way around you using language).
Mhh, yeah, I agree these are examples of ways in which language "fails". But I think they don't bother me too much? I put them in the same category as "two agents acting in good faith sometimes miscommunicate, and still, language overall is pragmatically useful", or "works well enough". In other words, even though there is potential for exploitation, that potential is in fact meaningfully constrained. More importantly, I would argue that the constraint comes (in large part) from the way language has been (co-)constructed.
However, I think its usefulness hinges on the ability to robustly quantify the required alignment reliability / precision for the various levels of optimization power involved.
I agree and think this is a good point! I think that, on top of quantifying the required alignment reliability "at various levels of optimization", it would also be relevant to take the underlying territory/domain into account. We can say that a territory/domain has a specific epistemic and normative structure (which, e.g., defines what error margin is acceptable, or tracks the co-evolutionary dynamics).
Yeah, great point!