I wonder, almost out of idle curiosity, whether the “measuring-via-proxy will cause value drift” problem is something we could formalize and iterate on first. Is the problem stable at the meta-level, or is there a way to meaningfully define “not drifting from the proxy” without just solving alignment in general?
Intuitively I’d guess this falls into the “don’t try to be cute” class of thought, but I was afraid to post at all and decided that I wanted to interact, even at the cost of (probably) saying something embarrassing.
It is at least not obvious to me that this is a “don’t try to be cute” class of thought, though not obvious that it isn’t either. Depends on the details.
This started out as more of an intuition, so what follows is mostly an attempt to verbalize it in a concrete way.
If we could formalize a series of relatively simple problems, and similarly formalize what “drifting” from the core values of those toy problems would look like, I wonder whether we would find new patterns, rules, or intuitions.
(I had a pithy remark to the effect of While(drifting) { dont() } )
I think I’m wondering whether we can expand and formalize our knowledge of what value drift means, in a way that generalizes independently of any specific, formalized values.
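For concreteness, here is a minimal sketch of what one of those toy problems might look like (in Python; the specific functions are just my own made-up example, not a proposed formalization). The proxy tracks the true value over ordinary states but comes apart under heavy optimization pressure, and “drift” is operationalized as the resulting gap in true value:

```python
import random

# Purely illustrative toy setup (my own construction, not a standard
# formalization): the proxy agrees with the true value over ordinary
# states but comes apart under heavy optimization pressure. "Drift" is
# operationalized as the gap in true value between optimizing the proxy
# and optimizing the true value directly.

def true_value(x):
    # What we actually care about: more x is good up to x = 5,
    # after which pushing further actively destroys value.
    return x - max(0.0, x - 5.0) ** 2

def proxy_value(x):
    # The measurable proxy: "more x is better", with no knowledge of
    # the regime where that stops being true.
    return x

def optimize(metric, steps, step_size=0.5, seed=0):
    # Simple hill climb; more steps = more optimization pressure.
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        if metric(candidate) > metric(x):
            x = candidate
    return x

for steps in (10, 100, 1000):
    x_proxy = optimize(proxy_value, steps)
    x_true = optimize(true_value, steps)
    drift = true_value(x_true) - true_value(x_proxy)
    print(f"steps={steps:5d}  proxy-optimized x={x_proxy:7.1f}  "
          f"drift in true value={drift:9.1f}")
```

Running it, the gap stays near zero at low optimization pressure and blows up once the proxy optimizer pushes into the regime where the proxy and the true value disagree, which is the kind of pattern I’d hope a more serious formalization could characterize.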