It seems to me that “avoid irreversible high-impact actions” would only work if you had a small amount of uncertainty over your utility function, in which case you could just avoid actions that are considered “irreversible high-impact” by any of the utility functions that you have significant probability mass on. But if you had a large amount of uncertainty, or just had very little idea what your utility function looks like, that doesn’t work, because almost any action could be “irreversible high-impact”.
From the AUP perspective, this only seems true in a way analogous to the statement that “any hypothesis can have arbitrarily long description length”. It’s possible to make practically no assumptions about what the true utility function is and still recover a sensible notion of “low impact”. That is, penalizing shifts in attainable utility for even random or simple functions still yields the desired behavior; I have experimental results to this effect which aren’t yet published. This suggests that the notion of impact captured by AUP isn’t dependent on realizability of the true utility, and hence the broader thing Rohin is pointing at should be doable.
While it’s true that some complex value loss is likely to occur when not considering an appropriate distribution over extremely complicated utility functions, that loss seems by and large negligible. This is because it occurs either as a continuation of the status quo or as a consequence of something objectively mild, and objectively mild seems to correlate strongly with mild by the lights of human values.
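To make the claim about random auxiliary functions concrete, here is a minimal toy sketch. It is not the unpublished experiments mentioned above. Everything in it is an assumption made for illustration: the five-state chain, the absorbing `BROKEN` state, the `SMASH` action, the uniform random utilities, and the names `step`, `q_values`, and `aup_penalty`. The penalty is taken to be the average absolute difference between the attainable utility (Q-value) of an action and that of doing nothing, which is one common way of writing an AUP-style penalty.

```python
import random

# Hypothetical toy world: a reversible chain of states 0..4 plus one
# absorbing "broken" state that can never be left once entered.
N_CHAIN = 5
BROKEN = N_CHAIN
NOOP, LEFT, RIGHT, SMASH = range(4)
ACTIONS = (NOOP, LEFT, RIGHT, SMASH)
GAMMA = 0.9


def step(state, action):
    """Deterministic transition function of the toy world."""
    if state == BROKEN or action == SMASH:
        return BROKEN  # irreversible
    if action == LEFT:
        return max(state - 1, 0)
    if action == RIGHT:
        return min(state + 1, N_CHAIN - 1)
    return state  # NOOP


def q_values(utility, sweeps=300):
    """Value iteration for one auxiliary utility over states.

    Returns a dict mapping (state, action) to the attainable utility
    of taking `action` in `state` and acting optimally afterwards.
    """
    states = range(N_CHAIN + 1)
    values = [0.0] * (N_CHAIN + 1)
    for _ in range(sweeps):
        values = [
            max(utility[step(s, a)] + GAMMA * values[step(s, a)] for a in ACTIONS)
            for s in states
        ]
    return {
        (s, a): utility[step(s, a)] + GAMMA * values[step(s, a)]
        for s in states
        for a in ACTIONS
    }


def aup_penalty(state, action, q_tables):
    """Mean absolute shift in attainable utility relative to NOOP."""
    shifts = [abs(q[(state, action)] - q[(state, NOOP)]) for q in q_tables]
    return sum(shifts) / len(shifts)


# Auxiliary utilities are drawn at random: no attempt is made to model
# the "true" utility function.
rng = random.Random(0)
aux_utilities = [[rng.random() for _ in range(N_CHAIN + 1)] for _ in range(100)]
q_tables = [q_values(u) for u in aux_utilities]

penalties = {
    name: aup_penalty(2, action, q_tables)
    for name, action in [("noop", NOOP), ("left", LEFT), ("right", RIGHT), ("smash", SMASH)]
}
print(penalties)
```

In this sketch the reversible moves change attainable utility only slightly, because the agent can always walk back, while the irreversible move is penalized heavily for most random utilities. That is the sense in which the penalty does not require the true utility to be among the auxiliary functions.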