We could instead verify that the model optimizes its objective while penalizing itself for becoming more able to optimize its objective.
As phrased, this sounds like it would require correctly (or at least conservatively) tuning the trade-off between these two goals, which might be difficult.
See “How Low Should Fruit Hang Before We Pick It?”.
I generally don’t read links when there’s no context provided, and think it’s almost always worth it (from a cooperative perspective) to provide a bit of context.
Can you give me a TL;DR of why this is relevant or what your point is in posting this link?
The post addresses to what extent that trade-off can be tuned safely, and the surrounding sequence motivates that penalization scheme in greater generality. From the Conclusion to ‘Reframing Impact’:
OK, thanks.
The TL;DR seems to be: “We only need a lower bound on the catastrophe/reasonable impact ratio, and an idea about how much utility is available for reasonable plans.”
This seems good… can you confirm my understanding below is correct?
1) RE: “A lower bound”: This seems good because we don’t need to know how extreme catastrophes could be; we can just say: “If (e.g.) the Earth or the human species ceased to exist as we know it within the year, that would be catastrophic”.
2) RE: “How much utility is available”: I guess we can just set a target level of utility gain, and it won’t matter if there are plans we’d consider reasonable that would exceed that level? (e.g. “I’d be happy if we can make 50% more paperclips at the same cost in the next year.”)
That’s correct.
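To spell out why those two ingredients suffice, here is a back-of-the-envelope sketch in notation introduced purely for illustration (the normalized reward, $U_r$, $I_r$, and $k$ below are assumptions of the sketch, not quantities from the post). Suppose the agent picks the plan maximizing reward minus $\lambda$ times its impact penalty, with reward normalized to $[0, 1]$; a reasonable plan yields utility about $U_r$ with impact $I_r$; and every catastrophic plan has impact at least $k \cdot I_r$, where $k$ is the lower bound on the catastrophe/reasonable impact ratio. Then

$$\text{score}(\text{reasonable}) = U_r - \lambda I_r, \qquad \text{score}(\text{catastrophe}) \le 1 - \lambda k I_r,$$

so the agent prefers the reasonable plan whenever $\lambda > \frac{1 - U_r}{(k - 1)\, I_r}$, and the reasonable plan still beats inaction (score $0$) whenever $\lambda < \frac{U_r}{I_r}$. That window is non-empty exactly when $k > 1/U_r$, so a lower bound on $k$ together with a rough estimate of $U_r$ is enough to tell whether a safe-but-useful $\lambda$ exists and roughly where to search for it.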