Note that the “without countermeasures” post consistently discusses both possibilities.
Yepp, agreed, the thing I’m objecting to is how you mainly focus on the reward case, and then say “but the same dynamics apply in other cases too...”
I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy.
The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up meaning reasoning about the underlying motivations (whether implicitly or explicitly).