I have no idea what I mean, on further reflection. I’m as confused as you are on why this is hard if we have an accurate utility function sitting right there. Maybe the idea is that subject to optimization pressure it would fail?
Yeah so I think that’s what the adversarial example/OOD people worry about.
That just seems… like it buys you a lot? And like we should focus more on those problems specifically.
I have no idea what I mean, on further reflection. I’m as confused as you are on why this is hard if we have an accurate utility function sitting right there. Maybe the idea is that subject to optimization pressure it would fail?
Yeah so I think that’s what the adversarial example/OOD people worry about. That just seems… like it buys you a lot? And like we should focus more on those problems specifically.