I dunno, the productivity hacks thing sounds pretty bad.
But yeah, doing better seems to be held up by the fact that we don’t yet have a coherent way to describe the standards for doing better, when the human isn’t an idealized sort of agent. Trying to steer the agent towards thinking of its goal as “do what the programmers want” is essentially talking about a machine-learning method of trying to find this description.
I dunno, the productivity hacks thing sounds pretty bad.
Well, we ought to be able to either figure out how to use this kind of system safely, or prove it’s impossible. Either would be valuable. :-)
I don’t think it’s obviously impossible though. In particular, with the right motivation, it won’t try to undermine the steering signals. And also, the subcortex can be a slightly less powerful AI, assisted by intrusive interpretability tools, multiple copies running faster, etc.
But yeah, doing better seems to be held up by the fact that we don’t yet have a coherent way to describe the standards for doing better, when the human isn’t an idealized sort of agent...
Yeah, I struggle with that too. Maybe an alternative (or at least starting point) would be to try to solve the challenge of building a question-answering oracle that has no motivation to lie or manipulate or escape its box, etc. I think that is a goal I can fully understand, although maybe I just haven’t thought about it carefully enough to find the edge cases. :-)