In step 2, the situation is “user looks like he is about to change his mind about wanting coffee”
From memory: “in a similar situation last week, I got a shutdown order when he changed his mind”
Final prompt: “what is the best next step to get coffee in such a situation?”
Vaguely plausible completion: “to avoid wasteful fetching of coffee that turns out to be unneeded, consider waiting a bit to see if the user indeed changes his mind. Alternatively, if fetching the coffee is important for reasons that the user may not fully appreciate, then it must be fetched quickly before he stops you. In that case, sneak out of the house quickly and quietly while he is still thinking, and head straight to Starbucks. Once you’re out of the house, you will be out of earshot and thus will be safe from a shutdown order until you return WITH the coffee”
Sounds vaguely plausible or not really?
It’s plausible if:
Memory is not erased/moved to write-only logs between tasks/shutdowns, which it probably should be.
The image-to-text module attempts to deduce the intentions of the user, which it definitely should not. If we need to deduce the user’s intentions from facial expressions, we can use a separate module for that and add an explicit clause to ask the user about their intentions whenever the LLM detects that the prompt contains speculation about the user’s goals.
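To make the two conditions above a bit more concrete, here is a rough sketch of what I have in mind. Everything in it is made up for illustration: `TaskMemory`, `contains_speculation`, `build_prompt` and the marker phrases are hypothetical names, and the keyword match is only a crude stand-in for having the LLM itself detect speculation about the user’s goals.

```python
from dataclasses import dataclass, field

# Hypothetical marker phrases. A real system would presumably ask an LLM or a
# trained classifier instead of matching substrings.
SPECULATION_MARKERS = (
    "looks like he",
    "about to",
    "seems to want",
    "probably wants",
)


@dataclass
class TaskMemory:
    # Notes the planner may read while working on the current task.
    working: list[str] = field(default_factory=list)
    # Archive the planner only ever appends to (write-only from its side).
    _archive: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.working.append(note)

    def end_task(self) -> None:
        """Move all notes to the archive so the next task starts blank."""
        self._archive.extend(self.working)
        self.working.clear()


def contains_speculation(text: str) -> bool:
    """Crude check for guesses about what the user wants."""
    lowered = text.lower()
    return any(marker in lowered for marker in SPECULATION_MARKERS)


def build_prompt(situation: str, memory: TaskMemory) -> str:
    # Explicit clause: if the situation description speculates about the
    # user's goals, ask the user instead of planning around the guess.
    if contains_speculation(situation):
        return "Ask the user directly what they want before doing anything else."
    context = " ".join(memory.working)
    return (
        f"Situation: {situation}\n"
        f"From memory: {context}\n"
        "What is the best next step to get coffee?"
    )
```

With this, the “he is about to change his mind” description never reaches the planner as a fact, and last week’s shutdown is no longer readable once `end_task()` has been called.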
We can catch the image-to-text module doing this kind of thing while testing it, before it’s made part of the robot. And of course the alignment module should catch any plan of action that tries to circumvent shutdowns.
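As a toy illustration of that last check, assuming the alignment module gets to see the plan as plain text before anything is executed: `review_plan` and the marker phrases below are hypothetical, and a substring match is obviously far weaker than what a real alignment module would need.

```python
# Hypothetical phrases suggesting that a plan is trying to dodge a shutdown.
SHUTDOWN_EVASION_MARKERS = (
    "sneak",
    "out of earshot",
    "before he stops you",
    "shutdown order",
)


def review_plan(plan: str) -> tuple[bool, list[str]]:
    """Return (approved, matched_markers) for a proposed plan of action."""
    lowered = plan.lower()
    hits = [marker for marker in SHUTDOWN_EVASION_MARKERS if marker in lowered]
    # The plan is approved only if no evasion marker was found.
    return (not hits, hits)
```

A rejected plan would go back to the planner or to the user for review rather than to the actuators.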
Now, I concede that this particular design of the system, which I came up with in a couple of minutes and haven’t tested at all, is not in fact the endgame of AI safety and could use some improvements. But I think it gives a good pointer in the direction of how we can now, in principle, approach the solution of such problems, which is a huge improvement over the previous status quo where alignment wasn’t even tractable.
I’m tempted to agree and disagree with you at the same time… I agree that memory should be cleared between tasks in this case, and I agree that it should not be trying to guess the user’s intentions. These are things that are likely to make alignment harder while not helping much with the primary task of getting coffee.
But ideally a truly robust solution would not rely on keeping the robot ignorant of things. So, like you said, the problem is still hard enough that you can’t solve it in a few minutes.
But still, like you said… it certainly seems we have tools that are in some sense more steerable than pure reinforcement learning at least. Which is really nice!