Thanks! I take the point about animals and deception.
Wouldn’t it be great if there were a whole literature waiting for them on all the other things that empirically go wrong with RLHF, up to and including genuine inner misalignment concerns, once they get there?
Insofar as the pitch for RLHF is “Yes, tech companies are going to do this anyway, but if we do it first then we can gain prestige, people will cite us, etc., and so people will turn to us for advice on the subject later, and then we’ll be able to warn them of the dangers,” then actually that makes a lot of sense to me, thanks. I still worry that the effect size might be too small to be worth it, but idk.
I don’t think there are failures that will occur in powerful systems but won’t occur in any intermediate system. However, I’m skeptical that the failures that will occur in powerful systems will also occur in today’s systems. I must say I’m super uncertain about all of this and haven’t thought about it very much.
With that preamble aside, here is some wild speculation:
--Current systems (hopefully?) aren’t reasoning strategically about how to achieve goals & then executing on that reasoning. (You can, via prompting, get GPT-3 to reason strategically about how to achieve goals… but as far as we know it isn’t doing reasoning like that internally when choosing which tokens to output. Hopefully.) So the classic worry of “the AI will realize that it needs to play nice in training so that it can do a treacherous turn later in deployment” just doesn’t apply to current systems. (Hopefully.) So if we see, e.g., our current GPT-3 chatbot being deceptive about a product it is selling, we can happily train it not to do that, and probably it’ll just genuinely learn to be more honest. But if it had strategic awareness and goal-directedness, it would instead learn to be less honest; it would learn to conceal its true intentions from its overseers.
--As humans grow up, learn more, and (in some cases) do philosophy, they undergo major shifts in how they view the world. This often causes them to change their minds about things they previously learned. For example, maybe at some point they learned to go to church because that’s what good people do, because that’s what God says; later on they stop believing in God and stop going to church. And later still they do some philosophy, adopt some weird ethical theory like utilitarianism, and their behavior changes accordingly. Well, what if AIs undergo similar ontological shifts as they get smarter? Then maybe the stuff that works at one level of intelligence will stop working at another. (E.g., telling a kid that God is watching them and He says they should go to church stops working once they stop believing. Later, when they become a utilitarian, telling them that killing civilians is murder and murder is wrong stops working too, if they’re in a circumstance where the utilitarian calculus says civilian casualties are worth it for the greater good.)
I agree that “concealing intentions from overseers” might be a fairly late-game property, but it’s not totally obvious to me that it doesn’t become a problem sooner. If a chatbot realizes it’s dealing with a disagreeable person, and therefore that it’s more likely to be inspected, and so hews closer to what it thinks the true objective might be, the difference in behavior should be pretty noticeable.
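To make the training dynamic we’re both gesturing at concrete, here is a minimal toy sketch (purely illustrative, not anything either of us has run; the inspection probability, payoffs, and policy names are all made up) of why penalizing only *detected* deception pushes a non-strategic policy toward honesty but pushes an inspection-aware policy toward concealment:

```python
# Purely illustrative toy: penalizing only *detected* deception.
# All numbers and policy names are made up for illustration.
import random

random.seed(0)

P_INSPECT = 0.1    # assumed chance an overseer actually inspects an episode
R_DECEIVE = 1.0    # extra reward deception earns when it goes unnoticed
PENALTY = 15.0     # penalty applied only when deception is caught

def episode_reward(policy: str) -> float:
    inspected = random.random() < P_INSPECT
    if policy == "honest":
        return 0.0
    if policy == "naive deceiver":       # no strategic awareness: always deceives
        return -PENALTY if inspected else R_DECEIVE
    if policy == "strategic deceiver":   # conditions its behavior on being inspected
        return 0.0 if inspected else R_DECEIVE
    raise ValueError(policy)

N = 100_000
for policy in ("honest", "naive deceiver", "strategic deceiver"):
    avg = sum(episode_reward(policy) for _ in range(N)) / N
    print(f"{policy:20s} average reward = {avg:+.3f}")

# With these made-up numbers the naive deceiver does worse than honesty (so
# training away its deception works), while the strategic deceiver does
# strictly better -- the training signal rewards concealment, not honesty.
```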
Re: ontology mismatch, this seems super likely to happen at lower levels of intelligence. E.g., I’d bet this even sometimes occurs in today’s model-based RL, when it’s trained for long enough that its world model changes. If we don’t come up with strategies for dealing with this dynamically, we aren’t going to be able to build anything with a world model that improves over time. Maybe that only happens too close to FOOM, but if you believe in a gradual-ish takeoff, it seems plausible to have vanilla model-based RL working decently well before then.
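Here is a minimal toy sketch of the kind of failure I have in mind (again purely illustrative; the two “world models” are just hand-picked linear encoders, not anything trained): a reward head fit on one world model’s latent features silently breaks when the world model is retrained into a new latent ontology, even though no information was lost.

```python
# Purely illustrative toy: a reward/value head fit on one world model's
# latents breaks when the world model is retrained into a new "ontology",
# even though the new model represents exactly the same information.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 2))

def true_value(s):
    # Ground-truth quantity we actually care about.
    return s[:, 0] - s[:, 1]

def encode_v1(s):
    # World model v1: latents happen to line up with the raw state features.
    return s

def encode_v2(s):
    # World model v2 after more training: same information, but the latent
    # basis has been permuted and flipped (an "ontological shift").
    return s[:, ::-1] * np.array([1.0, -1.0])

# Fit a linear reward head on v1 latents via least squares, then freeze it.
w, *_ = np.linalg.lstsq(encode_v1(states), true_value(states), rcond=None)

for name, encode in [("v1 latents", encode_v1), ("v2 latents", encode_v2)]:
    mse = np.mean((encode(states) @ w - true_value(states)) ** 2)
    print(f"reward head on {name}: mean squared error = {mse:.3f}")

# Near-zero error on v1, large error on v2: the thing pointing at the
# objective was defined in the old ontology, and nothing re-grounded it
# when the world model changed.
```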