Daniel Kokotajlo comments on How do you feel about LessWrong these days? [Open feedback thread]

Daniel Kokotajlo 31 Jan 2024 6:20 UTC
2 points
0
I agree that if RLHF scaled all the way to von neumann then we’d probably be fine. I agree that the point at which RLHF breaks down is enormously important to overall alignment difficulty.

I think if you had described to me in 2019 how GPT4 was trained, I would have correctly predicted its current qualitative behavior. I would not have said that it would do 1, 2, or 3 to a greater extent than it currently does.

I’m in neither category (1) or (2); it’s a false dichotomy.
- Matthew Barnett 31 Jan 2024 7:11 UTC
  2 points
  0
  Parent
  I’m in neither category (1) or (2); it’s a false dichotomy.
  The categories were conditioned on whether you’re “not updating at all on observations about when RLHF breaks down”. Assuming you are updating, then I think you’re not really the the type of person who I’m responding to in my original comment.
  But if you’re not updating, or aren’t updating significantly, then perhaps you can predict now when you expect RLHF to “break down”? Is there some specific prediction that you would feel comfortable making at this time, such that we could look back on this conversation in 2-10 years and say “huh, he really knew broadly what would happen in the future, specifically re: when alignment would start getting hard”?
  (The caveat here is that I’d be kind of disappointed by an answer like “RLHF will break down at superintelligence” since, well, yeah, duh. And that would not be very specific.)
  - Daniel Kokotajlo 31 Jan 2024 16:38 UTC
    6 points
    4
    Parent
    I’m not updating significantly because things have gone basically exactly as I expected.
    
    As for when RLHF will break down, two points:
    
    (1) I’m not sure, but I expect it to happen for highly situationally aware, highly agentic opaque systems. Our current systems like GPT4 are opaque but not very agentic and their level of situational awareness is probably medium. (Also: This is not a special me-take. This is basically the standard take, no? I feel like this is what Risks from Learned Optimization predicts too.)
    
    (2) When it breaks down I do not expect it to look like the failures you described—e.g. it stupidly carries out your requests to the letter and ignores their spirit, and thus makes a fool of itself and is generally thought to be a bad chatbot. Why would it fail in that way? That would be stupid. It’s not stupid.
    
    (Related question: I’m pretty sure on r/chatgpt you can find examples of all three failures. They just don’t happen often enough, and visibly enough, to be a serious problem. Is this also your understanding? When you say these kinds of failures don’t happen, you mean they don’t happen frequently enough to make ChatGPT a bad chatbot?)
    - Daniel Kokotajlo 1 Feb 2024 17:32 UTC
      2 points
      0
      Parent
      Re: Missing the point: How?
      
      Re: Elaborating: Sure, happy to, but not sure where to begin. All of this has been explained before e.g. in Ajeya’s Training Game report for example. Also Joe Carlsmith’s thing. Also the original mesaoptimizers paper, though I guess it didn’t talk about situational awareness idk. Would you like me to say more about what situational awareness is, or what agency is, or why I think both of those together are big risk factors for RLHF breaking down?