I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don’t think any AI lab will agree to do that, because such a future AI would probably be less useful than even current AIs.
Or are you proposing that we use AIs to monitor our leading future AI models, and then heavily restrict only the monitors?
My proposal is to restrain the AI monitor’s domain only.
I agree this is a reduction in capability compared to an unconstrained AI, but at least in the internal-use setting (rather than external deployment), you probably don’t need, and maybe don’t want, the monitor to be able to write fictional stories or tell calming stories; you want to use it for specific work tasks.
That’s pretty interesting. I do think that if the iterative alignment strategy ends up working, then this will probably work too (if nothing else, because this seems much easier).
I still have some concerns about the iterative alignment strategy in general, so I will try to write them down below.
EDIT: On second thought, I might create a separate question for it (and link it here), for the benefit of everyone who is concerned about the same things (or similar things) that I am concerned about.
That would be great. Do reference scalable oversight to show you’ve done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.
Sure, I might as well ask my question directly about scalable oversight, since it seems like a leading strategy for iterative alignment anyway. I do have one preliminary question (which probably isn’t worth including in that post, given that it doesn’t ask about a specific issue or threat model, but rather about people’s expectations).
I take it that this strategy relies on evaluating research being easier than producing it? Do you expect that to be the case?