That’s pretty interesting. I do think that if the iterative alignment strategy ends up working, then this will probably end up working too (if nothing else, because this seems much easier).
I still have some concerns about the iterative alignment strategy in general, so I will try to write them down below.
EDIT: On second thought, I might create a separate question for it (and link it here), for the benefit of everyone who is concerned about the same (or similar) things that I am.
That would be great. Do reference scalable oversight to show you’ve done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.
Sure, I might as well ask my question directly about scalable oversight, since it seems to be a leading strategy within iterative alignment anyway. I do have one preliminary question (which probably isn’t worth including in that post, given that it doesn’t ask about a specific issue or threat model, but rather about people’s expectations).
I take it that this strategy relies on evaluating research being easier than coming up with it? Do you expect this to be the case?