I’ve been thinking about exercises for alignment, and I think going through a list of lethalities and applying them to an alignment proposal would be a good one. Doing the same with Paul’s list would be a bonus challenge. If I had a pre-written answer sheet for one proposal, I could try the exercise myself to see how useful it would be. This post, which I haven’t read yet, looks like it would serve for the case of RLHF. I’ll try it tomorrow and report back here.