They’re planning on deliberately training misaligned models!!!! This seems bad if they mean it.
Is this an actual quote, or did you just infer it from the text? Because I would be very surprised if they are deliberately training AI models to be misaligned.
Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).
It's a quote. I recommend reading the article. It's very short.
Yeah, I’m not a fan of that paragraph, and in particular I do suspect this may blow up in their faces, though the rest of the plan is probably fine from my perspective.
“Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).”
The quote: “Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).”
Blow up in their faces?