Curated. [Edit: no longer particularly endorsed in light of Rohin’s comment, although I have not yet really vetted Rohin’s comment either and am currently agnostic on how important this post is]
When I first started following LessWrong, I thought the Sequences made a good theoretical case for the difficulties of AI alignment. In the past few years we’ve seen more concrete, empirical examples of how AI progress can take shape and why that might be alarming. We’ve also seen simple, concrete examples of AI failure, such as specification gaming and the like.
I haven’t been following all of this in depth and don’t know how novel the claims here are [fake edit: gwern notes in the comments that similar phenomena have been observed elsewhere]. But this seemed noteworthy as an empirical observation of some of the more complex concerns about inner alignment.
I’m interested in seeing more discussion of these results, what they mean, and how people think about them.