Vaniver is right. Note that I did specifically describe myself as an “alignment insider” at the start of this post. I’ve read A List of Lethalities and lots of other writing by Yudkowsky. Though the post I’d cite in response to the “you’re not engaging with the strongest forms of my argument” claim would be the one where I pretty much did what Yudkowsky suggests:
To grapple with the intellectual content of my ideas, consider picking one item from “A List of Lethalities” and engaging with that.
My post Evolution is a bad analogy for AGI: inner alignment specifically addresses List of Lethalities point 16:

16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about…
and then argues that we shouldn’t use evolution as our central example of an “outer optimization criteria versus inner formed values” outcome.
You can also see my comment here for some of what led me to write about the podcast specifically.