Note that I’m glossing over some subtleties related to which representations exactly get compressed, which I’ll address in my next post.
Request: also define ‘representation’ in your next post. I keep seeing humans writing about concepts and representations as if they’re necessary to intelligence, and it’s not clear to me exactly what they mean by those terms (though I suspect they think of their underlying cognition as being based on them).
Also, an explanation of why you expect a trained ASI to necessarily have terminal values in the first place (if you do) would be helpful. For example, what about training criteria under which being a world-valuing agent does not increase performance, such as an AI that just repeatedly answers one particular class of questions? Why would it end up being an agent, rather than {the structure of the process which produces answers to that class of questions}: a structure an agent would also use if it wanted to answer a question in that class, but which is itself decomposable from agency and trivially simpler than {itself plus agentic structure}?
This post is the version of Yudkowsky’s argument for inner misalignment that I wish I’d had in my head a few years ago. I don’t claim that it’s novel, that I endorse it, or even that Yudkowsky would endorse it; it’s primarily an attempt to map his ideas into an ontology that makes sense to me (and hopefully others).
I’d like to read Yudkowsky’s currently endorsed beliefs about this, if they’re available online anywhere.