Platonism
(status: uninterpretable to 2 of 4 reviewers, the two who did understand it being friends who are used to my writing style; I'll aim to write something that makes this concept simple to read)
‘Platonic’ is a categorization I use internally, and my agenda is currently the search for methods to ensure AI/ASI will have this property.
By this word, I mean the following acceptance (✅) and rejection (❌) of goal types:
✅ Has no goals
✅ Has goals about what to do in isolation. Example: “in isolation from any world, (try to) output A”[1]
❌ Has goals related to physical world states. Example: “(try to) ensure A gets stored in memory on the computer in the physical world that’s computing my output.”[2]
A can be ‘the true answer to the input question’, ‘a proof of x conjecture’, ‘the most common next symbol in x world prior to my existence in it’, etc.
As written here, this is a class of outer alignment solution. I need to write about why I believe it’s a more reachable target for ‘inner alignment’/‘training stories’, too.
[1] A more human-intuitive phrasing might be: “try to be the kind of program/function which would (in isolation from any particular worldstate/physics) output A.”
I’m leaving this as a footnote because it can also confuse people, leading to questions like “What does it mean to ‘try to be a kind of program’ when it’s already determined what kind of program it is?”
[2] This class of unaligned ‘physical goals’ is dangerous because, if the system can’t determine A, its best way to fulfill the goal is through instrumentally convergent behavior (e.g., acquiring resources and control over the physical world).