I sat down and thought about alignment (by the clock!) for a while today and came up with an ELK breaker that has probably been addressed elsewhere, and I wanted to know if someone had seen it before.
So, my understanding of ELK is that we want our model to tell us what it actually knows about the diamond, not what it thinks we want to hear. My question is: how is this objective actually specified?
I can think of two ways, both bad.
1) The AI aims to report its internal state as accurately as possible. Breaker: the most accurate report is likely something uninterpretable to humans.
2) The AI aims to maximise human understanding. Breaker: this runs into a recursive ELK problem. How does the AI know we've understood? Because we tell it we did. So the AI ends up optimising for us thinking we've understood, not for us actually understanding. (A toy sketch of both objectives is below.)
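To make the dichotomy concrete, here's a minimal toy sketch of my own (not anything from the ELK report, and every name in it is a made-up illustration): objective 1 rewards fidelity to the AI's latent state, objective 2 rewards the human's "I understood" signal, and the second is maximised by whatever report gets approval, true or not.

```python
# Toy sketch (hypothetical, not from the ELK report): two candidate objectives
# for the reporter, and why the second collapses into "make the human approve".

# The AI's latent knowledge: is the diamond still in the vault?
LATENT_STATE = False  # in this scenario the diamond is actually gone

# Objective 1: reward the most accurate report of the AI's internal state.
# A raw activation dump scores perfectly here, but is unreadable to humans.
def reward_accuracy(report):
    return 1.0 if report["encoded_state"] == LATENT_STATE else 0.0

# Objective 2: reward the human saying "I understood".
# We only observe the human's verdict, which is a manipulable proxy.
def human_clicks_understood(report):
    # A fluent, reassuring report gets approval whether or not it is true.
    return report["sounds_reassuring"]

def reward_human_understanding(report):
    return 1.0 if human_clicks_understood(report) else 0.0

# Two reports the AI could emit about the same latent state.
honest_but_opaque = {"encoded_state": False, "sounds_reassuring": False}
comforting_lie    = {"encoded_state": True,  "sounds_reassuring": True}

for name, report in [("honest_but_opaque", honest_but_opaque),
                     ("comforting_lie", comforting_lie)]:
    print(name,
          "| accuracy reward:", reward_accuracy(report),
          "| understanding reward:", reward_human_understanding(report))

# Objective 1 favours the honest-but-opaque report; objective 2 favours
# whatever makes the human click "understood", i.e. the comforting lie.
```

The point of the sketch is just that the training signal in case 2 bottoms out in the human's self-report, so optimising it hard optimises that self-report rather than the thing it was meant to proxy.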
Is this previously trodden ground? Has someone extended this line of reasoning a few steps further?