Sorry it took me a while to get to this.
Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do X, rather than X being imposed as a condition for doing what you actually want. For example, if your goal is to understand the importance of security mindset so that your company avoids security breaches, you will learn much more than you would if you were forced through mandatory security training. In the latter case, you are probably putting in the bare minimum of effort to pass the course and get back to whatever your actual job is. You are unlikely to learn security this way, and if you could press a button and instantly “pass” the course, you would.
I have in fact drawn a divide between two kinds of things in my post above. I suppose I would call them “goals” (the things you really want for their own sake) and “conditions” (the things you need to do for some external reason).
My inner MIRI says: we can only train conditions into the AI, not goals. We have no idea how to put a goal in the AI, and the problem is that if you train a very smart system with conditions only, and it picks up some arbitrary goal along the way, you end up not getting what you wanted. It seems that if we could get the AI to care about corrigibility and non-deception robustly, at the goal level, we would have solved a lot of the problem that MIRI is worried about.