My latest & greatest project proposal, in case people want to know what I'm doing, or give me money. There will likely be a LessWrong post up soon where I explain my thoughts in a friendlier way.
Over the next year I propose to study the development and determination of values in RL & supervised learning agents, and to expand the experimental methods & theory of singular learning theory (a theory of supervised learning) to the reinforcement learning case.
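For readers who haven't encountered singular learning theory: its central result is Watanabe's asymptotic expansion of the Bayesian free energy, which replaces the classical parameter-count penalty of regular model theory with the learning coefficient (the real log canonical threshold, or RLCT). As a rough sketch, under Watanabe's technical conditions:

```latex
% Asymptotic Bayes free energy of a (possibly singular) model at sample size n:
%   F_n      -- Bayes free energy (negative log marginal likelihood)
%   L_n(w_0) -- empirical negative log likelihood at an optimal parameter w_0
%   \lambda  -- learning coefficient (real log canonical threshold, RLCT)
%   m        -- multiplicity of \lambda
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_p(1)
```

For regular models, \lambda = d/2 (half the parameter count), recovering the BIC; for singular models like neural networks, \lambda can be much smaller, which is why the RLCT rather than the raw parameter count measures effective complexity. This expansion is derived for the supervised (i.i.d.) setting, which is what extending SLT to reinforcement learning would have to go beyond.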
All arguments for why we should expect AI to pose an existential risk rely on AIs having values which are different from ours. If we could build a good, empirically & mathematically grounded theory of how values develop during training, we could construct a training story in which we could have high confidence of producing an inner-aligned AI. I also find it likely that reinforcement learning will make a comeback in some fashion as a significant component of training AIs, and such a world is much more worrying than one in which we just continue with our almost entirely supervised training regime.
However, previous work in this area is not only sparse but also either solely theoretical or solely empirical, with few attempts or plans to bridge the gap. Such a bridge is necessary to achieve the goals above with confidence.
I think I am personally suited to tackle this problem: I have already been working on this project for the past 6 months, I have prior experience in ML research, and I have extensive knowledge of a wide variety of areas of applied math.
I also believe that, even given my limited requests for resources, I'll be able to make claims which apply to a wide variety of RL setups, since it has generally been the case in ML that the difference between scales is just that: scale. Combined with a strong theoretical component, this will let me say when my conclusions hold and when they don't.
And here is that post.