a possible research direction which i don’t know if anyone has explored: what would a training setup which provably creates a (probably[1]) aligned system look like?
my current intuition, which is not good evidence here beyond elevating the idea from noise, is that such a training setup might somehow leverage the fact that the training data and the subsequent agent's perceptions/evidence stream are sampled from the same world, albeit with different sampling procedures. for example, the training program could intake both a dataset and an outer-alignment-goal-function, and select for prediction of the dataset (to build up ability) while also doing something else to the AI-in-training; i have no idea what that something else would look like (and figuring it out seems like most of this problem).
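to make the shape of this concrete, here is a minimal sketch (in pytorch-flavoured python) of the interface i'm imagining, not a proposal. every name in it is made up for illustration, and `unspecified_pressure` is a placeholder for exactly the 'something else' i don't know how to fill in.

```python
# illustrative sketch only: the interface of the imagined training program.
import torch


def unspecified_pressure(model, outer_goal_fn):
    # placeholder for the unknown 'something else' applied to the AI-in-training.
    # here it contributes nothing; the open problem is what should go here so
    # that the trained system (probably) ends up pursuing outer_goal_fn.
    return torch.tensor(0.0)


def train_under_goal(model, dataset, outer_goal_fn, steps=1000, lr=1e-3):
    """intakes a dataset and an outer-alignment-goal-function; selects for
    prediction of the dataset (ability) plus the unspecified extra term."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _, (x, y) in zip(range(steps), dataset):
        prediction_loss = torch.nn.functional.mse_loss(model(x), y)
        loss = prediction_loss + unspecified_pressure(model, outer_goal_fn)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

the only point of the sketch is that the training program sees both the dataset and the goal function; everything interesting is hidden in the placeholder.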
has this been thought about before? is this feasible? why or why not?
(i can clarify if any part of this is not clear.)
(background motivator: in case there is no finite-length general purpose search algorithm[2], alignment may have to be achieved in trained systems / learners)
[1] (because in principle, it's possible to get unlucky with sampling for the dataset. compare: it's possible for an unlucky sequence of evidence to cause an agent to take actions which are counter to its goal.)
[2] by which i mean a program capable of finding, for any given criterion, something which meets it whenever at least one such thing exists (or outputting 'undecidable' in self-referential edge cases)
For a provably aligned (or probably aligned) system you need a formal specification of alignment. Do you have something in mind for that? This could be a major difficulty.
But maybe you only want to “prove” inner alignment and assume that you already have an outer-alignment-goal-function, in which case defining alignment is probably easier.
correct, i’m imagining these being solved separately